
Introduction to

Computational Data Science

Using ScalaTion


John A. Miller
Department of Computer Science
University of Georgia


February 10, 2024

Brief Table of Contents

1 Introduction to Data Science 33


1.1 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.2 ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.3 A Data Science Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.4 Additional Textbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

I Foundations 47

2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

4 Data Management 117


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.2 Relational Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.3 Columnar Relational Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.4 SQL-Like Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5 Data Preprocessing 141


5.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Methods for Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.3 Imputation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.4 Align Multiple Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 Creating Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

II Modeling 149

6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

8 Classification: Continuous Variables 299


8.1 Gaussian Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.2 Simple Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.4 Simple Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.5 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.6 K-Nearest Neighbors Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
8.7 Decision Tree C45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
8.8 Bagging Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.9 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.10 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.11 Neural Network Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

9 Generalized Linear Models and Regression Trees 335


9.1 Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.3 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
9.4 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9.5 Linear Model Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.6 Random Forest Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.7 Gradient Boosting Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.9 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351

10 Nonlinear Models and Neural Networks 353


10.1 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.2 Simple Exponential Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.3 Exponential Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
10.5 Multi-Output Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.6 Two-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.7 Three-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.8 Multi-Hidden Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.9 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.10 1D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418

10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436

11 Time Series/Temporal Models 439


11.1 Forecaster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
11.2 Baseline Models: Random Walk, Null and Trend Models . . . . . . . . . . . . . . . . . . . . . 446
11.3 Simple Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
11.4 Auto-Regressive (AR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
11.5 Moving-Average (MA) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
11.6 Auto-Regressive, Moving Average (ARMA) Models . . . . . . . . . . . . . . . . . . . . . . . . 473
11.7 Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.8 ARIMA (Integrated) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.9 SARIMA (Seasonal) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495

12 Multivariate and Nonlinear Time Series 497


12.1 Auto-Regressive with eXogenous variables (ARX) Models . . . . . . . . . . . . . . . . . . . . 498
12.2 SARIMAX Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
12.3 Vector Auto-Regressive (VAR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.4 Nonlinear Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.5 Recurrent Neural Networks (RNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.6 Gated Recurrent Unit (GRU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
12.7 Minimal Gated Unit (MGU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
12.8 Long Short Term Memory (LSTM) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.9 Encoder-Decoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.10 Transformer Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533

13 Dimensionality Reduction 539


13.1 Reducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
13.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.3 Autoencoder (AE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543

14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564

III Simulation 567

15 Simulation Foundations 569


15.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
15.2 Types of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.3 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.4 Random Variate Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.5 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
15.6 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.7 Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
15.8 Tableau-Oriented Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599

16 State Space Models 603


16.1 Example: Trajectory of a Ball in One-Dimensional Space . . . . . . . . . . . . . . . . . . . . 604
16.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
16.3 Dynamic Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
16.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.5 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.6 ODE Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638

17 Event-Oriented Models 639


17.1 A Taxonomy/Ontology for Simulation Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 639
17.2 List Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
17.3 Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
17.4 Event Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664

18 Process-Oriented Models 667


18.1 Base Traits and Classes for Process-Oriented Models . . . . . . . . . . . . . . . . . . . . . . . 668
18.2 Concurrent Processing of Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.3 Process Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
18.4 Agent-Based Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
18.5 Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700

19 Simulation Output Analysis 705


19.1 Point and Interval Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
19.2 One-Shot Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
19.3 Simulation Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
19.4 Method of Independent Replications (MIR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
19.5 Method of Batch Means (MBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720

Appendices 721

A Optimization in Data Science 723
A.1 Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
A.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
A.5 Stochastic Gradient Descent with Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
A.6 SGD with ADAptive Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765

B Graph Databases and Analytics 767


B.1 Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769
B.2 A Graph Database with Relational Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
B.3 Property Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
B.4 Special Types of Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
B.5 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.6 Exercises - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
B.7 Graph Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.8 Graph Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.9 Graph Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.10 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
B.11 Exercises - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808

Contents

1 Introduction to Data Science 33


1.1 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.2 ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.2.1 Package Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.2.2 Scala 3 Control Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.2.3 Scala 3 Top-Level Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2.4 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.2.5 Basic Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.2.6 Collection Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.2.7 ScalaTion: Vectors, Matrices and Tensors . . . . . . . . . . . . . . . . . . . . . . . . 40
1.3 A Data Science Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.4 Additional Textbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

I Foundations 47

2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.1 Vector Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.2 Element-wise Multiplication and Division . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.3 Vector Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.4 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.5 Vector Operations in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.1 Gradient Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Jacobian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.3 Hessian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.1 Matrix Operation in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.1 Three Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.2 Four Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.1 Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.1 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.2 Continuous Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Probability Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.1 Continuous Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.2 Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.1 Expectation is a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.2 Variance is not a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.3 Convolution of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.1 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.2 Quantile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.1 Discrete Case: Joint and Marginal Mass . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.2 Continuous Case: Joint and Marginal Density . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.3 Discrete Case: Conditional Mass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.4 Continuous Case: Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.8.6 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.8.7 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11.1 Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

3.11.2 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.11.3 Estimation for Discrete Outcomes/Responses . . . . . . . . . . . . . . . . . . . . . . . 97
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.12.1 Positive Log Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.12.2 Joint Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.3 Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.4 Relative Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.5 Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.12.6 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.12.7 Probability Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

4 Data Management 117


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.1.1 Analytics Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.1.2 The Tabular Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.2 Relational Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2.1 Data Definition Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2.2 Data Manipulation Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.2.3 Relational Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.2.4 Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2.5 Persistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.2.6 Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.2.7 Table Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.2.8 LTable Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.2.9 VTable Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.3 Columnar Relational Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.3.1 Data Definition Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.3.2 Data Manipulation Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.3.3 Columnar Relational Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.3.4 Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.3.5 Relation Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.4 SQL-Like Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.4.1 Relation Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.4.2 Sample Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.4.3 RelationSQL Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5 Data Preprocessing 141
5.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.1 Remove Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.2 Convert String Columns to Numeric Columns . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.3 Identify Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.4 Preliminary Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Methods for Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.1 Based on Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.2 Based on InterQuartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.3 Based on Quantiles/Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3 Imputation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 Imputation Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.4 Align Multiple Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 Creating Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

II Modeling 149

6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.1.1 Predictor Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.1 Fit Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3.5 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4.5 SimplerRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.5 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

6.5.6 SimpleRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.6.4 Matrix Inversion Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.6.5 LU Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.6.6 Cholesky Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.6.7 QR Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.8 Use of Factorization in Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.9 Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.6.10 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.11 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.6.12 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.6.13 Regression Problem: Texas Temperatures . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.6.14 Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.16 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.4 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.5 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.7.6 Comparing RidgeRegression with Regression . . . . . . . . . . . . . . . . . . . . . . 204
6.7.7 RidgeRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.7.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.3 Optimization Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.4 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.5 Regularized and Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.6 LassoRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.2 Comparison of quadratic and Regression . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.3 SymbolicRegression.quadratic Method . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.9.4 Quadratic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.9.5 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

6.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.2 Comparison of cubic, quadratic and Regression . . . . . . . . . . . . . . . . . . . . 222
6.10.3 SymbolicRegression.cubic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.10.4 Cubic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.1 Sample Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.2 As a Data Science Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.3 SymbolicRegression Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.11.4 Implementation of the apply Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.11.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.12.3 Square Root Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.12.4 Log Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.12.5 Reciprocal Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
6.12.6 Box-Cox Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.7 Quality of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.8 TranRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.1 Handling Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.2 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.3 RegressionCat Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.4 RegressionCat Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.2 Root Absolute Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.3 RegressionWLS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.2 PolyRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.3 PolyORegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.2 TrigRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

6.16.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.1.1 Classifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.2.1 FitC Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.3.1 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.1 Factoring the Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.2 Estimating Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.4.3 Laplace Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
7.4.4 Table Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.5 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.6 The test Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.7 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.8 The lpredictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.9 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.10 NaiveBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.5.1 BayesClassifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.6.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.2 Conditional Probability Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.3 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7.6.4 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.5 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.6 TANBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.7.1 Network Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.1 Markov Blanket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.2 Factoring the Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.2 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.3 Early Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.4 DecisionTree Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.5 DecisionTree ID3 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289

7.9.6 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.9.7 DecisionTree ID3wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.10.2 Forward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
7.10.3 Backward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.4 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.6 Reestimation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.7 HiddenMarkov Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

8 Classification: Continuous Variables 299


8.1 Gaussian Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.1.1 NaiveBayesR Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.2 Simple Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.2.1 mtcars Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.2.2 Logistic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.2.3 Logit Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.2.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
8.2.5 Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
8.2.6 Log-likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
8.2.7 Computation in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
8.2.8 Making a Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
8.2.9 SimpleLogisticRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
8.2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
8.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.3.1 LogisticRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
8.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
8.4 Simple Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.4.1 SimpleLDA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.4.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.5 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.5.1 LDA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.5.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.6 K-Nearest Neighbors Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
8.6.1 Lazy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
8.6.2 KNN Classifier Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
8.6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
8.7 Decision Tree C45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
8.7.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

8.7.2 DecisionTree C45 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
8.7.3 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.4 DecisionTree C45wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.8 Bagging Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.1 Creating Subsample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.3 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.8.4 BaggingTrees Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.9 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.1 Extracting Sub-features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.3 RandomForest Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.10 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.1 Separating Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.2 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
8.10.3 Running the Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.4 SupportVectorMachine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.11 Neural Network Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.2 Training Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.3 Prediction Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.5 NeuralNet Class 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333

9 Generalized Linear Models and Regression Trees 335


9.1 Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.2.1 Akaike Information Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.2.2 MLE for Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.3 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
9.3.1 PoissonRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.4 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9.4.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9.4.2 Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
9.4.3 Determining Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
9.4.4 RegressionTree Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
9.5 Linear Model Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.5.1 Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.5.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.5.3 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

9.5.4 RegressionTreeMT class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.6 Random Forest Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.6.1 RegressionTreeRF Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.7 Gradient Boosting Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.7.1 RegressionTreeGB Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.9 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351

10 Nonlinear Models and Neural Networks 353


10.1 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.1.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.1.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
10.1.4 Use of the Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
10.1.5 NonlinearRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
10.1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
10.2 Simple Exponential Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.2.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.2.4 Linearization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
10.2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
10.3 Exponential Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10.3.1 ExpRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
10.4 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
10.4.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
10.4.2 Ridge Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
10.4.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
10.4.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
10.4.5 Example Calculation for ε and δ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
10.4.6 Initializing Weights/Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
10.4.7 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
10.4.8 Basic Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
10.4.9 Perceptron Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
10.4.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
10.5 Multi-Output Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.5.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
10.5.3 PredictorMV Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.5.4 RegressionMV Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
10.5.5 Optimizer Object and Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
10.5.6 NetParam Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
10.6 Two-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

10.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
10.6.5 NeuralNet 2L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.6.6 NeuralNet 2L Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.7 Three-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.2 Ridge Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
10.7.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.5 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
10.7.6 train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.7 Stochastic Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.8 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
10.7.9 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.10 NeuralNet 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
10.8 Multi-Hidden Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.4 Number of Nodes in Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.5 Avoidance of Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.6 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.7 NeuralNet XL Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
10.9 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.10 1D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
10.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.10.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.10.5 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
10.10.6 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.7 Two Convolutional Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.8 CNN 1D Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
10.10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.1 Filtering Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.2 Pooling Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
10.11.3 Flattening Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

10.11.4 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.1 Definition of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.2 Type of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
10.12.3 NeuralNet XLT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
10.12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.4 ELM 3L1 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
10.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437

11 Time Series/Temporal Models 439


11.1 Forecaster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
11.1.1 Stats4TS Case Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
11.1.2 Auto-Correlation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
11.1.3 Correlogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
11.1.4 Quality of Fit (QoF) for Time Series Data . . . . . . . . . . . . . . . . . . . . . . . . . 444
11.2 Baseline Models: Random Walk, Null and Trend Models . . . . . . . . . . . . . . . . . . . . . 446
11.2.1 Random Walk Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
11.2.2 White Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
11.2.3 Detecting Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
11.2.4 RandomWalk Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
11.2.5 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
11.2.6 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
11.2.7 Trend Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
11.2.8 TrendModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
11.2.9 Forecasting Lake Levels - Battle of the Baselines . . . . . . . . . . . . . . . . . . . . . 448
11.2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
11.3 Simple Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
11.3.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
11.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
11.3.3 Effect of the Smoothing Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
11.3.4 SimpleExpSmoothing Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
11.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
11.4 Auto-Regressive (AR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
11.4.1 AR(1) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.4.2 AR(p) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
11.4.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
11.4.4 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

11.4.5 AR Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.5 Moving-Average (MA) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
11.5.1 MA(q) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
11.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
11.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
11.6 Auto-Regressive, Moving Average (ARMA) Models . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.1 Selection Based on ACF and PACF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
11.6.3 ARMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
11.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
11.7 Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.1 1-Fold Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.2 Rolling Validation and the Forecasting Matrix . . . . . . . . . . . . . . . . . . . . . . 478
11.7.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
11.8 ARIMA (Integrated) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.1 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.3 Backshift Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.4 Stationarity Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
11.8.5 ARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
11.8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
11.9 SARIMA (Seasonal) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.1 Determination of the Seasonal Period . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.2 Seasonal Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.3 Seasonal AR and MA Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.4 Case Study: COVID-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.5 SARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
11.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495

12 Multivariate and Nonlinear Time Series 497


12.1 Auto-Regressive with eXogenous variables (ARX) Models . . . . . . . . . . . . . . . . . . . . 498
12.1.1 The ARX(p) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
12.1.2 The ARX(p, [a, b]) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
12.1.3 The ARX(p, n, [a, b]) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
12.1.4 Determining the Exogenous Lag Interval [a, b] . . . . . . . . . . . . . . . . . . . . . . . 499
12.1.5 Time Series Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
12.1.6 ARXA (p, n, k) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
12.1.7 ARXA MV Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
12.1.8 ARX Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
12.1.9 ARX MV Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
12.1.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
12.2 SARIMAX Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503

12.2.1 Model Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
12.2.2 SARIMAX Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.3 Vector Auto-Regressive (VAR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.1 VAR(p, 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.2 VAR(p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.4 VAR Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.5 AR∗ (p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.4 Nonlinear Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.1 Nonlinear Autoregressive (NAR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.2 Autoregressive Neural Network (ARNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.3 Nonlinear Autoregressive, Moving-Average (NARMA) . . . . . . . . . . . . . . . . . . 509
12.5 Recurrent Neural Networks (RNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.1 RNN(1, 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.2 RNN(p, nh ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
12.5.3 RNN(p, nh , nv ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
12.5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
12.6 Gated Recurrent Unit (GRU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
12.6.1 A GRU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
12.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
12.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
12.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
12.7 Minimal Gated Unit (MGU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
12.8 Long Short Term Memory (LSTM) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
12.9 Encoder-Decoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.9.1 Simple Encoder-Decoder Consisting of Two GRU Cells . . . . . . . . . . . . . . . . . . 530
12.9.2 Teacher Forcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.3 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
12.10 Transformer Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.2 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
12.10.3 Encoder-Decoder Architecture for Transformers . . . . . . . . . . . . . . . . . . . . . . 536
12.10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
12.10.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538

13 Dimensionality Reduction 539
13.1 Reducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
13.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
13.3 Autoencoder (AE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.2 Denoising Autoencoder (DEA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543

14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.1 KNN Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.1 Initial Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.2 Reassignment of Points to Closest Clusters . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.4 KMeansClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
14.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.4.1 Adjusted Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.2 KMeansClusteringHW Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.1 Picking Initial Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.2 KMeansClustererPP Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.2 ClusteringPredictor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.1 HierClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.1 MarkovClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564

III Simulation 567

15 Simulation Foundations 569


15.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571

15.2 Types of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.2.1 Example: Modeling an M/M/1 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.3 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.1 Example RNG: Random0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.2 Testing Random Number Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
15.3.3 Example RNG: Random3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.4 Random Variate Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.1 Inverse Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.2 Convolution Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
15.4.3 Acceptance-Rejection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.5 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
15.5.1 Generating a Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
15.5.2 Generating a Non-Homogeneous Poisson Process . . . . . . . . . . . . . . . . . . . . . 585
15.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
15.6 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.1 Simulation of Card Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.2 Integral of a Complex Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
15.6.3 Grain Dropping Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
15.6.4 Simulation of the Monty Hall Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.7 Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
15.7.1 Little’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
15.7.2 Event Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
15.7.3 Spreadsheet Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.8 Tableau-Oriented Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.1 Iterating through Tableau Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.2 Reproducing the Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.3 Customized Logic/Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.4 Tableau.scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
15.8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602

16 State Space Models 603


16.1 Example: Trajectory of a Ball in One-Dimensional Space . . . . . . . . . . . . . . . . . . . . 604
16.1.1 Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
16.1.2 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.1.3 Trajectory Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
16.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
16.2.1 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
16.2.2 Reducible Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
16.2.3 Limiting/Steady-State Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609

16.2.4 MarkovChain Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.5 Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.6 Limiting/Steady-State Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
16.2.7 MarkovChainCT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.8 Queueing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.9 MMc Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.10 MMcK Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
16.3 Dynamic Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
16.3.1 Example: Traffic Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
16.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
16.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.1 Example: Golf Ball Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
16.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
16.5 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.2 Example: SEIHRD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
16.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
16.6 ODE Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638

17 Event-Oriented Models 639


17.1 A Taxonomy/Ontology for Simulation Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 639
17.2 List Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
17.2.1 FCFS Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
17.2.2 LCFS Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
17.2.3 Priority Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
17.2.4 Time Advance Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
17.3 Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
17.3.1 Event Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
17.3.2 Example: Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
17.3.3 Example: Call Center Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
17.3.4 Entity Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
17.3.5 WaitQueue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
17.3.6 WaitQueue LCFS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
17.3.7 Model Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
17.3.8 Example: Machine Shop Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
17.4 Event Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
17.4.1 Example: Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
17.4.2 EventNode Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
17.4.3 CausalLink Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664

18 Process-Oriented Models 667
18.1 Base Traits and Classes for Process-Oriented Models . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.1 Identifiable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.2 Locatable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.3 Modelable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.4 Temporal Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.2 Concurrent Processing of Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.1 Java’s Thread Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.2 ScalaTion’s Coroutine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
18.3 Process Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
18.3.1 Model Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
18.3.2 Component Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.3 Example: BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.4 Executing the Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.5 Network Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.6 Comparison to Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.7 SimActor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
18.3.8 Source Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
18.3.9 Sink Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
18.3.10 Transport Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.11 Resource Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.12 WaitQueue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
18.3.13 WaitQueue LCFS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.14 Junction Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.15 Gate Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
18.3.16 Route Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.17 Model Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.18 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
18.3.19 Model MBM Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.3.20 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.4 Agent-Based Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
18.4.1 SimAgent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
18.4.2 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.3 Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.4 Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
18.4.5 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
18.4.6 Hybrid Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.5 Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.1 2D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.2 3D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
18.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703

19 Simulation Output Analysis 705
19.1 Point and Interval Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
19.2 One-Shot Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
19.3 Simulation Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
19.3.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
19.3.2 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
19.4 Method of Independent Replications (MIR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
19.4.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
19.4.2 Example: MIR Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
19.5 Method of Batch Means (MBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.1 Effect of Increasing the Number of Batches . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.2 Effect on Batch Correlation of Increasing the Batch Size . . . . . . . . . . . . . . . . . 716
19.5.3 MBM versus MIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
19.5.4 Relative Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.5.5 Example: MBM Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720

Appendices 721

A Optimization in Data Science 723


A.1 Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.1.1 Basic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.1.2 Chain Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.1.3 Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
A.1.4 Generalized Chain Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
A.1.5 calculus Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
A.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.2.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.2.2 Reverse Mode Backward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.2.3 Example Calculation for Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
A.2.4 Example for Three-Layer Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 731
A.2.5 Partial Derivatives w.r.t. B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 731
A.2.6 Partial Derivatives w.r.t. A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
A.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
A.3.1 Line Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
A.3.2 Application to Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
A.3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
A.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
A.4.1 Using SGD to Train Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
A.5 Stochastic Gradient Descent with Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
A.5.1 Using SGDM to Train Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 741
A.5.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
A.6 SGD with ADAptive Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
A.6.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745

A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.1 Newton-Raphson Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.2 Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.3 BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
A.9.4 Limited Memory-BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
A.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
A.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.11.1 Active and Inactive Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.14.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.2 LassoAddm Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
A.15.1 NelderMeadSimplex Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
A.15.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766

B Graph Databases and Analytics 767


B.1 Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769
B.1.1 Adding Vertex Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
B.1.2 Adding Edge Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
B.1.3 Directed Multi-Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 773
B.1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775
B.2 A Graph Database with Relational Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
B.2.1 The GTable Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
B.2.2 Creating Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
B.2.3 Graph Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
B.2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
B.3 Property Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
B.3.1 Structure of a Property Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
B.3.2 Native Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 784
B.3.3 High-Level Query Language for Graph Databases . . . . . . . . . . . . . . . . . . . . . 786
B.3.4 Graph Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
B.3.5 Query Processing in Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
B.4 Special Types of Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792

B.4.1 Embedding Relationships in Vertex-Types . . . . . . . . . . . . . . . . . . . . . . . . . 792
B.4.2 Resource Description Framework (RDF) Graphs . . . . . . . . . . . . . . . . . . . . . 793
B.4.3 From Relational to Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
B.5 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.1 Type Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.2 Constraints and Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
B.5.3 KGTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
B.6 Exercises - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
B.7 Graph Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.1 Path Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.2 Centrality and Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.3 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.8 Graph Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.1 Graph Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.2 Subgraph Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.3 Graph Homomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.4 Application to Query Processing in Graph Databases . . . . . . . . . . . . . . . . . . 803
B.9 Graph Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.1 Matrix Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.2 Graph Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
B.10 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
B.10.1 AGGREGATE and COMBINE Operations . . . . . . . . . . . . . . . . . . . . . . . . 807
B.11 Exercises - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808

Preface

Applied Mathematics accelerated starting with the differential equations of Euler's analytical mechanics,
published in the early 1700s [45, 117]. Over time, increasingly accurate mathematical models of natural phenomena
were developed. The models are scrutinized by how well they match empirical data and related models.
Theories were developed that featured a collection of consistent, related models. In his theory of Universal
Gravity [132], Newton argues for the sufficiency of this approach, while others seek to understand the underlying
substructures and causal mechanisms [117].
Data Science can trace its lineage back to Applied Mathematics. One way to represent a mathematical
model is as a function f : Rⁿ → R.

y = f (x, b) + ε

This illustrates that a response variable y is functionally related to other predictive variables x (vector in
bold font). Uncertainty in the relationship is modeled as a random variable ε (blue font) that follows some
probability distribution.
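
To make the notation concrete, the following is a minimal sketch in plain Scala (not ScalaTion code) that
realizes y = f(x, b) + ε; the linear form chosen for f, the parameter values in b, and the Normal(0, 1) error
are illustrative assumptions only.

import scala.util.Random

object ModelSketch:

    val rng = new Random (42)                          // fixed seed for reproducibility

    val b = Array (2.0, -1.0, 0.5)                     // assumed parameter vector (illustrative values)

    // an assumed linear form for f: the dot product x . b
    def f (x: Array [Double], b: Array [Double]): Double =
        x.zip (b).map { case (xi, bi) => xi * bi }.sum

    // response y = f(x, b) + epsilon, with epsilon drawn from Normal (0, 1)
    def y (x: Array [Double]): Double = f (x, b) + rng.nextGaussian ()

    def main (args: Array [String]): Unit =
        val x = Array (1.0, 2.0, 3.0)
        println (s"y = ${y (x)} for x = ${x.mkString (", ")}")

end ModelSketch

In practice the parameters b are not assumed but estimated (trained) from data, which is the focus of the
modeling chapters.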
Making useful predictions, or even inferences such as whether one product lasts longer than another, is
clouded by this uncertainty. DeMoivre developed a limiting distribution for the Binomial Distribution.
Laplace derived a central limit theorem showing that the sample means from several distributions follow
this same distribution. Gauss [180] studied this uncertainty and deduced a distribution for measurement
errors from basic principles. This distribution is now known as the Gaussian or Normal distribution. Infer-
ences, such as which of two products has the longer expected lifetime, can now be made with a certain level of
confidence. Gauss also developed the method of least squares estimation.
Momentum in using probability distributions to analyze data, fit parameters and make inferences under
uncertainty led to mathematical statistics emerging from applied mathematics in the late 1800s. In par-
ticular, Galton and Pearson collected and transformed statistical techniques into a mathematical discipline
(e.g., Pearson correlation coefficient, method of moments estimation, p-value, Chi-square test, statistical
hypothesis testing, principal component analysis). In the early 1900s, Gosset and Fisher expanded mathe-
matical statistics (e.g., analysis of variance, design of experiments, maximum likelihood estimation, Student’s
t-distribution, F-distribution).
With the increasing capabilities of computers, the amount of data available for training models grew
rapidly. This led Computer Scientists into the fray, with the term machine learning coined in 1959 and data mining
beginning in the late 1980s. Machine Learning developed slowly over the decades until the invention of the
back-propagation algorithm for neural networks in the mid-1980s led to important advances. Data Mining
billed itself as finding patterns in data. Databases are often utilized, and data preprocessing is emphasized in
the sense that mining through large amounts of data should be done with care.

With greater computing capabilities and larger amounts of data, statistics and machine learning are
leaning toward each other: the emphasis is on developing accurate, interpretable and explainable models
for prediction, classification and forecasting. Data may also be clustered, and simulation models that mimic
phenomena or systems may be created. Training a model is typically done using an optimization algorithm
(e.g., gradient descent) to minimize the errors in the model’s predictions. These constitute the elements of
data science.
This book is an introduction to data science that includes mathematical and computational foundations.
It is divided into three parts: (I) Foundations, (II) Modeling, and (III) Simulation. A review of Optimization
from the point of view of data science is included in the Appendix. The level of the book is College Junior
through beginning Graduate Student. The ideal mathematical background includes Differential, Integral and
Vector Calculus, Applied Linear Algebra, Probability and Mathematical Statistics. The following advanced
topics may be found useful for Data Science: Differential Equations, Nonlinear Optimization, Measure
Theory, Functional Analysis and Differential Geometry. Data Science also involves Computer Programming,
Database Management, Data Structures and Algorithms. Advanced topics include Parallel Processing,
Distributed Systems and Big Data frameworks (e.g., Hadoop and Spark). This book has been used in the
Data Science I and Data Science II courses at the University of Georgia.

Chapter 1

Introduction to Data Science

1.1 Data Science


The field of Data Science can be defined in many ways. To its left is Machine Learning that emphasizes
algorithms for learning, while to its right is Statistics that focuses on procedures for estimating parame-
ters of models and determining statistical properties of those parameters. Both fields develop models to
describe/predict reality based on one or more datasets. Statistics has a greater interest in making inferences
or testing hypotheses based upon datasets. It also has a greater interest in fitting probability distributions
(e.g., are the residuals normally or exponentially distributed).
The common thread is modeling. A model should be able to make predictions (where is the hurricane likely to make landfall, when will the next recession occur, etc.). In addition, it may be desirable for a model to enhance the understanding of the system under study and to address what-if type questions (prescriptive analytics), e.g., how will traffic flow improve/degrade if a light-controlled intersection is replaced with a round-about.
A model may be viewed as a replacement for a real system, phenomenon or process. A model will map inputs into outputs, with the goal being that for a given input, the model will produce output that approximates the output that the real system would produce. In addition to inputs and outputs, some models include state information. For example, the output of a heat pump will depend on whether it is in the heating or cooling state (internally this determines the direction of flow of the refrigerant). Further, some types of models are intended to mimic the behavior of the actual system and facilitate believable animation. Examples of such models are simulation models. They support prescriptive analytics, which enables changes to a system to be tested on the model, before the often costly changes to the actual system are undertaken.
Broad categories of modeling depend on the type of output (also called response) of the model. When
the response is treated as a continuous variable, a predictive model (e.g., regression) is used. If the goal
is to forecast into the future (or there is dependency among the response values), a forecasting model
(e.g., ARIMA) is used. When the response is treated as a categorical variable, a classification model (e.g.,
support vector machine) is used. When the response values are largely missing, a clustering model may be
used. Finally, when values are missing from a data matrix, an imputation model (k-nearest neighbors) or
recommendation model (e.g., low-rank approximation using singular value decomposition) may be used.
Dimensionality reduction (e.g., principal component analysis) can be useful across categories.
Computational Data Science puts more emphasis on computational issues, such as optimization algorithms used for learning. Mathematical derivations are provided for the loss functions that are used to train the models. Short Scala code snippets are provided to illustrate how the algorithms work. The Scala object-oriented, functional language allows the creation of concise code that looks very much like the mathematical expressions. Modeling based on ordinary differential equations and simulation models is also provided.
The prerequisite material for data science includes Vector Calculus, Applied Linear Algebra and Calculus-
based Probability and Statistics. Datasets can be stored as vectors and matrices, learning/parameter esti-
mation often involves taking gradients, and probability and statistics are needed to handle uncertainty.

1.2 ScalaTion
ScalaTion supports multi-paradigm modeling that can be used for simulation, optimization and analytics.
In ScalaTion, the modeling package provides tools for performing data analytics. Datasets are becom-
ing so large that statistical analysis or machine learning software should utilize parallel and/or distributed
processing. Databases are also scaling up to handle greater amounts of data, while at the same time increas-
ing their analytics capabilities beyond the traditional On-Line Analytic Processing (OLAP). ScalaTion
provides many analytics techniques found in tools like MATLAB, R and Weka. The analytics component
contains six types of tools: predictors, classifiers, forecasters, clusterers, recommenders and reduc-
ers. A trait is defined for each type.
To use ScalaTion, go to the Website http://www.cs.uga.edu/~jam/scalation.html and click on the
most recent version of ScalaTion and follow the first three steps: download, unzip, build.
Current projects are targeting Big Data Analytics in four ways: (i) use of sparse matrices, (ii) parallel
implementations using Scala’s support for parallelism (e.g., .par methods, parallel collections and actors),
(iii) distributed implementations using Akka, and (iv) high performance data stores including columnar
databases (e.g., Vertica), document databases (e.g., MongoDB), graph databases (e.g., Neo4j) and distributed
file systems (e.g., HDFS).

1.2.1 Package Structure


The package structure of ScalaTion is divided into four modules. Each module can be independently compiled (e.g., package directories can be removed or made inaccessible using the chmod 000 command).
The core module of ScalaTion consists of the following packages:

1. scalation - general utilities for the rest of ScalaTion packages.

2. mathstat - vectors, matrices, tensors and basic statistics

3. random - random number and random variate generators

4. scala2d - basic UI widgets and 2D graphics

The intermediate module of ScalaTion consists of the following packages:

1. animation - general purpose animation code

2. database - basic implementations for relational and graph databases

The modeling module of ScalaTion consists of the following packages:

1. calculus - derivatives, gradients, Jacobians, Hessians and integrals

2. modeling - regression models with sub-packages for classification, clustering, neural networks, and time
series

3. optimization - derivative-free, first-order, and second-order optimizers

The simulation module of ScalaTion consists of the following packages:

1. dynamics - differential equations: ODE and PDE solvers

2. simulation - multiple simulation engines

The scala3d package is under development.

1.2.2 Scala 3 Control Structures


This section gives the Scala 3 control structures along with their Python equivalents:

• if
    // Scala 3
    if x < y then
        x += 1
    else if x > y then
        y += 1
    else
        x += 1
        y += 1
    end if

    # Python
    if x < y:
        x += 1
    elif x > y:
        y += 1
    else:
        x += 1
        y += 1

The else and end are optional, as are the line breaks. Note, the x += 1 shortcut simply means x =
x + 1 for both languages.

• match
    // Scala 3
    z = c match
        case '+' => x + y
        case '-' => x - y
        case '*' => x * y
        case '/' => x / y
        case _   => println ("not supported")

    # Python
    match c:
        case '+': z = x + y
        case '-': z = x - y
        case '*': z = x * y
        case '/': z = x / y
        case _:   print ("not supported")

In Scala 3, the case may be indented like Python. Also an end may be added.

• while
    // Scala 3
    while x <= y do
        x += 0.5
    end while

    # Python
    while x <= y:
        x += 0.5

The end is optional, as are the line breaks.

• for
    // Scala 3
    for i <- 0 until 10 do
        a(i) = 0.5 * i~^2
    end for

    # Python
    for i in range (0, 10):
        a[i] = 0.5 * i**2

The end is optional, as are the line breaks. Note: for i <- 0 to 10 do will include 10, while until will stop at 9. Both Scala and Python support other varieties of for. The for-yield collects all the computed values into a.

    val a = for i <- 0 until 10 yield 0.5 * i~^2

• cfor
    var i = 0
    cfor (i < 10, i += 1) {
        a(i) = 0.5 * i~^2
    } // cfor

This for follows more of a C-style, provides improved efficiency and allows returns inside the loop. It
is defined as follows:
    inline def cfor (pred: => Boolean, step: => Unit) (body: => Unit): Unit =
        while pred do { body; step }
    end cfor

• try
    // Scala 3
    try
        file = new File ("myfile.csv")
    catch
        case ex: FileNotFoundException => println ("not found")
    end try

    # Python
    try:
        x = 1 / 0
    except ZeroDivisionError:
        print ("division by zero")

The end is optional. Both languages support a finally clause, and Python provides a shortcut with statement that comes in handy for opening files and automatically closing them at the end of the statement's scope.

• assign with if
    // Scala 3
    val y = if x < 1 then sqrt (x) else x~^2

    # Python
    y = sqrt (x) if x < 1 else x**2

All Scala control structures return values and so can be used in assignment statements. Note, prefix
sqrt with math for Python.

Note, the end tags are optional since Scala 3 uses significant indentation like Python.

1.2.3 Scala 3 Top-Level Functions


Both Scala 3 and Python support top level functions as well as methods inside classes. Here are functions
to compute the length of the hypotenuse of a right triangle with lengths a and b.
    import scala.math.sqrt

    def hypotenuse (a: Double, b: Double): Double =
        sqrt (a~^2 + b~^2)

Optionally, an end hypotenuse may be added and is often useful for functions which include several lines
of code. The Python code below is very similar, with the exception of the exponentiation operator ~^ for
ScalaTion and ** in Python. Outside of Scalation import scalation.~^. Both Double in Scala and
float in Python indicate 64-bit floating point numbers.

    import math

    def hypotenuse (a: float, b: float) -> float:
        return math.sqrt (a**2 + b**2)

The dot product operator on vectors is used extensively in data science. It multiplies all the elements
in the two vectors and then sums the products. An implementation in ScalaTion is given followed by a
similar implementation in Python that includes type annotations for improved readability and type checking.
    import scalation.mathstat.VectorD

    def dot (x: VectorD, y: VectorD): Double =
        (x * y).sum

    import numpy as np

    def dot (x: np.ndarray, y: np.ndarray) -> float:
        return float (np.sum (x * y))

Note, see the Chapter on Linear Algebra for more efficient implementations of dot product. Also, both
numpy.ndarray and VectorD directly provide dot product.
    val z = x dot y          // Scala

    z = x.dot (y)            # Python

In cases where the arguments are 2D arrays, np.dot is the same as matrix multiplication (x @ y) and for
scalars it is simple multiplication (x * y). ScalaTion supports several forms of multiplication for both
vectors and matrices (see the Linear Algebra Chapter).

Executable Top-Level Functions

Executable top-level functions can also be defined in similar ways in both Scala 3 and Python.
    // Scala 3
    @main def hello (): Unit =
        val message = "Hello Data Science"
        println (s"message = $message")

    # Python
    def main () -> None:
        message = "Hello Data Science"
        print ("message =", message)

1.2.4 Classes
Defining a class is a good way to combine a data structure with its natural operations. The class will consist of fields/attributes for maintaining data and methods for retrieving and updating the data.
An example of a class in Scala 3 is the Complex class that supports complex numbers (e.g., 2.1 + 3.2i) and operations on complex numbers such as the + method. Of course, the actual implementation provides many methods (see scalation.mathstat.Complex).
    //  @param re  the real part (e.g., 2.1)
    //  @param im  the imaginary part (e.g., 3.2)

    class Complex (re: Double, im: Double = 0.0)
          extends Fractional [Complex] with Ordered [Complex]:

        def + (c: Complex): Complex = Complex (re + c.re, im + c.im)

Notice that the second argument im provides a default value of 0.0, so the class can be instantiated using either one or two arguments/parameters.

    val c1 = new Complex (2.1, 3.2)
    var c2 = new Complex (2.0)

Also observe that the first variable c1 cannot be reassigned, as it is declared val, while the second variable c2 can be, as it is declared var. Finally, notice that the Complex class extends both Fractional and Ordered. These are traits that the class Complex inherits. Some of the functionality (e.g., method implementations) can be provided by the trait itself. The class must implement those that are not implemented or override the ones with implementations to customize their behavior, if need be. Classes can extend several traits (multiple inheritance), but may only extend one class (single inheritance).
Although Python already has a built-in complex type, one could imagine coding a Complex class as follows:

    class Complex:
        def __init__ (self, re: float, im: float = 0.0):
            self.re = re
            self.im = im

        def __add__ (self, c: 'Complex') -> 'Complex':
            return Complex (self.re + c.re, self.im + c.im)

Notice there are a few differences: The constructor for Scala is any code at the top level of the class and arguments to the constructor are given in the class definition, while Python has an explicit constructor called __init__. Scala has an implicit reference to the instance object called this, while Python has an explicit reference to the instance object called self. Furthermore, naming the method __add__ makes it so + can be used to add two complex numbers. Another difference (not shown here) is that fields/attributes as well as methods in Scala can be made internal using the private access modifier. In Python, a code convention of having the first character of an identifier be an underscore (_) indicates that it should not be used externally.

1.2.5 Basic Types

The basic data-types in Scala are integer types: Byte (8 bits), Short (16), Int (32) and Long (64), floating
point types: Float (32) and Double (64), character types: Char (single quotes) and String (double quotes),
and Boolean.
Corresponding Python data types are integer types: int (unlimited), floating point types: float32 (32)
and float (64), complex (128), character types: str (single or double quotes), and bool.
There are many operators that can be applied to these data-types; see https://docs.scala-lang.org/tour/operators.html for the precedence of the operators. ScalaTion adds a few itself, such as ~^ for exponentiation. Also, ScalaTion provides complex numbers via the Complex class in the mathstat package.

1.2.6 Collection Types
The most commonly used collection types in Scala are Array, ArrayBuffer, Range, List, Map, Set, and
Tuple. The Python rough equivalents (in lower case) are on the right (Map becomes dict).
    // Scala                                    # Python
    val a = Array.ofDim [Double] (10)           a = np.zeros (10)
    val b = ArrayBuffer (2, 3, 3)
    val r = 0 until 10                          r = range (10)
    val l = List (2, 3, 3)                      l = [2, 3, 3]
    val m = Map ("no" -> 0, "yes" -> 1)         m = {"no": 0, "yes": 1}
    val s = Set (1, 2, 3, 5, 7)                 s = {1, 2, 3, 5, 7}
    val t = (firstName, lastName)               t = (firstName, lastName)

For more collection types consult their documentation: https://scala-lang.org/api/3.x/ for Scala and
https://docs.python.org/3/library/collections.html for Python. Scala typically has mutable and
immutable versions of most collection types.

1.2.7 ScalaTion: Vectors, Matrices and Tensors


It is easy to make vectors, matrices and tensors in ScalaTion, via the VectorD, MatrixD, and TensorD
classes provided in the mathstat package. The following is a vector (1D array) consisting of 9 Doubles,
corresponding to float in Python.
    val y = VectorD (1, 2, 4, 7, 9, 8, 6, 5, 3)

A matrix is a 2D array, that in this case is a 9-by-2 matrix holding two variables/features x0 and x1 in
columns of the matrix.
    //                       col0  col1
    val x = MatrixD ((9, 2), 1, 8,        // row 0
                             2, 7,        // row 1
                             3, 6,        // row 2
                             4, 5,        // row 3
                             5, 5,        // row 4
                             6, 4,        // row 5
                             7, 4,        // row 6
                             8, 3,        // row 7
                             9, 2)        // row 8

As practice, try to find a vector b of length/dimension 2, so that x * b is close to y. The * operator does
matrix-vector multiplication. It takes the dot product of the ith row of matrix x and vector b to obtain the
ith element in the resulting vector.
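As a sketch of how one might evaluate a candidate (the particular values of b below are just a guess, for illustration only), the residual vector and its sum of squared errors can be computed with the operations shown above:

    import scalation.mathstat.{MatrixD, VectorD}

    val b   = VectorD (1.0, 0.5)                // hypothetical candidate parameter vector
    val yp  = x * b                             // predicted response: matrix-vector product
    val e   = y - yp                            // residual/error vector
    val sse = e dot e                           // sum of squared errors
    println (s"yp = $yp, sse = $sse")

Smaller sse means x * b is closer to y.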
In Python, numpy arrays can be used to do the same thing. The following 1D array can represent a
vector. Note the use of the period in "1." to make the elements floating point numbers (the "D" in VectorD indicates Double elements in ScalaTion).

    y = np.array ([1., 2., 4., 7., 9., 8., 6., 5., 3.])

Using double square brackets “[[”, numpy can be used to represent matrices. Each “[ ... ]” corresponds to a
row in the matrix.
    #              col0  col1
    x = np.array ([[1., 8.],      # row 0
                   [2., 7.],      # row 1
                   [3., 6.],      # row 2
                   [4., 5.],      # row 3
                   [5., 5.],      # row 4
                   [6., 4.],      # row 5
                   [7., 4.],      # row 6
                   [8., 3.],      # row 7
                   [9., 2.]])     # row 8

Matrix-vector multiplication is similar in Python: x.dot (b).


The following is a ScalaTion tensor (3D array).
    //                          4 rows, 3 columns, 2 sheets - x_ijk
    //                                         row  columns  sheet
    val z = TensorD ((4, 3, 2), 1,  2,  3,     // 0  0,1,2    0
                                4,  5,  6,     // 1  0,1,2    0
                                7,  8,  9,     // 2  0,1,2    0
                                10, 11, 12,    // 3  0,1,2    0

                                13, 14, 15,    // 0  0,1,2    1
                                16, 17, 18,    // 1  0,1,2    1
                                19, 20, 21,    // 2  0,1,2    1
                                22, 23, 24)    // 3  0,1,2    1

In Python, the above tensor can be defined as a 3D numpy array. Each row and column position has two
sheet values, e.g., ”[1., 13.]”.
    #               column 0    column 1    column 2
    z = np.array ([[[1., 13.],  [2., 14.],  [3., 15.]],    # row 0
                   [[4., 16.],  [5., 17.],  [6., 18.]],    # row 1
                   [[7., 19.],  [8., 20.],  [9., 21.]],    # row 2
                   [[10., 22.], [11., 23.], [12., 24.]]])  # row 3

Vectors, matrices and tensors will be discussed in greater detail in the Linear Algebra Chapter.

1.3 A Data Science Project
The orientation of this textbook is that of developing modeling techniques and the understanding of how
to apply them. A secondary goal is to explain the mathematics behind the models in sufficient detail to
understand the algorithms implementing the modeling techniques. Concise code based on the mathematics
is included and explained in the textbook. Readers may drill down to see the actual ScalaTion code.
The textbook is intended to facilitate trying out the modeling techniques as they are learned and to
support a group-based term project that includes the following ten elements. The term project is to culminate
in a presentation that explains what was done concerning these ten elements.

1. Problem Statement. Imagine that your group is hired as consultants to solve some problem for a
company or government agency. The answers and recommendations that your group produces should
not depend solely on prior knowledge, but rather on sophisticated analytics performed on multiple
large-scale datasets. In particular, the study should be focused and the purpose of the study should be clearly stated. What not to do: The following datasets are relevant to the company, so we ran them
through an analytics package (e.g., R) and obtained the following results.

2. Collection and Description of Datasets. To reduce the chances of results being relevant only to a
single dataset, multiple datasets should be used for the study (at least two). Explanation must be given
to how each dataset relates to the other datasets as well as to the problem statement. When a dataset is in the form of a matrix, metadata should be collected for each column/variable. In some cases the response column(s)/variable(s) will be obvious; in others it will depend on the purpose of the study. Initially, the rest of the columns/variables may be considered as features that may be useful in predicting responses. Ideally, the datasets should be loaded into a well-designed database. ScalaTion provides two
high-performance database systems: a relational database system and a graph database system
in scalation.database.table and scalation.database.graph, respectively.

3. Data Preprocessing Techniques Applied. During the preprocessing phase (before the modeling
techniques are applied), the data should be cleaned up. This includes elimination of features with zero
variance or too many missing values, as well as the elimination of key columns (e.g., on the training
data, the employee-id could perfectly predict the salary of an employee, but is unlikely to be of any
value in making predictions on the test data). For the remaining columns, strings should be converted
to integers and imputation techniques should be used to replace missing values.

4. Visual Examination. At this point, Exploratory Data Analysis (EDA) should be applied. Com-
monly, one column of a dataset in the combined data matrix will be chosen as the response column,
call it the response vector y, and the rest of the columns that remain after preprocessing form m-by-n
data matrix X. In general models are of the form

y = f (x) +  (1.1)

where f is a function mapping feature vector x into a predicted value for response y. The last term may be viewed as random error ε. In an ideal model, the last term will be pure error (e.g., white noise). Since most models are approximations, technically the last term should be referred to as a residual (that which is not explained by the model). During exploratory data analysis, the value of y should be plotted against each feature/column x:j of data matrix X. The relationships between the columns should

be examined by computing a correlation matrix. Two columns that are very highly correlated are
supplying redundant information, and typically, one should be removed. For a regression type problem,
where y is treated as continuous random variable, a simple linear regression model should be created
for each feature xj ,

y = b0 + b1 xj +  (1.2)

where the parameters b = [b0 , b1 ] are to be estimated. The line generated by the model should be
plotted along with the {(xij, yi)} data points. Visually, look for patterns such as white noise, linear relationship, quadratic relationship, etc. Plotting the residuals {(xij, εi)} will also be useful. One should also create Histograms and Box-Plots for each variable as well as consider removing outliers.

5. Modeling Techniques Chosen. For every type of modeling problem, there is the notion of a NullModel: For prediction it is to guess the mean, i.e., given a feature vector z, predict the value E [y],
regardless of the value of z. The coefficient of determination R2 for such models will be zero. If a
more sophisticated model cannot beat the NullModel, it is not helpful in predicting or explaining the
phenomena. Projects should include four classes of models: (i) NullModel, (ii) simple, easy to explain
models (e.g., Multiple Linear Regression), (iii) complex, performant models (e.g., Quadratic Regression, Extreme Learning Machines), and (iv) complex, time-consuming models (e.g., Neural Networks). If classes
(ii-iv) do not improve upon class (i) models, new datasets should be collected. If this does not help, a
new problem should be sought. On the flip side, if class (ii) models are nearly perfect (R2 close to 1),
the problem being addressed may be too simple for a term project. At least one modeling technique
should be chosen from each class.

6. Explanation of Why Techniques Were Chosen. As a consultant to a company, a likely question


will be, "why did you choose those particular modeling techniques?" There are an enormous number of
possible modeling techniques. Your group should explain how the candidate techniques were narrowed
down and ultimately how the techniques were chosen. A review of how well the selected modeling
techniques worked, as well as suggested changes for future work, should be given at the end of the
presentation.

7. Feature Selection. Although feature selection can occur during multiple phases in a modeling study,
an overview should be given at this point in the presentation. Explain which features were eliminated
and why they were eliminated prior to building the models. During model building, what features
were eliminated, e.g., using forward selection, backward elimination, Lasso Regression, dimensionality
reduction, etc. Also address and quantify the relative importance of the remaining features. Explain how categorical features are handled.

8. Reporting of Results. First the experimental setup should be described in sufficient detail to
facilitate reproducibility of your results. One way to show overall results is to plot predicted responses
ŷ and actual responses y versus the instance index i = 0 until m. Reports are to include the Quality
of Fit (QoF) for the various models and datasets in the form of tables, figures and explanation of the
results. Besides the overall model, for many modeling techniques the importance/significance of model
parameters/variables may be assessed as well. Tables and figures must include descriptive captions
and color/shape schemes should be consistent across figures.

9. Interpretation of Results. With the results clearly presented, they need to be given insightful
interpretations. What are the ramifications of the results? Are the modeling techniques useful in
making predictions, classifications or forecasts?

10. Recommendations of Study. The organization that hired your group would like some take home
messages that may result in improvements to the organization (e.g., what to produce, what processes
to adapt, how to market, etc.). A brief discussion of how the study could be improved (possibly leading
to further consulting work) should be given.

1.4 Additional Textbooks
More detailed development of this material can be found in textbooks on statistical learning, such as

• “An Introduction to Statistical Learning” (ISL) [85]

• “The Elements of Statistical Learning” (ESL) [72]

• “Mathematics for Machine Learning” (MML) [37]

See Table 1.1 for a mapping between the chapters in the four textbooks.

Table 1.1: Source Material Chapter Mappings

Topic ScalaTion ISL ESL MML


Linear Algebra Ch. 2 - - Ch. 2-5
Probability Ch. 3 - - Ch. 6
Data Management Ch. 4 - - -
Data Preprocessing Ch. 5 - - -
Prediction Ch. 6 Ch. 3, 5, 6 Ch. 3 Ch 8-9
Classification Ch. 7 Ch. 2, 5, 8 Ch. 4, 12, 13, 15 Ch. 8.5
Classification: Continuous Ch. 8 Ch. 4, 8, 9 Ch. 4, 12, 13, 15 Ch. 12
Generalized Linear Models Ch. 9 - - -
Nonlinear Models/Neural Networks Ch. 10 Ch. 7 Ch. 11 -
Time Series/Temporal Models Ch. 11 - - -
Multivariate Time Series Models Ch. 12 - - -
Dimensionality Reduction Ch. 13 Ch. 6, 10 Ch. 14 Ch. 10
Clustering Ch. 14 Ch. 10 Ch. 14 -
Simulation Foundations Ch. 15 - - -
State Space Models Ch. 16 - - -
Event-Oriented Models Ch. 17 - - -
Process-Oriented Models Ch. 18 - - -
Simulation Output Analysis Ch. 19 - - -
Optimization in Data Science Appendix A - -- Ch. 7
Graph Databases and Analytics Appendix C - - -

The next two chapters serve as quick reviews of the two principal mathematical foundations for data
science: linear algebra and probability.

Part I

Foundations

Chapter 2

Linear Algebra

Data science and analytics make extensive use of linear algebra. For example, let yi be the income of the ith
individual and xij be the value of the j th predictor/feature (age, education, etc.) for the ith individual. The
responses (outcomes of interest) are collected into a vector y, the values for predictors/features are collected
in a matrix X and the parameters/coefficients b are fit to the data.

2.1 Linear System of Equations


The study of linear algebra starts with solving systems of equations, e.g.,

y0 = x00 b0 + x01 b1
y1 = x10 b0 + x11 b1

This linear system has two equations with two variables having unknown values, b0 and b1 . Such linear
systems can be used to solve problems like the following: Suppose a movie theatre charges 10 dollars per
child and 20 dollars per adult. The evening attendance is 100, while the revenue is 1600 dollars. How many
children (b0 ) and adults (b1 ) were in attendance?

100 = 1b0 + 1b1


1600 = 10b0 + 20b1

The solution is b0 = 40 children and b1 = 60 adults.
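As a quick check, here is a sketch of solving this small system with the ScalaTion mathstat classes introduced later in this chapter (the matrix inverse route is used only for illustration; factorization is preferred for larger systems):

    import scalation.mathstat.{MatrixD, VectorD}

    val x = MatrixD ((2, 2),  1,  1,          // attendance equation
                             10, 20)          // revenue equation
    val y = VectorD (100, 1600)
    val b = x.inverse * y                     // b = X^-1 y = [40, 60]
    println (s"b = $b")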


In general, linear systems may be written using matrix notation.

y = Xb (2.1)

where y is an m-dimensional vector, X is an m-by-n dimensional matrix and b is an n-dimensional vector.

2.2 Matrix Inversion
If the matrix is of full rank with m = n, then the unknown vector b may be uniquely determined by
multiplying both sides of the equation by the inverse of X, X −1

b = X −1 y (2.2)

Multiplying matrix X and its inverse X −1 , X −1 X results in an n-by-n identity matrix In = [1i=j ], where
the indicator function 1i=j equals 1 when i = j and 0 otherwise.
A faster and more numerically stable way to solve for b is to perform Lower-Upper (LU) Factorization.
This is done by factoring matrix X into lower L and upper U triangular matrices.

X = LU (2.3)

Then LU b = y, so multiplying both sides by L−1 gives U b = L−1 y. Taking an augmented matrix

    [ 1   3 | 1 ]
    [ 2   1 | 7 ]

and performing row operations to make it upper right triangular has the effect of multiplying by L−1. In this case, the first row multiplied by -2 is added to the second row to give

    [ 1   3 | 1 ]
    [ 0  −5 | 5 ]

From this, backward substitution can be used to determine b1 = −1 and then that b0 = 4, i.e., b = [4, −1].
In cases where m > n, the system may be overdetermined, and no solution will exist. Values for b are
then often determined to make y and Xb agree as closely as possible, e.g., minimize absolute or squared
differences.
Vector notation is used in this book, with vectors shown in boldface and matrices in uppercase. Note,
matrices in ScalaTion are in lowercase, since by convention, uppercase indicates a type, not a variable.
ScalaTion supports vectors and matrices in its mathstat package. A commonly used operation is the dot
(inner) product, x · y, or in ScalaTion, x dot y.

2.3 Vector
A vector may be viewed as a point in multi-dimensional space, e.g., in three space, we may have

x = [x0, x1, x2] = [0.57735, 0.57735, 0.57735]

y = [y0, y1, y2] = [1.0, 1.0, 0.0]

where x is a point on the unit sphere and y is a point in the plane determined by the first two coordinates.

2.3.1 Vector Addition and Subtraction


Vectors may be added (x + y) and subtracted (x − y). For example, [1, 2] + [3, 4] = [4, 6].

2.3.2 Element-wise Multiplication and Division


Vectors may be multiplied element-by-element (like a Hadamard product) (x ∗ y), and divided element-by-
element (x/y). These operations are also supported when one of the arguments is a scalar.

2.3.3 Vector Dot Product


A particularly important operation, the dot product (or inner product) of two vectors is simply the sum of
the products of their elements.

x · y = ∑_{i=0}^{n−1} xi yi = 1.1547        (2.4)

Note, the inner product applies more generally, e.g., ⟨x, y⟩ may be applied when x and y are infinite sequences or functions. See the exercises for the definition of an inner product space.

2.3.4 Norm
The norm of a vector is its length. Assuming Euclidean distance, the norm is
‖x‖ = √( ∑_{i=0}^{n−1} xi² ) = 1        (2.5)

The norm of y is √2. If θ is the angle between the x and y vectors, then the dot product is the product of their norms and the cosine of the angle.

x · y = ‖x‖ ‖y‖ cos(θ)        (2.6)

Thus, the cosine of θ is

cos(θ) = x · y / (‖x‖ ‖y‖) = 1.1547 / (1 · √2) = 0.8165

so the angle θ = .616 radians. Vectors x and y are orthogonal if the angle θ = π/2 radians (90 degrees).
In general there are ℓp norms. The two that are used here are the ℓ2 norm ‖x‖ = ‖x‖2 (Euclidean distance) and the ℓ1 norm ‖x‖1 (Manhattan distance).

‖x‖1 = ∑_{i=0}^{n−1} |xi|        (2.7)
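A sketch of these norm and angle calculations in ScalaTion (assuming, as the VectorD method list later in this chapter suggests, that norm returns the ℓ2 norm):

    import scalation.mathstat.VectorD

    val x = VectorD (0.57735, 0.57735, 0.57735)
    val y = VectorD (1.0, 1.0, 0.0)

    val dxy   = x dot y                              // dot product = 1.1547
    val cosXY = dxy / (x.norm * y.norm)              // cosine of the angle = 0.8165
    val theta = math.acos (cosXY)                    // angle in radians = 0.616
    println (s"dxy = $dxy, cosXY = $cosXY, theta = $theta")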

Vector notation facilitates concise mathematical expressions. Many common statistical measures for
populations or samples can be given in vector notation. For an m dimensional vector (m-vector) the following
may be defined.

μ(x) = μx = (1 · x) / m

σ²(x) = σx² = (x − μx) · (x − μx) / m = x · x / m − μx²

σ(x, y) = σx,y = (x − μx) · (y − μy) / m = x · y / m − μx μy

ρ(x, y) = ρx,y = σx,y / (σx σy)

which are the population mean, variance, covariance and correlation, respectively.
The size of the population is m, which corresponds to the number of elements in the vector. A vector of all ones is denoted by 1. For an m-vector, ‖1‖² = 1 · 1 = m. Note, the sample mean uses the same formula, while the sample variance and covariance divide by m − 1, rather than m (sample indicates that only some fraction of the population is used in the calculation).
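These population formulas can be written almost verbatim with VectorD dot products; a minimal sketch (ScalaTion also provides built-in mean and variance methods, so these helpers are only illustrative):

    import scalation.mathstat.VectorD

    def meanPop (x: VectorD): Double = x.sum / x.dim

    def varPop (x: VectorD): Double =
        val z = x - meanPop (x)
        (z dot z) / x.dim

    def covPop (x: VectorD, y: VectorD): Double =
        ((x - meanPop (x)) dot (y - meanPop (y))) / x.dim

    def corrPop (x: VectorD, y: VectorD): Double =
        covPop (x, y) / math.sqrt (varPop (x) * varPop (y))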
Vectors may be used for describing the motion of an object through space over time. Let u(t) be the
location of an object (e.g., golf ball) in three dimensional space R3 at time t,

u(t) = [x(t), y(t), z(t)]

To describe the motion, let v(t) be the velocity at time t, and a be the constant acceleration, then according
to Newton’s Second Law of Motion,

u(t) = u(0) + v(0) t + ½ a t²

The time varying function u(t) will show the trajectory of the golf ball over time.
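A sketch of this trajectory computation using VectorD (the initial position, velocity and acceleration values below are made up purely for illustration):

    import scalation.mathstat.VectorD

    val u0 = VectorD (0.0, 0.0, 0.0)                 // initial position u(0)
    val v0 = VectorD (50.0, 0.0, 30.0)               // initial velocity v(0)
    val a  = VectorD (0.0, 0.0, -9.8)                // constant acceleration (gravity)

    def u (t: Double): VectorD = u0 + v0 * t + a * (0.5 * t * t)

    for t <- 0 to 6 do println (s"u($t) = ${u(t.toDouble)}")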

2.3.5 Vector Operations in ScalaTion


Vector operations are implemented by multiple classes, such as the VectorD class.
    //  @param dim  the dimension/size of the vector
    //  @param v    the 1D array used to store vector elements

    class VectorD (val dim: Int,
                   private [mathstat] var v: Array [Double] = null)
          extends IndexedSeq [Double]
             with PartiallyOrdered [VectorD]
             with Cloneable [VectorD]
             with DefaultSerializable:

VectorD includes methods for size, indices, set, copy, filter, select, concatenate, vector arithmetic, power, square, reciprocal, abs, sum, mean, variance, rank, cumulate, normalize, dot, norm, max, min, mag, argmax, argmin, indexOf, indexWhere, count, contains, sort and swap.

Table 2.1: Vector Arithmetic Operations

op vector op vector vector op scalar vector element op scalar


+ def + (b: VectorD): VectorD def + (s: Double): VectorD def + (s: (Int, Double)): VectorD
+= def += (b: VectorD): VectorD def += (s: Double): VectorD -
- def - (b: VectorD): VectorD def - (s: Double): VectorD def - (s: (Int, Double)): VectorD
-= def -= (b: VectorD): VectorD def -= (s: Double): VectorD -
* def * (b: VectorD): VectorD def * (s: Double): VectorD def * (s: (Int, Double)): VectorD
*= def *= (b: VectorD): VectorD def *= (s: Double): VectorD -
/ def / (b: VectorD): VectorD def / (s: Double): VectorD def / (s: (Int, Double)): VectorD
/= def /= (b: VectorD): VectorD def /= (s: Double): VectorD -

2.4 Vector Calculus
Data science uses optimization to fit parameters in models, where for example a quality of fit measure (e.g.,
sum of squared errors) is minimized. Typically, gradients are involved. In some cases, the gradient of the
measure can be set to zero allowing the optimal parameters to be determined by matrix factorization. For
complex models, this may not work, so an optimization algorithm that moves in the direction opposite to
the gradient can be applied.

2.4.1 Gradient Vector


Consider the following function f : R2 → R of vector u = [x, y],

f (u) = (x − 2)2 + (y − 3)2

For example, the functional value at the point [3, 2], f ([3, 2]) = 1 + 1 = 2 and at the point [1, 1], f ([1, 1]) =
1 + 4 = 5. The following contour curves illustrate how the elevation of the function increases with distance from the point [2, 3].

Figure 2.1: Contour Curves for Function f: elevation = 1, 2, 3, 4 and 5

The gradient of function f consists of a vector formed from the two partial derivatives.

grad f = ∇f = [∂f/∂x, ∂f/∂y]

The gradient evaluated at point/vector u ∈ R2 is

∇f(u) = [∂f/∂x (u), ∂f/∂y (u)]
The gradient indicates the direction of steepest increase/ascent. For example, the gradient at the point [3, 2],
∇f ([3, 2]) = [2, −2] (in blue), while at [1, 1], ∇f ([1, 1]) = [−2, −4] (in purple).

A gradient’s norm indicates the magnitude of the rate of change (or steepness). When the elevation
changes are fixed (here they differ by one), the closeness of the contours curves also indicates steepness.
Notice that the gradient vector at point [x, y] is orthogonal to the contour curve intersecting that point.
By setting the gradient equal to zero, in this case

∂f/∂x = 2(x − 2)
∂f/∂y = 2(y − 3)

one may find the vector that minimizes function f , namely u = [2, 3] where f = 0. For more complex
functions, repeatedly moving in the opposite direction to the gradient, may lead to finding a minimal value.
In general, the gradient (or gradient vector) of function f : Rn → R is
 
∇f = ∂f/∂x = [∂f/∂x0, . . . , ∂f/∂xn−1]        (2.8)

or evaluated at point/vector x ∈ Rn is

∇f(x) = ∂f/∂x (x) = [∂f/∂x0 (x), . . . , ∂f/∂xn−1 (x)]        (2.9)

In data science, it is often convenient to take the gradient of a dot product of two functions of x, in which
case the following product rule can be applied.

∇(f (x) · g(x)) = ∇f (x) · g(x) + f (x) · ∇g(x) (2.10)
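To make the idea of repeatedly moving opposite to the gradient concrete, here is a minimal gradient descent sketch for f(u) = (x − 2)² + (y − 3)², using only VectorD operations (the step size eta and the number of iterations are arbitrary choices for illustration):

    import scalation.mathstat.VectorD

    def gradF (u: VectorD): VectorD = VectorD (2 * (u(0) - 2), 2 * (u(1) - 3))

    var u   = VectorD (3.0, 2.0)                  // starting point
    val eta = 0.1                                 // step size (learning rate)
    for it <- 1 to 50 do u = u - gradF (u) * eta  // move opposite to the gradient
    println (s"u = $u")                           // approaches the minimum at [2, 3]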

2.4.2 Jacobian Matrix


The Jacobian Matrix is an extension of the gradient vector to the case where the value of the function is
multi-dimensional, i.e., f = [f0 , f1 , . . . , fm−1 ]. In general, the Jacobian of function f : Rn → Rm of vector
x ∈ Rn is
 
Jf(x) = [∂fi/∂xj]  for 0 ≤ i < m, 0 ≤ j < n        (2.11)

i.e., the matrix whose rows are the gradients ∇f0(x), ∇f1(x), . . . , ∇fm−1(x).
This follows the numerator layout where the functions correspond to rows (the opposite is called the denom-
inator layout which is the transpose of the numerator layout).
Consider the following function f : R2 → R2 that maps vectors in R2 into other vectors in R2 .

f (x) = [(x0 − 2)2 + (x1 − 3)2 , (2x0 − 6)2 + (3x1 − 6)2 ]

The Jacobian of the function, Jf (x), is

    [ ∂f0/∂x0   ∂f0/∂x1 ]
    [ ∂f1/∂x0   ∂f1/∂x1 ]

Taking the partial derivatives (with the chain rule for the second component) gives the following Jacobian matrix.

    [ 2x0 − 4     2x1 − 6  ]
    [ 8x0 − 24   18x1 − 36 ]

2.4.3 Hessian Matrix


While the gradient is a vector of first partial derivatives, the Hessian is a symmetric matrix of second partial
derivatives. The Hessian Matrix of a scalar-valued function f : Rn → R of vector x ∈ Rn is

Hf(x) = [∂²f / (∂xi ∂xj)]  for 0 ≤ i < n, 0 ≤ j < n        (2.12)

Consider the following function f : R2 → R that maps vectors in R2 into scalars in R.

f (x) = (2x0 − 6)2 + (3x1 − 6)2

The Hessian of the function, Hf(x), is

    [ ∂²f/∂x0²        ∂²f/(∂x0 ∂x1) ]
    [ ∂²f/(∂x1 ∂x0)   ∂²f/∂x1²      ]

Taking the second partial derivatives gives the following Hessian matrix.

    [ 8    0 ]
    [ 0   18 ]
Consider a differentiable function of n variables, f : Rn → R. The points at which its gradient vector ∇f is zero are referred to as critical points. In particular, they may be local minima, local maxima or saddle points/inconclusive, depending on whether the Hessian matrix H is positive definite, negative definite, or otherwise. A symmetric matrix A is positive definite if x⊤Ax > 0 for all x ≠ 0 (alternatively, all of A's eigenvalues are positive). Note: a positive/negative semi-definite Hessian matrix may or may not indicate an optimal (minimal/maximal) point.
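For the two-variable case, a critical point can be classified from the 2-by-2 Hessian using its determinant and top-left entry; a small plain-Scala sketch (the values in the call correspond to the example function above):

    // classify a critical point from a 2-by-2 Hessian [ [h00, h01], [h10, h11] ]
    def classify (h00: Double, h01: Double, h10: Double, h11: Double): String =
        val det = h00 * h11 - h01 * h10
        if det > 0 && h00 > 0 then      "local minimum"
        else if det > 0 && h00 < 0 then "local maximum"
        else if det < 0 then            "saddle point"
        else                            "inconclusive"

    println (classify (8, 0, 0, 18))    // local minimum for the example function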

2.5 Matrix
A matrix may be viewed as a collection of vectors, one for each row in the matrix. Matrices may be used to
represent linear transformations

f : Rn → Rm        (2.13)

that map vectors in Rn to vectors in Rm. For example, in ScalaTion an m-by-n matrix A with m = 3
rows and n = 2 columns may be created as follows:
    val a = MatrixD ((3, 2), 1, 2,
                             3, 4,
                             5, 6)

to produce matrix A.
 
    [ 1  2 ]
    [ 3  4 ]
    [ 5  6 ]
Matrix A will transform u vectors in R2 into v vectors in R3 .

Au = v (2.14)
For example,

 
" # 5
1
A = 11
 
2
17
ScalaTion supports retrieval of row vectors, column vectors and matrix elements. In particular, the
following access operations are supported.

A = a = matrix
ai = a(i) = row vector i
a:j = a(?, j) = column vector j
aij = a(i, j) = the element at row i and column j
Ai:k,j:l = a(i to k, j to l) = row and column matrix slice

Note, i to k does not include k. Common operations on matrices are supported as well.

2.5.1 Matrix Operation in ScalaTion


Matrix operations in ScalaTion are implemented in the MatrixD class for dense matrices.
1 @ param dim the first ( row ) dimension of the matrix
2 @ param dim2 the second ( column ) dimension of the matrix
3 @ param v the 2 D array used to store matrix elements
4

5 class MatrixD ( val dim : Int ,


6 val dim2 : Int ,
7 private [ mathstat ] var v : Array [ Array [ Double ]] = null ) :

Matrix Addition and Subtraction

Matrix addition val c = a + b

cij = aij + bij (2.15)

and matrix subtraction val c = a - b are supported.

Matrix Multiplication

A frequently used operation in data science is matrix multiplication val c = a * b.

cij = ∑_{k=0}^{n−1} aik bkj = ai · b:j        (2.16)

Mathematically, this is written as C = AB. The ij element in matrix C is the vector dot product of the ith
row of A with the j th column of B.
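A sketch of this definition written directly with ScalaTion's row and column accessors (far less efficient than the built-in * operator, but it mirrors the formula; the result is assembled in a plain 2D array before wrapping it in a MatrixD):

    import scalation.mathstat.*

    def multiply (a: MatrixD, b: MatrixD): MatrixD =
        val c = Array.ofDim [Double] (a.dim, b.dim2)
        for i <- 0 until a.dim; j <- 0 until b.dim2 do
            c(i)(j) = a(i) dot b(?, j)                      // c_ij = a_i . b_:j
        new MatrixD (a.dim, b.dim2, c)
    end multiply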

Matrix Transpose
The transpose of matrix A, written A⊤ (val t = a.transpose or val t = a.T), simply exchanges the roles of rows and columns.
    def transpose: MatrixD =
        val a = Array.ofDim [Double] (dim2, dim)
        for j <- indices do
            val v_j = v(j)
            var i = 0
            cfor (i < dim2, i += 1) { a(i)(j) = v_j(i) }
        end for
        new MatrixD (dim2, dim, a)
    end transpose

Matrix Determinant

The determinant of square (m = n) matrix A, written |A| (val d = a.det), indicates whether a matrix is
singular or not (and hence invertible), based on whether the determinant is zero or not.

Trace of a Matrix

The trace of matrix A ∈ Rn×n is simply the sum of its diagonal elements.

tr(A) = ∑_{i=0}^{n−1} aii        (2.17)

In ScalaTion, the trace is computed using the trace method (e.g., a.trace).

Matrix Dot Product

ScalaTion provides several types of dot products on both vectors and matrices, two of which are shown
below. The first method computes the usual dot product between two vectors. Note, the parameter y is
generalized to take any vector-like data type.
    def dot (y: IndexedSeq [Double]): Double =
        var sum = 0.0
        for i <- v.indices do sum += v(i) * y(i)
        sum
    end dot

When relevant, an n-vector (e.g., x ∈ Rn) may be viewed as an n-by-1 matrix (column vector), in which case x⊤ would be viewed as a 1-by-n matrix (row vector). Consequently, dot product (and outer product) can be defined in terms of matrix multiplication and transpose operations.

x · y = x⊤ y        dot (inner) product        (2.18)
x ⊗ y = x y⊤        outer product               (2.19)

The second method takes the dot product of two matrices. It extends the notion to matrices and is an efficient way to compute A⊤B = A · B = a.transpose * b = a dot b.
    def dot (y: MatrixD): MatrixD =
        if dim2 != y.dim then
            flaw ("dot", s"matrix dot matrix - incompatible cross dimensions: dim2 = $dim2, y.dim = ${y.dim}")

        val a = Array.ofDim [Double] (dim, y.dim)
        for ii <- 0 until dim by TSZ do
            for jj <- 0 until y.dim2 by TSZ do
                for kk <- 0 until dim2 by TSZ do
                    val k2 = math.min (kk + TSZ, dim2)

                    for i <- ii until math.min (ii + TSZ, dim) do
                        val v_i = v(i); val a_i = a(i)
                        for j <- jj until math.min (jj + TSZ, y.dim2) do
                            val y_j = y.v(j)
                            var sum = 0.0
                            var k = kk
                            cfor (k < k2, k += 1) { sum += v_i(k) * y_j(k) }
                            a_i(j) += sum
                        end for
                    end for

                end for
            end for
        end for
        new MatrixD (dim, y.dim, a)
    end dot

2.6 Matrix Factorization
Many problems in data science involve matrix factorization to, for example, solve linear systems of equations or perform Ordinary Least Squares (OLS) estimation of parameters. ScalaTion supports several factorization techniques, including those shown in Table 2.2.

Table 2.2: Matrix Factorization Techniques

Factorization   Factors        Factor 1                 Factor 2                   Class

LU              A = LU         lower left triangular    upper right triangular     Fac_LU
Cholesky        A = LL⊤        lower left triangular    its transpose              Fac_Cholesky
QR              A = QR         orthogonal               upper right triangular     Fac_QR
SVD             A = UΣV⊤       orthogonal               diagonal, orthogonal       Fac_SVD
Eigen           A = QΛQ⁻¹      eigenvectors             diagonal, inverse eigen    Fac_Eigen

These algorithms are faster or more numerically stable than algorithms for matrix inversion. See the Pre-
diction chapter to see how matrix factorization is used in Ordinary Least Squares estimation.
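As a preview, here is a sketch of solving the normal equations (X⊤X) b = X⊤y with Fac_LU, reusing only the calls shown in the exercises at the end of this chapter; x and y are assumed to be the small data matrix and response vector from Section 1.2.7:

    import scalation.mathstat.{MatrixD, VectorD, Fac_LU}

    val xtx = x.transpose * x                       // X^T X (2-by-2)
    val xty = x.transpose * y                       // X^T y (2-vector)
    val lu  = new Fac_LU (xtx)
    val b   = lu.factor ().solve (xty)              // least squares parameter vector b
    println (s"b = $b")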

2.6.1 Eigenvalues and Eigenvectors


Consider the following matrix A and two vectors x = [1, 0]⊤ and z = [0, 1]⊤.

    A = [ 2  0 ]
        [ 0  3 ]

Multiplying A and x yields [2, 0]⊤, while multiplying A and z yields [0, 3]⊤. Thus, letting λ0 = 2 and λ1 = 3, we
   

see that Ax = λ0 x and Az = λ1 z. In general, a matrix An×n of rank r will have r non-zero eigenvalues λi
with corresponding eigenvector x(i) such that

Ax(i) = λi x(i) (2.20)

In other words, there will be r unit eigenvectors, for which multiplying by the matrix simply rescales the
eigenvector x(i) by its eigenvalue λi . The same will happen for any non-zero vector in alignment with one of
the r unit eigenvectors.
Given an eigenvalue λi , an eigenvector may be found by noticing that

Ax(i) − λi x(i) = [A − λi I] x(i) = 0        (2.21)

Any vector in the nullspace of the matrix A − λi I is an eigenvector corresponding to λi . Note, if the above
equation is transposed, it is called a left eigenvalue problem (see the section on Markov Chains).
In low dimensions, the eigenvalues may be found as roots of the characteristic polynomial/equation
derived from taking the determinant of A − λi I. Software like ScalaTion, however, uses iterative algorithms that convert a matrix into Hessenberg and tridiagonal forms.
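A quick numerical check of the defining relation Ax = λx for the example matrix above, using only matrix-vector multiplication (a sketch; in general ScalaTion's Fac_Eigen class computes the eigenvalues and eigenvectors):

    import scalation.mathstat.{MatrixD, VectorD}

    val a = MatrixD ((2, 2), 2, 0,
                             0, 3)
    val x = VectorD (1, 0)                      // eigenvector for λ0 = 2
    val z = VectorD (0, 1)                      // eigenvector for λ1 = 3
    println (s"A x = ${a * x}")                 // [2, 0] = 2 x
    println (s"A z = ${a * z}")                 // [0, 3] = 3 z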

2.7 Internal Representation
The current internal representation used for storing the elements in a dense matrix is Array [Array [Double]]
in row major order (row-by-row). Depending on usage, operations may be more efficient using column ma-
jor order (column-by-column). Also, using a one dimensional array Array [Double] mapping (i, j) to the
k th location may be more efficient. Furthermore, having operations access through sub-matrices (blocks)
may improve performance because of caching efficiency or improved performance for parallel and distributed
versions.
The mathstat package provides several classes implementing multiple types of vectors and matrices as
shown in Table 2.3 including VectorD and MatrixD.

Table 2.3: Types of Vectors and Matrices: Implementing Classes

trait VectorD MatrixD


dense VectorD MatrixD
sparse SparseVectorD SparseMatrixD
compressed RleVectorD RleMatrixD
tridiagonal - SymTriMatrixD
bidiagonal - BidMatrixD

The suffix ‘D’ indicates the base element type is Double. There are also implementations for Complex ‘C’,
Int ‘I’, Long ‘L’, Rational ‘Q’, Real ‘R’, String ‘S’, and TimeNum ‘T’.
Note, ScalaTion 2.0 currently only supports dense vectors and matrices. See older versions for the
other types of vectors and matrices.
ScalaTion supports many operations involving matrices and vectors, including those shown in Table 2.4.

Table 2.4: Types of Vector and Matrix Products

Product Method Example in Math


vector dot def dot (y: VectorD): Double x dot y x·y
vector element-wise def * (y: VectorD): VectorD x*y xy
vector outer def outer (y: VectorD): MatrixD x outer y x⊗y
matrix mult def * (y: MatrixD): MatrixD x*y XY
matrix mdot def dot (y: MatrixD): MatrixD x dot y X⊤ Y
matrix vector def * (y: VectorD): VectorD x*y Xy
matrix vector def * (y: VectorD): MatrixD x* y X diag(y)

2.8 Tensor
Loosely speaking, a tensor is a generalization of scalar, vector and matrix. The order of the tensor indicates the number of dimensions. In this text, tensors are treated as hyper-matrices and issues such as basis inde-
pendence, contravariant and covariant vectors/tensors, and the rules for index notation involving super and
subscripts are ignored [111]. To examine the relationship between order 2 tensors and matrices more deeply,
see the last exercise.
For data science, input into a model may be a vector (e.g., simple regression, univariate time series), a
matrix (e.g., multiple linear regression, neural networks), a tensor with three dimensions (e.g., monochro-
matic/greyscale images), and a tensor with four dimensions (e.g., color images).

Table 2.5: Tensors of Different Orders

Order Analog/Name Example


zeroth scalar FICA score
first vector customer financial record
second matrix collection of financial records
third tensor collection of grayscale images
fourth tensor4 collection of color images

2.8.1 Three Dimensional Tensors


In ScalaTion, tensors with three dimensions are supported by the TensorD class. The default names for
the dimensions [111] were chosen to follow a common convention (row, column, sheet). In data science, the
first index usually indicates which instance, e.g., ith element of a vector, ith row of a matrix, ith row of a
tensor.
    //  @param dim   size of the 1st level/dimension (row) of the tensor (height)
    //  @param dim2  size of the 2nd level/dimension (column) of the tensor (width)
    //  @param dim3  size of the 3rd level/dimension (sheet) of the tensor (depth)

    class TensorD (val dim: Int, val dim2: Int, val dim3: Int,
                   private [mathstat] var v: Array [Array [Array [Double]]] = null)
          extends Error with Serializable

A tensor T is stored in a triple array [tijk]. Below is an example of a 2-by-2-by-2 tensor, T = [T::0 | T::1]

    [ t000  t010 | t001  t011 ]
    [ t100  t110 | t101  t111 ]

where each sheet T::k is a 2-by-2 matrix.


Note, ScalaTion allows the default names for the dimensions to be changed, so they are more sug-
gestive given the application, e.g., (row, column, channel) for one color image or (sheet, row, column) for
spreadsheets.
Ragged order 3 tensors RTensorD are also supported which allow the middle dimension to vary (be
ragged).

2.8.2 Four Dimensional Tensors
In ScalaTion, tensors with four dimensions are supported by the Tensor4D class. The default names for
the dimensions [111] were chosen to follow a common convention (row, column, sheet, channel).
    //  @param dim   size of the 1st level/dimension (row) of the tensor (height)
    //  @param dim2  size of the 2nd level/dimension (column) of the tensor (width)
    //  @param dim3  size of the 3rd level/dimension (sheet) of the tensor (depth)
    //  @param dim4  size of the 4th level/dimension (channel) of the tensor (spectra)

    class Tensor4D (val dim: Int, val dim2: Int, val dim3: Int, val dim4: Int,
                    private [mathstat] var v: Array [Array [Array [Array [Double]]]] = null)
          extends Error with Serializable

Such a tensor T is stored in a quad array [tijkl ].


Ragged order 4 tensors RTensor4D are also supported which allow the middle two dimensions to vary (be
ragged).

2.9 Exercises
1. Draw two 2-dimensional non-zero vectors, x and y, whose dot product x · y is zero.
2. A vector can be transformed into a unit vector in the same direction by dividing by its norm, x / ‖x‖. Let y = 2x and show that the dot product of the corresponding unit vectors equals one. This means that their Cosine Similarity equals one.

cosxy = cos(θ) = x · y / (‖x‖ ‖y‖)        where θ is the angle between the vectors

When would the Cosine Similarity be -1? When would it be 0?

3. Correlation ρxy vs. Cosine Similarity cosxy . What does it mean when the correlation (cosine similarity)
is 1, 0, -1, respectively. In general, does ρxy = cosxy ? What about in special cases?

4. Given the matrix X and the vector y, solve for the vector b in the equation y = Xb using matrix
inversion and LU factorization.
    import scalation.mathstat.{MatrixD, VectorD, Fac_LU}
    val x = MatrixD ((2, 2), 1, 3,
                             2, 1)
    val y = VectorD (1, 7)
    println ("using inverse: b = X^-1 y = " + x.inverse * y)
    println ("using LU fact: b = " + { val lu = new Fac_LU (x); lu.factor ().solve (y) })

Modify the code to show the inverse matrix X −1 and the factorization into the L and U matrices.
5. If Q is an orthogonal matrix, then Q⊤Q becomes what type of matrix? What about QQ⊤? Illustrate with an example 3-by-3 matrix. What is the inverse of Q?

6. Show that the Hessian matrix of a scalar-valued function f : Rn → R is the transpose of the Jacobian
of the gradient, i.e.,

Hf(x) = [J∇f (x)]⊤

7. Critical points for a function f : Rn → R occur when ∇f(x) = 0. How can the Hessian Matrix be used to decide whether a particular critical point is a local minimum or maximum?

8. Define three functions, f1 (x, y), f2 (x, y) and f3 (x, y), that have critical points (zero gradient) at the
point [2, 3] such that this point is (a) a minimal point, (b) a maximal point, (c) a saddle point,
respectively. Compute the Hessian matrix at this point for each function and use it to explain the type
of critical point. Plot the three surfaces in 3D.
Hint: see https://www.math.usm.edu/lambers/mat280/spr10/lecture8.pdf

9. Determine the eigenvalues for the matrix A given in the section on eigenvalues and eigenvectors, by
setting the determinant of A − λI equal to zero.

    [ 2−λ    0  ]
    [  0    3−λ ]

to obtain the following characteristic polynomial.

(2 − λ)(3 − λ) − 0 = 0

Solve for all roots of this polynomial to determine the eigenvalues.

10. A vector space V over field K (e.g., R or C) is a set of objects, e.g., vectors x, y, and z, and two
operations, addition and scalar multiplication,

x, y ∈ V =⇒ x + y ∈ V (2.22)
x ∈ V and a ∈ K =⇒ ax ∈ V (2.23)

satisfying the following conditions/axioms

(x + y) + z = x + (y + z)
x+y = y+x
∃ 0 ∈ V s.t. x + 0 = x
∃ − x ∈ V s.t. x + (−x) = 0
(ab)x = a(bx)
a(x + y) = ax + ay
(a + b)x = ax + bx
∃1 ∈ K s.t. 1x = x

Give names to these axioms and illustrate them with examples.

11. A normed vector space V over field K is a vector space with a function defined that gives the length
(norm) of a vector,

x ∈ V =⇒ ‖x‖ ∈ R+

satisfying the following conditions/axioms

‖ax‖ = |a| ‖x‖
‖x‖ > 0 unless x = 0
‖x + y‖ ≤ ‖x‖ + ‖y‖

A norm induced metric called distance can be defined,

d(x, y) = ‖x − y‖

The ℓp-norm is defined as follows:

‖x‖p = ( ∑_{i=0}^{n−1} |xi|^p )^{1/p}

Norms and distances are very useful in data science, for example, loss functions used to judge/optimize
models are often defined in terms of norms or distances.
Show that the last axiom, called the triangle inequality, holds for ℓ2-norms.
Hint: ‖x‖2² is the sum of the squared elements of x.

12. An inner product space H over field K is a vector space with one more operation, inner product,

x, y ∈ H =⇒ ⟨x, y⟩ ∈ K

satisfying the following conditions/axioms

⟨x, y⟩ = ⟨y, x⟩*
⟨ax + by, z⟩ = a ⟨x, z⟩ + b ⟨y, z⟩
⟨x, x⟩ > 0 unless x = 0

Note, the complex conjugate negates the imaginary part of a complex number, e.g., (c + di)∗ = c − di
Show that an n-dimensional Euclidean vector space using the definition of dot product given in this
chapter is an inner product space over R.

13. Explain the meaning of the following statement, “a tensor of order 2 for a given coordinate system can
be represented by a matrix.”
Hint: see “Tensors: A Brief Introduction” [32]

2.10 Further Reading
1. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares [21]

2. Matrix Computations [58]

3. Tensors and Hypermatrices [111]

Chapter 3

Probability

Probability is used to measure the likelihood of certain events occurring, such as flipping a coin and getting a
head, rolling a pair of dice and getting a sum of 7, or getting a full house in five card draw. Given a random
experiment, the sample space Ω is the set of all possible outcomes.

3.1 Probability Measure


Definition: A probability measure P can be defined axiomatically as follows:

P(A) ≥ 0        for any event A ⊆ Ω
P(Ω) = 1        (3.1)
P(∪ Ai) = ∑ P(Ai)        for a countable collection of disjoint events

Technically speaking, an event is a measurable subset of Ω (see [41] for a measure-theoretic definition).
Letting F be the set of all possible events, one may define a probability space as follows:
Definition: A probability space is defined as a triple (Ω, F, P ).
Given an event A ∈ F, the probability of its occurrence is restricted to the unit interval, P (A) ∈ [0, 1].
Thus, P may be viewed as a function that maps events to the unit interval.

P : F → [0, 1] (3.2)

3.1.1 Joint Probability


Given two events A and B, the joint probability of their co-occurrence is denoted by

P (AB) = P (A ∩ B) ∈ [0, min(P (A), P (B))] (3.3)

If events A and B are independent, simply take the product of the individual probabilities,

P (AB) = P (A)P (B) (3.4)

3.1.2 Conditional Probability
The conditional probability of the occurrence of event A, given it is known that event B has occurred/will
occur is

P(A|B) = P(AB) / P(B)                                                     (3.5)

If events A and B are independent, the conditional probability reduces to

P(A|B) = P(AB) / P(B) = P(A)P(B) / P(B) = P(A)                            (3.6)
In other words, the occurrence of event B has no effect on the probability of event A occurring.

Bayes Theorem

An important theorem involving conditional probability is Bayes Theorem.

P(A|B) = P(B|A) P(A) / P(B)                                               (3.7)

When determining the conditional probability of A given B directly is difficult, one may try going the other
direction and first determine the conditional probability of B given A.

Example

Consider flipping two coins. What is the Sample/Outcome Space Ω?

Ω = {[T, T ], [T, H], [H, T ], [H, H]} = {ω1 , ω2 , ω3 , ω4 }

The size of the outcome space is 4 and since the event space F contains all subsets of Ω, its size is 24 = 16.
Define the following two events:

• event A = first coin showing heads and

• event B = at least one head was flipped.

What is the probability that event A occurred, given that you know that event B occurred? If fair coins are
used, the probability of a head (or tail) is 1/2 and the probabilities reduce to the ratios of set sizes.

P(A|B) = P(AB) / P(B) = |A ∩ B| / |B| = |{ω3, ω4} ∩ {ω2, ω3, ω4}| / |{ω2, ω3, ω4}| = 2/3
This simplification can be done whenever all outcomes are equi-probable.
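The counting argument can be checked with a few lines of plain Scala (a sketch; no ScalaTion classes are
needed). Outcomes are encoded as pairs with 0 for tails and 1 for heads.

@main def twoCoinsTest (): Unit =
    val omega = for c1 <- 0 to 1; c2 <- 0 to 1 yield (c1, c2)         // sample space of 4 outcomes
    val a     = omega.filter (_._1 == 1)                              // event A: first coin shows heads
    val b     = omega.filter (w => w._1 == 1 || w._2 == 1)            // event B: at least one head
    println (s"P(A|B) = ${a.intersect (b).size.toDouble / b.size}")   // ratio of set sizes, expect 2/3
end twoCoinsTest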

3.2 Random Variable
Rather than just looking at individual events, e.g., E1 or E2 , one is often more interested in the probability
that random variables take on certain values.
Definition: A random variable y is a function that maps outcomes in the sample space Ω into a set/domain
of numeric values Dy .

y : Ω → Dy (3.8)

Some commonly used domains are real numbers R, integers Z, natural numbers N, or subsets thereof. An
example of a mapping from outcomes to numeric values is tail → 0, head → 1. In other cases such as the
roll of one dice, the map is the identity function.
One may think of a random variable y (blue font) as taking on values from a given domain Dy . With a
random variable, its value is uncertain, i.e., its value is only known probabilistically.
For A ⊆ Dy one can measure the probability of the random variable y taking on a value from the set
A. This is denoted by

P (y ∈ A) (3.9)

This really means the probability of event E which maps to set A

E = y −1 (A) (3.10)
P (E) (3.11)

where y −1 (A) is the inverse image of A.

3.2.1 Discrete Random Variable


A discrete random variable is defined on finite or countably infinite domains. For example, the probability
of rolling a natural in dice (sum of 7 or 11 with two dice) is given by

P (y ∈ {7, 11}) = 6/36 + 2/36 = 8/36 = 2/9

3.2.2 Continuous Random Variable


A continuous random variable is defined on uncountably infinite domains. For example, the probability of
my tee shot on a par-3 golf hole ending up within 10 meters of the hole is 0.1 or 10 percent.

P (y ∈ [0, 10]) = 0.1

3.3 Probability Distribution
A random variable y is characterized by how its probability is distributed over its domain Dy . This can be
captured by functions that map Dy to R+ .

3.3.1 Cumulative Distribution Function


The most straightforward way to do this is to examine the probability measure for a random variable in
terms of a Cumulative Distribution Function (CDF).

Fy : Dy → [0, 1] (3.12)

It measures the amount of probability or mass accumulated over the domain up to and including the point y.
The color highlighted symbol y is the random variable, while y simply represents a value.

Fy (y) = P (y ≤ y) (3.13)

To illustrate the concept, let x1 and x2 be the number on dice 1 and dice 2, respectively. Let y = x1 + x2 ,
then Fy (6) = P (y ≤ 6) = 5/12. The entire CDF for the discrete random variable y (roll of two dice), Fy (y)
is

{(2, 1/36), (3, 3/36), (4, 6/36), (5, 10/36), (6, 15/36), (7, 21/36), (8, 26/36), (9, 30/36), (10, 33/36), (11, 35/36), (12, 36/36)}

As another example, the CDF for a continuous random variable y that is defined to be uniformly distributed
on the interval [0, 2] is

Fy(y) = y / 2   on [0, 2]
When random variable y follows this CDF, we may say that y is distributed as Uniform (0, 2), symbolically,
y ∼ Uniform (0, 2).

3.3.2 Probability Mass Function


While the CDF indicates accumulated probability or mass (totaling 1), examining probability or mass locally
can be more informative. In case the random variable is discrete (i.e., Dy is discrete), a probability mass
function (pmf) may be defined.

py : Dy → [0, 1] (3.14)

This function indicates the amount of mass/probability at point yi ∈ Dy ,

py (yi ) = Fy (yi ) − Fy (yi−1 ) (3.15)

It can be calculated as the first difference of the CDF, i.e., the amount of accumulated mass at point yi
minus the amount of accumulated mass at the previous point yi−1 .
For one dice x1 , the pmf is

{(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}

A second dice x2 will have the same pmf. Both random variables follow the Discrete Uniform Distribution,
Randi (1, 6).

px(x) = (1/6) · 1{1≤x≤6}                                                  (3.16)

where 1{c} is the indicator function (if c then 1 else 0).


If the two random variables are added y = x1 + x2 , the pmf for the random variable y (roll of two dice),
py (y) is

{(2, 1/36), (3, 2/36), (4, 3/36), (5, 4/36), (6, 5/36), (7, 6/36), (8, 5/36), (9, 4/36), (10, 3/36), (11, 2/36), (12, 1/36)}

The random variable y follows the Discrete Triangular Distribution (that peaks in the middle) and not the
flat Discrete Uniform Distribution.

py(y) = ( min(y − 1, 13 − y) / 36 ) · 1{2≤y≤12}                           (3.17)

Using the absolute value, this may be written as follows:

py(y) = (6 − |7 − y|) / 36   for y ∈ {2, . . . , 12}                      (3.18)
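As a quick check, the pmf for the sum of two dice can be estimated by enumerating all 36 equally likely
outcomes and compared with the closed form in Equation 3.18. This is a plain Scala sketch (no ScalaTion
classes are needed).

@main def twoDicePMFTest (): Unit =
    val sums = for d1 <- 1 to 6; d2 <- 1 to 6 yield d1 + d2           // all 36 outcomes for the sum
    for y <- 2 to 12 do
        val byCount   = sums.count (_ == y) / 36.0                    // pmf by counting outcomes
        val byFormula = (6 - math.abs (7 - y)) / 36.0                 // pmf from Equation 3.18
        println (s"p_y($y) = $byCount, formula gives $byFormula")
end twoDicePMFTest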

3.3.3 Probability Density Function

Suppose y is defined on the continuous domain, e.g., Dy = [0, 2], and that mass/probability is uniformly
spread amongst all the points in the domain. In such situations, it is not productive to consider the mass at
one particular point. Rather one would like to consider the mass in a small interval and scale it by dividing
by the length of the interval. In the limit this is the derivative which gives the density. For a continuous
random variable, if the function Fy is differentiable, a probability density function (pdf) may be defined.

f y : D y → R+ (3.19)

It is calculated as the first derivative of the CDF, i.e.,

fy(y) = dFy(y) / dy                                                       (3.20)

For example, the pdf for a uniformly distributed random variable y on [0, 2] is

fy(y) = d/dy (y/2) = 1/2   on [0, 2]

The pdf for the Uniform Distribution is shown in the figure below.

[Figure: pdf for the Uniform (0, 2) Distribution, fy(y) vs. y]

Random variates of this type may be generated using ScalaTion’s Uniform (0, 2) class within the
scalation.random package.
val rvg = Uniform (0, 2)                       // random variate generator for Uniform (0, 2)
val yi  = rvg.gen                              // generate one random variate

For another example, the pdf for an exponentially distributed random variable y on [0, ∞) with rate
parameter λ is

fy (y) = λe−λy on [0, ∞)

The pdf for the Exponential (λ = 1) Distribution is shown in the figure below.

[Figure: pdf for the Exponential (λ = 1) Distribution, fy(y) vs. y]

Going the other direction, the CDF Fy (y) can be computed by summing the pmf py (y)

Fy(y) = Σ_{xi ≤ y} py(xi)                                                 (3.21)

or integrating the pdf fy(y).

Fy(y) = ∫_{−∞}^{y} fy(x) dx                                               (3.22)

3.4 Empirical Distribution
An empirical distribution may be used to describe a dataset probabilistically. Consider a dataset (X, y)
where X ∈ Rm×n is the data matrix collected about the predictor variables and y ∈ Rm is the data vector
collected about the response variable. In other words, the dataset consists of m instances of an n-dimensional
predictor vector xi and a response value yi .
The joint empirical probability mass function (epmf) may be defined on the basis of a given dataset
(X, y).

pdata(x, y) = ν(x, y) / m = (1/m) Σ_{i=0}^{m−1} 1{xi = x, yi = y}          (3.23)

where ν(x, y) is the frequency count and 1{c} is the indicator function (if c then 1 else 0).
The corresponding Empirical Cumulative Distribution Function (ECDF) may be defined as follows:

Fdata(x, y) = (1/m) Σ_{i=0}^{m−1} 1{xi ≤ x, yi ≤ y}                        (3.24)
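For a single response vector y, the ECDF can be computed with a small helper function. The function below
is a hypothetical sketch (it is not part of ScalaTion) and only assumes the VectorD indexing shown in later
code listings.

import scalation.mathstat.VectorD

def ecdf (y: VectorD)(yy: Double): Double =
    var count = 0
    for i <- y.indices if y(i) <= yy do count += 1                    // instances with y_i <= yy
    count.toDouble / y.indices.size                                   // fraction of mass at or below yy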

3.5 Expectation
Using the definition of a CDF, one can determine the expected value (or mean) for random variable y using
a Riemann-Stieltjes integral.
E[y] = ∫_{Dy} y dFy(y)                                                    (3.25)

The mean specifies the center of mass, e.g., a two-meter rod with the mass evenly distributed throughout
would have a center of mass at 1 meter. Although it will not affect the center of mass calculation, since the
total probability is 1, unit mass is assumed (one kilogram). The center of mass is the balance point in the
middle of the bar.

3.5.1 Continuous Case


When y is a continuous random variable, we may write the mean as follows:
E[y] = ∫_{Dy} y fy(y) dy                                                  (3.26)

The mean of y ∼ Uniform (0, 2) is


E[y] = ∫_0^2 y (1/2) dy = 1.

3.5.2 Discrete Case


When y is a discrete random variable, we may write

E[y] = Σ_{y∈Dy} y py(y)                                                   (3.27)

The mean for rolling two dice is E [y] = 7. One way to interpret this is to imagine winning y dollars by
playing a game, e.g., two dollars for rolling a 2 and twelve dollars for rolling a 12, etc. The expected earnings
when playing the game once is seven dollars. Also, by the law of large numbers, the average earnings for
playing the game n times will converge to seven dollars as n gets large.
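The value E[y] = 7 can be verified directly from the two-dice pmf with a line of plain Scala.

val ey = (2 to 12).map (y => y * (6 - math.abs (7 - y)) / 36.0).sum   // sum of y * p_y(y)
println (s"E[y] = $ey")                                               // 7.0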

3.5.3 Variance
The variance of random variable y is given by

V[y] = E[(y − E[y])²]                                                     (3.28)

The variance specifies how the mass spreads out from the center of mass. For example, the variance of y ∼
Uniform (0, 2) is
V[y] = E[(y − 1)²] = ∫_0^2 (y − 1)² (1/2) dy = 1/3

That is, the variance of the one kilogram, two-meter rod is 1/3 kilogram meter². Again, for probability to
be viewed as mass, unit mass (one kilogram) must be used, so the answer may also be given as 1/3 meter².
Similarly to interpreting the mean as the center of mass, the variance corresponds to the moment of inertia.
The standard deviation is simply the square root of variance.
SD[y] = √V[y]                                                             (3.29)

For the two-meter rod, the standard deviation is 1/√3 = 0.57735. The percentage of mass within one
standard deviation unit of the center of mass is then 58%. Many distributions, such as the Normal (Gaussian)
distribution concentrate mass closer to the center. For example, the Standard Normal Distribution has the
following pdf.

fy(y) = (1 / √(2π)) e^{−y²/2}                                             (3.30)

The mean for this distribution is 0, while the variance is 1. The percentage of mass within one standard
deviation unit of the center of mass is 68%. The pdf for the Normal (µ = 0, σ 2 = 1) Distribution is shown
in the figure below.

[Figure: pdf for the Standard Normal Distribution, fy(y) vs. y]
 
Note, the uncentered variance (or mean square) of the random variable y is simply E[y²].

3.5.4 Covariance
The covariance of two random variables x and y is given by

C [x, y] = E [(x − E [x])(y − E [y])] (3.31)


The covariance specifies whether the two random variables have similar tendencies. If the random variables
are independent, the covariance will be zero, while similar tendencies show up as positive covariance and
dissimilar tendencies as negative covariance. Correlation normalizes covariance to the domain [−1, 1]. Co-
variance can be extended to more than two random variables. Let z be a vector of k random variables, then
a covariance matrix is produced.

C[z] = [ C[zi, zj] ]_{0≤i,j<k}                                            (3.32)
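Given two data vectors, a sample covariance can be computed with a short helper. This sketch only uses the
VectorD operations (mean and indexing) that appear in this chapter's code listings; it is not a ScalaTion
method.

import scalation.mathstat.VectorD

def sampleCov (x: VectorD, y: VectorD): Double =
    val (mx, my) = (x.mean, y.mean)
    var s = 0.0
    for i <- x.indices do s += (x(i) - mx) * (y(i) - my)              // sum of centered cross products
    s / (x.indices.size - 1)                                          // divide by m - 1 for a sample estimate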

3.6 Algebra of Random Variables
When random variables x1 and x2 are added to create a new random variable y,

y = x1 + x2

how is y described in terms of mean, variance and probability distribution? Also, what happens when a
random variable is multiplied by a constant?

y = ax

3.6.1 Expectation is a Linear Operator


The expectation/mean of the sum is simply the sum of the means.

E [y] = E [x1 ] + E [x2 ] (3.33)

The expectation of a random variable multiplied by a constant, is the constant multiplied by the random
variable’s expectation.

E [ay] = aE [y] (3.34)

The last two equations imply that expectation is a linear operator.

3.6.2 Variance is not a Linear Operator


The variance of the sum is the sum of variances plus twice the covariance.

V [y] = V [x1 ] + V [x2 ] + 2 C [x1 , x2 ] (3.35)

When the random variables are independent, the covariance is zero, so the variance of the sum is just the sum of
variances.

V [y] = V [x1 ] + V [x2 ] (3.36)

The variance of a random variable multiplied by a constant, is the constant squared multiplied by the random
variable’s variance.

V [ay] = a2 V [y] (3.37)

See the exercises for derivations.

3.6.3 Convolution of Probability Distributions


Determining the new probability distribution of the sum of two random variables is more difficult.

Convolution: Discrete Case

Assuming the random variables are independent and discrete, the pmf of the sum py is the convolution of
two pmfs px1 and px2 .

py = px1 ∗ px2 (3.38)

py(y) = Σ_{x∈Dx} px1(x) px2(y − x)                                        (3.39)

For example, letting x1, x2 ∼ Bernoulli(p), i.e., px1(x) = p^x (1 − p)^{1−x} on Dx = {0, 1}, gives

py(0) = Σ_{x∈Dx} px1(x) px2(0 − x) = (1 − p)²
py(1) = Σ_{x∈Dx} px1(x) px2(1 − x) = 2p(1 − p)
py(2) = Σ_{x∈Dx} px1(x) px2(2 − x) = p²

which indicates that y ∼ Binomial(p, 2). The pmf for the Binomial(p, n) distribution is
 
py(y) = C(n, y) p^y (1 − p)^{n−y}                                         (3.40)

where C(n, y) denotes the binomial coefficient.

Consider the sum of two dice, y = x + z, where x, z ∼ DiscreteUniform(1, 6).

[Figure: joint probability for two dice, dice 2 value z vs. dice 1 value x]

As the joint pmf pxz (xi , zj ) = px (xi )pz (zj ) = 1/36 is constant over all points, the convolution sum for a
particular value of y corresponds to the downward diagonal sum where the dice sum to that value, e.g.,
py (3) = 2/36, py (7) = 6/36.
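The discrete convolution can also be carried out numerically. The plain Scala sketch below convolves the
pmf of one fair die with itself; index k of the result corresponds to the sum k + 2.

def convolve (p: Array [Double], q: Array [Double]): Array [Double] =
    val r = Array.ofDim [Double] (p.length + q.length - 1)
    for i <- p.indices; j <- q.indices do r(i + j) += p(i) * q(j)     // accumulate products along diagonals
    r

val die = Array.fill (6)(1.0 / 6.0)                                   // pmf of one fair die (values 1..6)
val two = convolve (die, die)                                         // pmf of the sum of two dice
for k <- two.indices do println (s"p_y(${k + 2}) = ${two(k)}")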

Convolution: Continuous Case

Now, assuming the random variables are independent and continuous, the pdf of the sum fy is the convolution
of two pdfs fx1 and fx2 .

fy = fx1 ∗ fx2 (3.41)

fy(y) = ∫_{Dx} fx1(x) fx2(y − x) dx                                       (3.42)

For example, letting x1, x2 ∼ Uniform(0, 1), i.e., fx1(x) = 1 on Dx = [0, 1], gives

for y ∈ [0, 1]:   fy(y) = ∫_{[0, y]} fx1(x) fx2(y − x) dx = y
for y ∈ [1, 2]:   fy(y) = ∫_{[y−1, 1]} fx1(x) fx2(y − x) dx = 2 − y

which indicates that y ∼ Triangular(0, 1, 2).

3.6.4 Central Limit Theorem


When several random variables are added (or averaged), interesting phenomena occurs, e.g., consider the
distribution of y as the sum of m random variables.

y = Σ_{i=0}^{m−1} xi

When xi ∼ Uniform(0, 1) with mean 1/2 and variance 1/12, then for m large enough y will follow a Normal
distribution

y ∼ Normal(µ, σ²)

where µ = m/2 and σ² = m/12. The pdf for the Normal Distribution is

fy(y) = (1 / (√(2π) σ)) e^{−(1/2)((y−µ)/σ)²}                              (3.43)
For most distributions, summed random variables will be approximately distributed as Normal, as in-
dicated by the Central Limit Theorem (CLT); for proofs see [47, 11]. Suppose xi ∼ F with mean µx and
variance σx2 < ∞, then the sum of m independent and identically distributed (iid) random variables is
distributed as follows:

y = Σ_{i=0}^{m−1} xi ∼ N(m µx, m σx²)   as m → ∞                          (3.44)

This is one simple form of the CLT. See the exercises for a visual illustration of the CLT.
Similarly, the sum of m independent and identically distributed random variables (with mean µx and
variance σx²) divided by m will also be Normally distributed for sufficiently large m.

y = (1/m) Σ_{i=0}^{m−1} xi
The expectation of y is (1/m) m µx = µx, while the variance is σx²/m, so

y ∼ Normal(µx , σx2 /m)

As, E [y] = µx , y can serve as an unbiased estimator of µx . This can be transformed to the Standard Normal
Distribution with the following transformation.

z = (y − µx) / (σx/√m) ∼ Normal(0, 1)
The Normal distribution is also referred to as the Gaussian distribution. See the exercises for related
distributions: Chi-square, Student’s t and F .

3.7 Median, Mode and Quantiles
As stated, the mean is the expected value, a probability weighted sum/integral of the values in the domain of
the random variable. Other ways of characterizing a distribution are based more directly on the probability.

3.7.1 Median
Moving along the distribution, the place at which half of the mass is below you and half is above you is the
median.

P(y ≤ median) ≥ 1/2   and   P(y ≥ median) ≥ 1/2                           (3.45)
Given equally likely values (1, 2, 3), the median is 2. Given equally likely values (1, 2, 3, 4), there are two
common interpretations for the median: the smallest value satisfying the above equation (i.e., 2) or the
average of the values satisfying the equation (i.e., 2.5). The median for two dice (with the numbers summed),
which follow the Triangular distribution, is 7.

3.7.2 Quantile
The median is also referred to as the half quantile.

Q[y] = Fy⁻¹(1/2)                                                          (3.46)
More generally, the p ∈ [0, 1] quantile is given by

Qp[y] = Fy⁻¹(p)                                                           (3.47)


where Fy−1 is the inverse CDF (iCDF). For example, recall the CDF for Uniform (0, 2) is

p = Fy(y) = y/2   on [0, 2]
Taking the inverse yields the iCDF.

Fy−1 (p) = 2p on [0, 1]


Consequently, the median Q[y] = Fy⁻¹(1/2) = 1.
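For this distribution the quantiles are easy to compute directly from the inverse CDF, as the plain Scala
sketch below illustrates.

def iCDF (p: Double): Double = 2.0 * p                                // F^-1(p) = 2p for Uniform (0, 2)
println (s"median = ${iCDF (0.5)}, 0.9-quantile = ${iCDF (0.9)}")     // 1.0 and 1.8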

3.7.3 Mode
Similarly, one may be interested in the mode, which is the average of the points of maximal probability mass.

M[y] = argmax_{y∈Dy} py(y)                                                (3.48)

The mode for rolling two dice is y = 7. For continuous random variables, it is the average of points of
maximal probability density.

M[y] = argmax_{y∈Dy} fy(y)                                                (3.49)

For the two-meter rod, the mean, median and mode are all equal to 1.

3.8 Joint, Marginal and Conditional Distributions
Knowledge of one random variable may be useful in narrowing down the possibilities for another random
variable. Therefore, it is important to understand how probability is distributed in multiple dimensions.
There are three main concepts: joint, marginal and conditional.
In general, the joint CDF for two random variables x and y is

Fxy (x, y) = P (x ≤ x, y ≤ y) (3.50)

3.8.1 Discrete Case: Joint and Marginal Mass


In the discrete case, the joint pmf for two random variables x and y is

pxy (xi , yj ) = Fxy (xi , yj ) − [Fxy (xi−1 , yj ) + Fxy (xi , yj−1 ) − Fxy (xi−1 , yj−1 )] (3.51)

See the exercises to check this formula for the matrix shown below.
Imagine nine weights placed in a 3-by-3 grid with the number indicating the relative mass.

1 val mat = MatrixD ((3 , 3) , 1 , 2 , 3 ,


2 4, 5, 6,
3 7 , 8 , 9)

Dividing the matrix by 45 or calling toProbability in the scalation.mathstat.Probability object yields


a probability matrix representing a joint probability mass function (pmf), pxy (xi , yj ),

1 MatrixD (0.0222222 , 0.0444444 , 0.0666667 ,


2 0.0888889 , 0.111111 , 0.133333 ,
3 0.155556 , 0.177778 , 0.200000)

The marginal pmfs are computed as follows:

px(xi) = Σ_{yj∈Dy} pxy(xi, yj)     (sum out y)                            (3.52)

py(yj) = Σ_{xi∈Dx} pxy(xi, yj)     (sum out x)                            (3.53)

Carrying out the summations or calling margProbX (pxy) for px (xi ) and margProbY (pxy) for py (yj ) gives,

1 Marginal X : px = VectorD (0.13333333333333333 , 0.33333333333333337 , 0 . 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 )


2 Marginal Y : py = VectorD (0.26666666666666666 , 0.33333333333333337 , 0.4)

Scaling back by multiplying by 45 produces,

1 45* Marginal X : px = VectorI (6 , 15 , 24)


2 45* Marginal Y : py = VectorI (12 , 15 , 18)

It is now easy to see that px is based on row sums, while py is based on column sums.

3.8.2 Continuous Case: Joint and Marginal Density
In the continuous case, the joint pdf for two random variables x and y is

fxy(x, y) = ∂²Fxy(x, y) / ∂x∂y                                            (3.54)

Consider the following joint pdf that specifies the distribution of one kilogram of mass (or probability)
uniformly over a 2-by-3 meter plate.

fxy(x, y) = 1/6   on [0, 2] × [0, 3]

[Figure: one kilogram of mass spread uniformly over the 2-by-3 plate, y vs. x]

The joint CDF is then a double integral,

Fxy(x, y) = ∫_0^x ∫_0^y (1/6) dy dx = xy/6

There are two marginal pdfs that are single integrals: Think of the mass of the vertical red line being
collected into the thick red bar at the bottom. Collecting all such lines creates the red bar at the bottom
and its mass is distributed as follows:

fx(x) = ∫_0^3 (1/6) dy = 3/6 = 1/2   on [0, 2]     (integrate out y)

Now think of the mass of the horizontal green line being collected into the thick green bar on the left.
Collecting all such lines creates the green bar on the left and its mass is distributed as follows:

fy(y) = ∫_0^2 (1/6) dx = 2/6 = 1/3   on [0, 3]     (integrate out x)

For more details and examples, see Class 7 of [138].

3.8.3 Discrete Case: Conditional Mass
Conditional probability can be examined locally. Given two discrete random variables x and y, the conditional
mass function of x given y is defined as follows:

px|y(xi, yj) = P(x = xi | y = yj) = pxy(xi, yj) / py(yj)                  (3.55)

where pxy (xi , yj ) is the joint mass function. Again, the marginal mass functions are

px(xi) = Σ_{yj∈Dy} pxy(xi, yj)
py(yj) = Σ_{xi∈Dx} pxy(xi, yj)

Consider the following example: Roll two dice. Let x be the value on the first dice and y be the sum of
the two dice. Compute the conditional pmf for x given that it is known that y = 2.

px|y(xi, 2) = P(x = xi | y = 2) = pxy(xi, 2) / py(2)                      (3.56)
Try this problem for each possible value for y.
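One way to work the problem is by enumeration, as in the plain Scala sketch below, which conditions on
y = 7 (other values of y can be substituted).

@main def condMassTest (): Unit =
    val pairs = for d1 <- 1 to 6; d2 <- 1 to 6 yield (d1, d1 + d2)    // (first die, sum of both dice)
    val ySum  = 7                                                     // condition on y = ySum
    val given = pairs.filter (_._2 == ySum)                           // outcomes consistent with y = ySum
    for x <- 1 to 6 do
        println (s"P(x = $x | y = $ySum) = ${given.count (_._1 == x).toDouble / given.size}")
end condMassTest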

3.8.4 Continuous Case: Conditional Density


Similarly, for two continuous random variables x and y, the conditional density function of x given y is
defined as follows:

fx|y(x, y) = fxy(x, y) / fy(y)                                            (3.57)

where fxy (x, y) is the joint density function. The marginal density functions are
fx(x) = ∫_{y∈Dy} fxy(x, y) dy                                             (3.58)

fy(y) = ∫_{x∈Dx} fxy(x, y) dx                                             (3.59)

The marginal density function in the x-dimension is the probability mass projected onto the x-axis from all
other dimensions, e.g., for a bivariate distribution with mass distributed in the first xy quadrant, all the
mass will fall onto the x-axis.
Consider the example below where the random variable x indicates how far down the center-line of a
straight golf hole the golf ball was driven in units of 100 yards. The random variable y indicates how far left
(positive) or right (negative) the golf ball ends up from the center of the fairway. Let us call these random
variable distance and dispersion. The golfer teed the ball up at location [0, 0]. For simplicity, assume the
probability is uniformly distributed within the triangle.

[Figure: the triangular region of possible ball positions, dispersion y vs. distance x]

As the area of the triangle is 3, the joint density function is

fxy(x, y) = 1/3   on x ∈ [0, 3], y ∈ [−x/3, x/3]
The distribution (density) of the driving distance down the center-line is given by the marginal density for
the random variable x

fx(x) = ∫_{−x/3}^{x/3} (1/3) dy = [y/3]_{−x/3}^{x/3} = 2x/9
Therefore, the conditional density of dispersion y given distance x is given by

fy|x(x, y) = fxy(x, y) / fx(x) = (1/3) / (2x/9) = 3/(2x)                  (3.60)

3.8.5 Independence
The two random variables x and y are said to be independent, denoted x ⊥ y, when the joint CDF (equivalently
pmf/pdf) can be factored into the product of its marginal CDFs (equivalently pmfs/pdfs).

Fxy (x, y) = Fx (x) Fy (y) (3.61)

For example, determine which of the following two joint density functions defined on [0, 1]2 signify indepen-
dence.

fxy (x, y) = 4xy


fxy (x, y) = 8xy − x − y

For the first joint density, the two marginal densities are the following:

fx(x) = ∫_0^1 4xy dy = [4xy²/2]_0^1 = 2x

fy(y) = ∫_0^1 4xy dx = [4x²y/2]_0^1 = 2y

The product of the marginal densities fx (x) fy (y) = 4xy is the joint density.
Compute the conditional density under the assumption that the random variables, x and y, are indepen-
dent.

fx|y(x, y) = fxy(x, y) / fy(y)                                            (3.62)

As the joint density can be factored, fxy (x, y) = fx (x) fy (y), we obtain,

fx|y(x, y) = fx(x) fy(y) / fy(y) = fx(x)                                  (3.63)

showing that the value of random variable y has no effect on x. See the exercises for a proof that independence
implies zero covariance (and therefore zero correlation).

3.8.6 Conditional Expectation


The value of one random variable may influence the expected value of another random variable. The
conditional expectation of random variable x given random variable y is defined as follows:

E[x|y = y] = ∫_{Dx} x dFx|y(x, y)                                         (3.64)

When y is a discrete random variable, we may write

E[x|y = y] = Σ_{x∈Dx} x px|y(x, y)                                        (3.65)

When y is a continuous random variable, we may write


E[x|y = y] = ∫_{Dx} x fx|y(x, y) dx                                       (3.66)

Consider the previous example on the dispersion y of a golf ball conditioned on the driving distance x.
Compute the conditional mean and the conditional variance for y given x.

µy|x = E[y|x = x] = ∫_{−x/3}^{x/3} y fy|x(x, y) dy

σ²y|x = E[(y − µy|x)² | x = x] = ∫_{−x/3}^{x/3} (y − µy|x)² fy|x(x, y) dy

3.8.7 Conditional Independence
A wide class of modeling techniques are under the umbrella of probabilistic graphical models (e.g., Bayesian
Networks and Markov Networks). They work by factoring a joint probability based on conditional indepen-
dencies. Random variables x and y are conditionally independent given z, denoted

x ⊥ y|z

means that

Fx,y|z (x, y, z) = Fx|z (x, z) Fy|z (y, z)

3.9 Odds
Another way of looking at probability is odds. This is the ratio of the probability of an event A occurring
over the probability of the event not occurring, S − A.

odds(y ∈ A) = P(y ∈ A) / P(y ∈ S − A) = P(y ∈ A) / (1 − P(y ∈ A))         (3.67)
For example, the odds of rolling a pair of dice and getting a natural is 8 to 28.

odds(y ∈ {7, 11}) = 8/28 = 2/7 = .2857
Of the 36 individual outcomes, eight will be a natural and 28 will not. Odds can be easily calculated from
probability.

odds(y ∈ {7, 11}) = P(y ∈ {7, 11}) / (1 − P(y ∈ {7, 11})) = (2/9) / (7/9) = 2/7 = .2857
Calculating probability from odds may be done as follows:

P(y ∈ {7, 11}) = odds(y ∈ {7, 11}) / (1 + odds(y ∈ {7, 11})) = (2/7) / (9/7) = 2/9 = .2222
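The two conversions are one-liners, as in the plain Scala sketch below.

def odds (p: Double): Double = p / (1.0 - p)                          // probability to odds
def prob (o: Double): Double = o / (1.0 + o)                          // odds to probability
println (s"odds of a natural = ${odds (2.0 / 9.0)}")                  // 2/7 = .2857
println (s"probability       = ${prob (2.0 / 7.0)}")                  // 2/9 = .2222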

3.10 Example Problems
Understanding some of the techniques to be discussed requires some background in conditional probability.

1. Consider the probability of rolling a natural (i.e., 7 or 11) with two dice where the random variable y
is the sum of the dice.

P (y ∈ {7, 11}) = 1/6 + 1/18 = 2/9

If you knew you rolled a natural, what is the conditional probability that you rolled a 5 or 7?

P(y ∈ {5, 7} | y ∈ {7, 11}) = P(y ∈ {5, 7}, y ∈ {7, 11}) / P(y ∈ {7, 11}) = (1/6) / (2/9) = 3/4
This is the conditional probability of rolling a 5 or 7 given that you rolled a natural.
More generally, the conditional probability that y ∈ A given that x ∈ B is the joint probability divided
by the probability that x ∈ B.

P(y ∈ A | x ∈ B) = P(y ∈ A, x ∈ B) / P(x ∈ B)
where

P (y ∈ A, x ∈ B) = P (x ∈ B | y ∈ A) P (y ∈ A)

Therefore, the conditional probability of y given x is

P(y ∈ A | x ∈ B) = P(x ∈ B | y ∈ A) P(y ∈ A) / P(x ∈ B)
This is Bayes Theorem written using random variables, which provides an alternative way to compute
conditional probabilities, i.e., P (y ∈ {5, 7} | y ∈ {7, 11}) is

P(y ∈ {7, 11} | y ∈ {5, 7}) P(y ∈ {5, 7}) / P(y ∈ {7, 11}) = (3/5) · (5/18) / (2/9) = 3/4
2. To illustrate the usefulness of Bayes Theorem, consider the following problem from John Allen Paulos
that is hard to solve without it. Suppose you are given three coins, two fair and one counterfeit (always
lands heads). Randomly select one of the coins. Let x indicate whether the selected coin is fair (0) or
counterfeit (1). What is the probability that you selected the counterfeit coin?

P (x = 1) = 1/3

Obviously, the probability is 1/3, since the probability of picking any of the three coins is the same.
This is the prior probability.
Not satisfied with this level of uncertainty, you conduct experiments. In particular, you flip the selected
coin three times and get all heads. Let y indicate the number of heads rolled. Using Bayes Theorem,
we have,

P(x = 1 | y = 3) = P(y = 3 | x = 1) P(x = 1) / P(y = 3) = (1 · (1/3)) / (5/12) = 4/5

where P(y = 3) = (1/3)(1) + (2/3)(1/8) = 5/12. After conducting the experiments (collecting
evidence) the probability estimate may be improved. Now the posterior probability is 4/5. A small
numeric check is sketched below.
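A small numeric check of this calculation in plain Scala:

val prior    = 1.0 / 3.0                                              // P(x = 1), coin is counterfeit
val likeCtf  = 1.0                                                    // P(y = 3 | x = 1)
val likeFair = 0.5 * 0.5 * 0.5                                        // P(y = 3 | x = 0) = 1/8
val evidence = likeCtf * prior + likeFair * (1.0 - prior)             // P(y = 3) = 5/12
println (s"posterior = ${likeCtf * prior / evidence}")                // P(x = 1 | y = 3) = 4/5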

3.11 Estimating Parameters from Samples
Given a model for predicting a response value for y from a feature/predictor vector x,

y = f(x; b) + ε
one needs to pick a functional form for f and collect a sample of data to estimate the parameters b. The
sample will consist of m instances (yi , xi ) that form the response/output vector y and the data/input matrix
X.

y = f(X; b) + ε
There are multiple types of estimation procedures. The central ideas are to minimize error or maximize
the likelihood that the model would generate data like the sample. A common way to minimize error is to
minimize the Mean Squared Error (MSE). The error vector ε is the difference between the actual response
vector y and the predicted response vector ŷ.

ε = y − ŷ = y − f(X; b)
The mean squared error, based on the length (Euclidean norm) of the error vector ‖ε‖, is given by

E[‖ε‖²] = V[‖ε‖] + E[‖ε‖]²                                                (3.68)

where V[‖ε‖] is the error variance and E[‖ε‖] is the error mean. If the model is unbiased, the error mean will
be zero, in which case the goal is to minimize the error variance.

3.11.1 Sample Mean


Suppose the speeds of cars on an interstate highway are Normally distributed with a mean at the speed
limit of 70 mph (113 kph) and a standard deviation of 8 mph (13 kph), i.e., y ∼ N (µ, σ 2 ) in which case the
model is

y = µ + ε

where ε ∼ N(0, σ²). Create a sample of size m = 100 data points, using a Normal random variate generator.
The population values for the mean µ and standard deviation σ are typically unknown and need to be
estimated from the sample, hence the names sample mean µ̂ and sample standard deviation σ̂. Show the
generated sample, by plotting the data points and displaying a histogram.
1 @ main def sampleStats () : Unit =
2

3 val ( mu , sig ) = (70.0 , 8.0) // pop . mean and std dev


4 val m = 100 // sample size
5 val rvg = Normal ( mu , sig * sig ) // Normal random variate
6 val sample = VectorD ( for i <- 0 until m yield rvg . gen ) // sample from Normal dist
7 val ( mu_ , sig_ ) = ( sample . mean , sample . stdev ) // sample mean and std dev
8     println (s"(mu_, sig_) = ($mu_, $sig_)")
9 new Plot ( null , sample )
10 new Histogram ( sample )
11

12 end sampleStats

Imports: scalation.mathstat._, scalation.random._.

3.11.2 Confidence Interval


Now that you have an estimate for the mean, you begin to wonder if it is correct or, rather, close enough.
Generally, an estimate is considered close enough if its confidence interval contains the population mean.
Collect an iid sample of values into a vector y. Then the sample mean is simply

µ̂ = (1 · y) / m = (1/m) Σ_{i=0}^{m−1} yi
To create a confidence interval, we need to determine the variability or variance in the estimate µ̂.

V[µ̂] = V[ (1/m) Σ_{i=0}^{m−1} yi ] = (1/m²) Σ_{i=0}^{m−1} V[yi] = σ²/m
The difference between the estimate from the sample and the population mean is Normally distributed and
centered at zero (show that µ̂ is an unbiased estimator for µ, i.e., E [µ̂] = µ).

µ̂ − µ ∼ N(0, σ²/m)
We would like to transform the difference so that the resulting expression follows a Standard Normal
distribution. This can be done by dividing by σ/√m.

(µ̂ − µ) / (σ/√m) ∼ N(0, 1)
Consequently, the probability that the expression is greater than z is given by the CDF of the Standard
Normal distribution, FN (z).
 
P( (µ̂ − µ) / (σ/√m) > z ) = 1 − FN(z)
One might consider that if z = 2, two standard deviation units, then the estimate is not close enough. The
same problem can exist on the negative side, so we should require

|µ̂ − µ| / (σ/√m) ≤ 2
In other words,

|µ̂ − µ| ≤ 2σ/√m
This condition implies that µ would likely be inside the following confidence interval.
 
[ µ̂ − 2σ/√m, µ̂ + 2σ/√m ]
In this case it is easy to compute values for the lower and upper bounds of the confidence interval. The
interval half width is simply 2 · 8 / √100 = 1.6, which is to be subtracted from and added to the sample mean.
Use ScalaTion to determine the probability that µ is within such a confidence interval.
1 println ( s " 1 - F (2) = ${1 - normalCDF (2) } " )

The probability is one minus twice this value. If 1.96 is used instead of 2, what is the probability, expressed
as a percentage?
Typically, the population standard deviation is unlikely to be known. It would need to be estimated by
using the sample standard deviation, where the sample variance is

σ̂² = (1/(m − 1)) Σ_{i=0}^{m−1} (yi − µ̂)²                                 (3.69)

Note, this textbook uses θ̂ to indicate an estimator for parameter θ, regardless of whether it is a Maxi-
mum Likelihood (MLE) estimator. This substitution introduces more variability into the estimation of the
confidence interval and results in the Standard Normal distribution (z-distribution)

[ µ̂ − z*σ/√m, µ̂ + z*σ/√m ]                                               (3.70)
being replace by the Student’s t distribution

t∗ σ̂ t∗ σ̂
 
µ̂ − √ , µ̂ + √ (3.71)
m m
where z* and t* represent distances from zero, e.g., 1.96 or 2.09, that are large enough so that the analyst
is comfortable with the probability that they may be wrong.
The numerators for the interval half widths (ihw) are calculated by the following top-level functions in
Statistics.scala. The z sigma function is used for the z-distribution.
1 def z_sigma ( sig : Double , p : Double = .95) : Double =
2 val pp = 1.0 - (1.0 - p ) / 2.0 // e . g . , .95 --> .975 ( two tails )
3 val z = random . Quantile . normalInv ( pp )
4 z * sig
5 end z_sigma

The t sigma function is used for the t-distribution.


1 def t_sigma ( sig : Double , df : Int , p : Double = .95) : Double =
2 if df < 1 then { flaw ( " interval " , " must have at least 2 observations " ) ; return 0.0 }
3 val pp = 1.0 - (1.0 - p ) / 2.0 // e . g . , .95 --> .975 ( two tails )
4 val t = random . Quantile . studentTInv ( pp , df )
5 t * sig
6 end t_sigma

Does the probability you determined in the last example problem make any sense? Seemingly, if you took
several samples, only a certain percentage of them would have the population mean within their confidence
interval.
1 @ main def c o n f i d e n c e I n t e r v a l T e s t () : Unit =
2

3 val ( mu , sig ) = (70.0 , 8.0) // pop . mean and std dev


4 val m = 100 // sample size
5 val rm = sqrt ( m )
6 val rvg = Normal ( mu , sig * sig ) // Normal random variate
7 var count_z , count_t = 0
8

9 for it <- 1 to 100 do // test several datasets


10 val y = VectorD ( for i <- 0 until m yield rvg . gen ) // sample from Normal dist

11 val ( mu_ , sig_ ) = ( y . mean , y . stdev ) // sample mean and std dev
12

13 val ihw_z = z_sigma ( sig_ ) / rm // interval half width : z


14 val ci_z = ( mu_ - ihw_z , mu_ + ihw_z ) // z - confidence interval
15         println (s"mu = $mu in ci_z = $ci_z?")
16 if mu in ci_z then count_z + = 1
17

18 val ihw_t = t_sigma ( sig_ , m -1) / rm // interval half width : t


19         val ci_t = ( mu_ - ihw_t , mu_ + ihw_t )               // t - confidence interval
20         println (s"mu = $mu in ci_t = $ci_t?")
21 if mu in ci_t then count_t + = 1
22 end for
23

24     println (s"mu inside $count_z z-confidence intervals and $count_t t-confidence intervals")


25

26 end c o n f i d e n c e I n t e r v a l T e s t

Imports: scalation._, scalation.mathstat._, scalation.random._.


Try various values for m starting with m = 20. Compute percentages for both the t-distribution and the
z-distribution. Given the default confidence level used by ScalaTion is 0.95 (or 95%) what would you
expect your percentages to be?

3.11.3 Estimation for Discrete Outcomes/Responses


Explain why the probability mass function (pmf) for flipping a coin n times with the experiment resulting in
the discrete random variable y = k heads is given by the Binomial Distribution having unknown parameter
p, the probability of getting a head for any particular coin flip,
 
pn(k) = P(y = k) = C(n, k) p^k (1 − p)^{n−k}
i.e., y ∼ Binomial(n, p).
Now suppose an experiment is run and y = k, a fixed number, e.g., n = 100 and k = 60. For various
values of p, plot the following function.
 
L(p) = C(n, k) p^k (1 − p)^{n−k}
What value of p maximizes the function L(p)? The function L(p) is called the Likelihood function and it is
used in Maximum Likelihood Estimation (MLE) [139].
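A sketch of such a plot using the VectorD and Plot classes shown earlier is given below. The binomial
coefficient is a constant scale factor, so it is omitted here; the location of the maximum is unchanged.

import scalation.mathstat.{Plot, VectorD}

@main def likelihoodPlot (): Unit =
    val (n, k) = (100, 60)
    val p = VectorD.range (1, 100) / 100.0                            // grid p = 0.01, 0.02, ..., 0.99
    val l = p.map (pp => math.pow (pp, k) * math.pow (1.0 - pp, n - k))   // L(p) up to a constant
    new Plot (p, l)                                                   // the peak should be near p = k/n = 0.6
end likelihoodPlot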
The VectorD class provides methods for computing statistics on vectors.

3.12 Entropy

The entropy of a discrete random variable y with probability mass function (pmf) py (y) is the negative of
the expected value of the log of the probability.

H(y) = H(py) = −E[log2 py] = −Σ_{y∈Dy} py(y) log2 py(y)                   (3.72)

The following single loop is used in ScalaTion to compute entropy.

1 def entropy ( px : VectorD ) : Double =


2 var sum = 0.0
3 for p <- px if p > 0.0 do sum -= p * log2 ( p )
4 sum
5 end entropy

For finite domains of size k = |Dy |, entropy H(y) ranges from 0 to log2 (k). Low entropy (close to 0) means
that there is low uncertainty/risk in predicting an outcome of an experiment involving the random variable
y, while high entropy (close to log2 k) means that there is high uncertainty/risk in predicting an outcome of
such an experiment. For binary classification (k = 2), the upper bound on entropy is 1.
The entropy may be normalized by setting the base of the logarithm to the size of the domain k, in which
case, the entropy will be in the interval [0, 1].

Hk(y) = Hk(py) = −E[logk py] = −Σ_{y∈Dy} py(y) logk py(y)

A random variable y ∼ Bernoulli(p) may be used to model the flip of a single coin that has a probability of
success/head (1) of p. Its pmf is given by the following formula.

p(y) = p^y (1 − p)^{1−y}

The pmf py can be captured in a probability vector py

H(y) = H(py) = H(py) = H([p, 1 − p]) = −p log2 p − (1 − p) log2(1 − p)

The figure below plots the entropy H([p, 1 − p]) as probability of a head p ranges from 0 to 1.

[Figure: entropy H([p, 1 − p]) vs. the probability of a head p]

A random variable y = z1 + z2 where z1 , z2 are distributed as Bernoulli(p) may be used to model the
sum of flipping two coins.

H(y) = H(py ) = H([p2 , 2p(1 − p), (1 − p)2 ])

See the exercises for how to extend entropy to continuous random variables.

3.12.1 Positive Log Probability


Entropy can also be expressed in terms of positive log probability [57] (or plog).

plog(y) = − log2 py (3.73)

Then entropy is simply the expected values of the plog.

H(y) = E [plog(y)] (3.74)

The concept of plog can also be used in place of probability and offers several advantages: (1) multiplying
many small probabilities may lead to round off error or underflow; (2) independence leads to addition of
plog values rather than multiplication of probabilities; and (3) its relationship to log-likelihood in Maximum
Likelihood Estimation.

plog(x) = Σ_j plog(xj)   for independent random variables                 (3.75)

The greater the plog, the less likely the occurrence, e.g., the plog of rolling snake eyes (1, 1) with two dice is
about 5.17, while the plog of rolling a 7 is about 2.58. Note, probabilities 1 and .5 give plogs of 0 and 1, respectively.
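These values can be computed with the plog method of the Probability object (assuming it is imported from
the scalation.mathstat package, as used earlier for toProbability).

import scalation.mathstat.Probability.plog

println (s"plog of snake eyes = ${plog (1.0 / 36.0)}")                // -log2 (1/36) = 5.17
println (s"plog of a seven    = ${plog (6.0 / 36.0)}")                // -log2 (6/36) = 2.58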

3.12.2 Joint Entropy
Entropy may be defined for multiple random variables as well. Given two discrete random variables, x, y,
with a joint pmf px,y (x, y) the joint entropy is defined as follows:
H(x, y) = H(px,y) = −E[log2 px,y] = −Σ_{x∈Dx} Σ_{y∈Dy} px,y(x, y) log2 px,y(x, y)          (3.76)

3.12.3 Conditional Entropy


Replacing the joint pmf with the conditional pmf gives conditional entropy.

H(x|y) = H(px|y) = −E[log2 px|y] = −Σ_{x∈Dx} Σ_{y∈Dy} px,y(x, y) log2 px|y(x, y)           (3.77)

Suppose an experiment involves two random variables x and y. Initially, the overall entropy is given by the
joint entropy H(x, y). Now, partial evidence allows the value of y to be determined, so the overall entropy
should decrease by y’s entropy.

H(x|y) = H(x, y) − H(y) (3.78)

When there is no dependency between x and y (i.e., they are independent), H(x, y) = H(x) + H(y)), so

H(x|y) = H(x) (3.79)

At the other extreme, when there is full dependency (i.e., the value of x can be determined from the value
of y).

H(x|y) = 0 (3.80)

3.12.4 Relative Entropy


Relative entropy, also known as Kullback-Leibler (KL) divergence, measures the dissimilarity between two
probability distributions.

Discrete Random Variable

Given a discrete random variable, y, with two candidate probability mass functions (pmfs) py(y) and qy(y),
the relative entropy is defined as follows:
 
H(py||qy) = E[log2(py/qy)] = Σ_{y∈Dy} py(y) log2( py(y) / qy(y) )          (3.81)

One way to look at relative entropy is that it measures the uncertainty that is introduced by replacing
the true/empirical distribution py with an approximate/model distribution qy . If the distributions are
identical, then the relative entropy is 0, i.e., H(py ||py ) = 0. The larger the value of H(py ||qy ) the greater
the dissimilarity between the distributions py and qy .
As an example, assume the true distribution for a coin is [.6, .4], but it is thought that the coin is fair
[.5, .5]. The relative entropy is computed as follows:

H(py||qy) = .6 log2(.6/.5) + .4 log2(.4/.5) = 0.029
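The same value can be obtained with the rentropy method of the Probability object; the sketch below
assumes a varargs VectorD constructor for building the two probability vectors.

import scalation.mathstat.VectorD
import scalation.mathstat.Probability.rentropy

val py = VectorD (0.6, 0.4)                                           // true/empirical distribution
val qy = VectorD (0.5, 0.5)                                           // assumed/model distribution
println (s"H(py||qy) = ${rentropy (py, qy)}")                         // about 0.029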

Continuous Random Variable

Given a continuous random variable, y, with two candidate probability density functions (pdfs) fy(y) and
gy(y), the relative entropy is defined as follows:

H(fy||gy) = E[log2(fy/gy)] = ∫_{Dy} fy(y) log2( fy(y) / gy(y) ) dy         (3.82)

Maximum Likelihood Estimation

In this subsection, we examine the relationship between KL Divergence and Maximum Likelihood. Consider
the dissimilarity of an empirical distribution pdata(x, y) and a model generated distribution pmod(x, y, b).

 
H(pdata(x, y)||pmod(x, y, b)) = E[ log( pdata(x, y) / pmod(x, y, b) ) ]                            (3.83)
                              = Σ_{i=0}^{m−1} pdata(xi, yi) log( pdata(xi, yi) / pmod(xi, yi, b) )  (3.84)

Note, that pdata (xi , yi ) is unaffected by the choice of parameters b, so it represents a constant C.

H(pdata(x, y)||pmod(x, y, b)) = C − Σ_{i=0}^{m−1} pdata(xi, yi) log pmod(xi, yi, b)        (3.85)
The probability for the ith data instance is 1/m, thus

H(pdata(x, y)||pmod(x, y, b)) = C − (1/m) Σ_{i=0}^{m−1} log pmod(xi, yi, b)                (3.86)

The second term is the negative log-likelihood (see the Chapter on Generalized Linear Models for details).

3.12.5 Cross Entropy


Relative entropy can be adjusted to capture overall entropy by adding the entropy of py .

H(py × qy ) = H(py ) + H(py ||qy ) (3.87)

It is the sum of the entropy of the empirical distribution and the model distribution’s relative entropy to the
empirical distribution. It can be calculated using the following formula (see exercises for details):
H(py × qy) = −Σ_{y∈Dy} py(y) log2 qy(y)                                   (3.88)

Since cross entropy is more efficient to calculate than relative entropy, it is a good candidate as a loss function
for machine learning algorithms. The smaller the cross entropy, the more the model (e.g., Neural Network)
agrees with the empirical distribution (dataset). The formula looks like the one for ordinary entropy with
qy substituted in as the argument for the log function. Hence the name cross entropy.

3.12.6 Mutual Information
Recall that if x and y are independent, then for all x ∈ Dx and y ∈ Dy ,

px,y (x, y) = px (x) py (y)


A possibly more interesting and practical question is to measure how close two random variables are to being
independent. One approach is to look at the covariance (or correlation) between x and y.
C[x, y] = E[(x − µx)(y − µy)] = Σ_{x∈Dx} Σ_{y∈Dy} (x − µx)(y − µy) px,y(x, y)

If x and y are independent, then


C[x, y] = [ Σ_{x∈Dx} (x − µx) px(x) ] [ Σ_{y∈Dy} (y − µy) py(y) ] = 0

An alternative is to look at the relative entropy of px,y and px py .


 
H(px,y||px py) = E[ log2( px,y / (px py) ) ] = Σ_{x∈Dx} Σ_{y∈Dy} px,y(x, y) log2( px,y(x, y) / (px(x) py(y)) )     (3.89)

The relative entropy (KL divergence) of the joint distribution to the product of the marginal distributions
is referred to as mutual information.

I(x; y) = H(px,y ||px py ) (3.90)


The following double loop is used in ScalaTion to compute mutual information.
1 def muInfo ( pxy : MatrixD , px : VectorD , py : VectorD ) : Double =
2 var sum = 0.0
3 for i <- pxy . indices ; j <- pxy . indices2 do
4 val p = pxy (i , j )
5 if p > 0.0 then sum + = p * log2 ( p / ( px ( i ) * py ( j ) ) )
6 end for
7 sum
8 end muInfo

As with covariance (or correlation) mutual information will be zero when x and y are independent. While
independence implies zero covariance, independence is equivalent to zero mutual information. Mutual infor-
mation is symmetric and non-negative. See the exercises for additional comparisons between covariance/-
correlation and mutual information.
While mutual information measures the dependence between two random variables, relative entropy (KL
divergence) measures the dissimilarity of two distributions.
Mutual Information corresponds to Information Gain, i.e., the drop in entropy of one random variable
due to knowledge of the value of the other random variable.

I(x; y) = H(x) − H(x|y) = H(y) − H(y|x) (3.91)


The Probability object in the scalation.mathstat package provides methods to compute probabilities
from frequencies, compute joint, marginal, conditional and log probabilities, as well as entropy, normalized
entropy, relative entropy, cross entropy, and mutual information.

3.12.7 Probability Object

Class Methods:
1 object Probability :
2

3 def isProbability ( px : VectorD ) : Boolean = px . min >= 0.0 && abs ( px . sum - 1.0) < EPSILON
4 def isProbability ( pxy : MatrixD ) : Boolean = pxy . mmin >= 0.0 && abs ( pxy . sum - 1.0) <
EPSILON
5 def freq ( x : VectorI , vc : Int , y : VectorI , k : Int ) : MatrixD =
6 def freq ( x : VectorI , y : VectorI , k : Int , vl : Int ) : ( Double , VectorI ) =
7 def freq ( x : VectorD , y : VectorI , k : Int , vl : Int , cont : Boolean ,
8 def count ( x : VectorD , vl : Int , cont : Boolean , thres : Double ) : Int =
9 def toProbability ( nu : VectorI ) : VectorD = nu . toDouble / nu . sum . toDouble
10 def toProbability ( nu : VectorI , n : Int ) : VectorD = nu . toDouble / n . toDouble
11 def toProbability ( nu : MatrixD ) : MatrixD = nu / nu . sum
12 def toProbability ( nu : MatrixD , n : Int ) : MatrixD = nu / n . toDouble
13 def probY ( y : VectorI , k : Int ) : VectorD = y . freq ( k ) . _2
14 def jointProbXY ( px : VectorD , py : VectorD ) : MatrixD = outer ( px , py )
15 def margProbX ( pxy : MatrixD ) : VectorD =
16 def margProbY ( pxy : MatrixD ) : VectorD =
17 def condProbY_X ( pxy : MatrixD , px_ : VectorD = null ) : MatrixD =
18 def condProbX_Y ( pxy : MatrixD , py_ : VectorD = null ) : MatrixD =
19 inline def plog ( p : Double ) : Double = - log2 ( p )
20 def plog ( px : VectorD ) : VectorD = px . map ( plog ( _ ) )
21 def entropy ( px : VectorD ) : Double =
22 def entropy ( nu : VectorI ) : Double =
23 def entropy ( px : VectorD , b : Int ) : Double =
24 def nentropy ( px : VectorD ) : Double =
25 def rentropy ( px : VectorD , qx : VectorD ) : Double =
26 def centropy ( px : VectorD , qx : VectorD ) : Double =
27 def entropy ( pxy : MatrixD ) : Double =
28 def entropy ( pxy : MatrixD , px_y : MatrixD ) : Double =
29 def muInfo ( pxy : MatrixD , px : VectorD , py : VectorD ) : Double =
30 def muInfo ( pxy : MatrixD ) : Double = muInfo ( pxy , margProbX ( pxy ) , margProbY ( pxy ) )

For example, the following freq method is used by Naı̈ve Bayes Classifiers. It computes the Joint Frequency
Table (JFT) for all value combinations of vectors x and y by counting the number of cases where xi = v
and yi = c.
1 @ param x the variable / feature vector
2 @ param vc the number of distinct values in vector x ( value count )
3 @ param y the response / classif ication vector
4 @ param k the maximum value of y + 1 ( number of classes )
5

6 def freq ( x : VectorI , vc : Int , y : VectorI , k : Int ) : MatrixD =


7 val jft = new MatrixD ( vc , k )
8 for i <- x . indices do jft ( x ( i ) , y ( i ) ) + = 1
9 jft
10 end freq

3.13 Exercises
Several random number and random variate generators can be found in ScalaTion’s random package. Some
of the following exercises will utilize these generators.

1. Let the random variable h be the number heads when two coins are flipped. Determine the following
conditional probability: P (h = 2|h ≥ 1).

2. Prove Bayes Theorem.

P(A|B) = P(B|A) P(A) / P(B)

3. Compute the mean and variance for the Bernoulli Distribution with success probability p.

py(y) = p^y (1 − p)^{1−y}   for y ∈ {0, 1}

4. Use the Randi random variate generator to run experiments to check the pmf and CDF for rolling two
dice.
1 import scalation . mathstat . _
2 import scalation . random . Randi
3

4 @ main def diceTest () : Unit =


5 val dice = Randi (1 , 6)
6 val x = VectorD . range (0 , 13)
7 val freq = new VectorD (13)
8 for i <- 0 until 10000 do
9 val sum = dice . igen + dice . igen
10 freq ( sum ) + = 1
11 end for
12 new Plot (x , freq )
13 end diceTest

5. Show that the variance may be written as follows:

V[y] = E[(y − E[y])²] = E[y²] − E[y]²

6. Show that the covariance may be written as follows:

C [x, y] = E [(x − E [x])(y − E [y])] = E [xy] − E [x] E [y]

7. Show that the covariance of two independent, continuous random variables, x and y, is zero.

C[x, y] = E[(x − µx)(y − µy)] = ∫_{Dy} ∫_{Dx} (x − µx)(y − µy) fxy(x, y) dx dy

where µx = E [x] and µy = E [y].

8. Derive the formula for the expectation of the sum of random variables.

E [x1 + x2 ] = E [x1 ] + E [x2 ]

9. Derive the formula for the variance of the sum of random variables.

V[x1 + x2] = V[x1] + V[x2] + 2 C[x1, x2]

Hint: use V[x1 + x2] = E[(x1 + x2)²] − E[x1 + x2]²

10. Use the Uniform random variate generator and the Histogram class to run experiments illustrating
the Central Limit Theorem (CLT).
1 import scalation . mathstat . _
2 import scalation . random . Uniform
3

4 @ main def cLTTest () : Unit =


5

6 val rg = Uniform ()
7 val x = VectorD ( for i <- 0 until 100000 yield rg . gen + rg . gen + rg . gen + rg . gen )
8 new Histogram ( x )
9

10 end cLTTest

Try with other distributions such as Exponential.

11. Chi-square distribution: Show that if z ∼ Normal(0, 1), then

z² ∼ χ²_1

12. Student’s t distribution: Show that if z ∼ Normal(0, 1) and v ∼ χ2k , then

z / √(v/k) ∼ t_k

13. F distribution: Show that if u ∼ χ2k1 and v ∼ χ2k2 , then

(u/k1) / (v/k2) ∼ F_{k1,k2}

14. Run the confidenceIntervalTest main function (see the Confidence Interval section) for values of
m = 20 to 40, 60, 80 and 100. Report the confidence interval and the number of cases when the true
value was inside the confidence interval for (a) the z-distribution and (b) the t-distribution. Explain.

15. Given three random variables such that x ⊥ y | z, show that

Fx|y,z (x, y, z) = Fx|z (x, z)

16. Show that formula for computing the joint probability mass function (pmf) for the 3-by-3 grid of
weights is correct. Hint: Add/subtract rectangular regions of the grid and make sure nothing is double
counted.

17. Show for k = 2 where pp = [p, 1 − p], that H(pp) = −p log2(p) − (1 − p) log2(1 − p). Plot the entropy
H(pp) versus p.
1 val p = VectorD.range (1, 100) / 100.0
2 val h = p.map (p => -p * log2 (p) - (1 - p) * log2 (1 - p))
3 new Plot (p, h)

18. Plot the entropy H and normalized entropy Hk for the first 16 Binomial(p, n) distributions, i.e., for
the number of coins n = 1, . . . , 16. Try with p = .6 and p = .5.

19. Entropy can be defined for continuous random variables. Take the definition for discrete random
variables and replace the sum with an integral and the pmf with a pdf. Compute the entropy for
y ∼ Uniform(0, 1).

20. Using the summation formulas for entropy, relative entropy and cross entropy, show that cross entropy
is the sum of entropy and relative entropy.

21. Show that mutual information equals the sum of marginal entropies minus the joint entropy, i.e.,

I(x; y) = H(x) + H(y) − H(x, y)

22. Compare correlation and mutual information in terms of how well they measure dependence between
random variables x and y. Try various functional relationships: negative exponential, reciprocal,
constant, logarithmic, square root, linear, right-arm quadratic, symmetric quadratic, cubic, exponential
and trigonometric.

y = f(x) + ε

Other types of relationships are also possible. Try various constrained mathematical relations: circle,
ellipse and diamond.

f(x, y) + ε = c

What happens as the noise ε increases?

23. Consider an experiment involving the roll of two dice. Let x indicate the value of dice 1 and x2 indicate
the value of dice 2. In order to examine dependency between random variables, define y = x + x2.
The joint pmf px,y can be recorded in a 6-by-11 matrix that can be computed from the following
feasible occurrence matrix (0 → cannot occur, 1 → can occur), since all the non-zero probabilities are
the same (equal likelihood).
1 // X - dice 1: 1 , 2, 3, 4, 5, 6
2 // X2 - dice 2: 1 , 2, 3, 4, 5, 6
3 // Y = X + X2 : 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12
4 val nuxy = MatrixD ((6 , 11) , 1 , 1 , 1 , 1 , 1 , 1 , 0 , 0 , 0 , 0 , 0 ,

5 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
6 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
7 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
8 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
9 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

Use methods in the Probability object to compute the joint, marginal and conditional probability
distributions, as well as the joint, marginal, conditional and relative entropy, and mutual information.
Explore the independence between random variables x and y.

24. Convolution. The convolution operator may be applied to vectors as well as functions (including
mass and density functions). Consider two vectors c ∈ Rm and x ∈ Rn . Without loss of generality let
m ≤ n, then their convolution is defined as follows:

 
y = c ⋆ x = [ yk = Σ_{j=0}^{m−1} cj x_{k−j} ]_{k=0,...,m+n−2}             (3.92)

Compute the (’full’) convolution of c and x.

y = c ⋆ x = [1, 2, 3, 4, 5] ⋆ [1, 2, 3, 4, 5, 6, 7]                       (3.93)

Note, there are also ’same’ and ’valid’ versions of convolution operators.

25. Consider a distribution with density on the interval [0, 2]. Let the probability density function (pdf)
for this distribution be the following:

fy(y) = y/2   on [0, 2]
(i) Draw/plot the pdf fy (y) vs. y for the interval [0, 2].
(ii) Determine the Cumulative Distribution Function (CDF), Fy (y).
(iii) Draw/plot the CDF Fy (y) vs y for the interval [0, 2].
(iv) Determine the expected value of the Random Variable (RV) y, i.e., E [y].

26. Take the limit of the difference quotient of monomial xn to show that

d/dx x^n = n x^{n−1}
Recall the definition of derivative as the limit of the difference quotient.

d/dx f(x) = lim_{h→0} ( f(x + h) − f(x) ) / h
Recall the notations due to Leibniz, Lagrange, and Euler.

d/dx f(x) = f′(x) = Dx f(x)

27. Take the integral and then the derivative of the monomial xn to show that
d/dx ∫ x^n dx = x^n

3.14 Further Reading
1. Probability and Mathematical Statistics [163].

2. Entropy, Relative Entropy and Mutual Information [35].

3.15 Notational Conventions
With respect to random variables, vectors and matrices, the following notational conventions shown in Table
3.1 will be used in this book.

Table 3.1: Notational Conventions Followed

variable type case font color


scalar lower italics black
vector lower bold black
matrix upper italics black
tensor upper bold black
random scalar lower italics blue
random vector lower bold blue

Built on the Functional Programming features in Scala, ScalaTion support several function types:
1 type FunctionS2S = Double = > Double // function of a scalar
2 type FunctionS2V = Double = > VectorD // vector - valued function of a scalar
3

4 type FunctionV2S = VectorD = > Double // function of a vector


5 type FunctionV2V = VectorD = > VectorD // vector - valued function of a vector
6 type FunctionV2M = VectorD = > MatrixD // matrix - valued function of a vector
7

8 type FunctionM2V = MatrixD = > VectorD // vector - valued function of a matrix


9 type FunctionM2M = MatrixD = > MatrixD // matrix - valued function of a matrix

These function types are defined in the scalation and scalation.mathstat packages. A scalar-valued
function type ends in ’S’, a vector-valued function type ends in ’V’, and a matrix-valued function type ends
in ’M’.
Mathematically, the scalar-valued functions are denoted by a symbol, e.g., f .

S2S function f : R → R
V2S function f : Rn → R

Mathematically, the vector-valued functions are denoted by a bold symbol, e.g., f .

S2V function f : R → Rn
V2V function f : Rm → Rn
M2V function f : Rm×p → Rn

Mathematically, the matrix-valued functions are denoted by a bold symbol, e.g., f .

110
V2M function f : Rp → Rm×n
M2M function f : Rp×q → Rm×n

3.16 Model
Models are about making predictions: given certain properties of a car, predict the car's mileage; given
recent performance of a stock index fund, forecast its future value; or given a person's credit report, classify
them as either likely or not likely to repay a loan. The thing that is being predicted, forecasted
or classified is referred to as the response/output variable, call it y. In many cases, the "given something" is
either captured by other input/feature variables collected into a vector, call it x,

    y = f (x; b) + ε                                                     (3.94)

or by previous values of y. Some functional form f is chosen to map input vector x into a predicted value
for response y. The last term indicates the difference between actual and predicted values, i.e., the residuals
ε. The function f is parameterized and often these parameters can be collected into a vector (or matrix) b.
If values for the parameter vector b are set randomly, the model is unlikely to produce accurate pre-
dictions. The model needs to be trained by collecting a dataset, i.e., several (m) instances of (xi , yi ), and
optimizing the parameter vector b to minimize some loss function, such as mean squared error (mse),

    mse = (1/m) ‖y − ŷ‖²                                                 (3.95)
where y is the vector from all the response instances and ŷ = f (X; b) is the vector of predicted response
values and X is the matrix formed from all the input/feature vector instances.
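In code, the mse of Equation 3.95 amounts to a single line, given the actual and predicted response vectors. The sketch below uses the VectorD operations normSq and dim that appear in the diagnose method later in this section:

    val e   = y - yp                   // residual/error vector (y actual, yp predicted)
    val mse = e.normSq / y.dim         // (1/m) ||y - yp||^2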

Estimation Procedures

Although there are many types of parameter estimation procedures, this text only utilizes the three most
commonly used procedures [14].

Table 3.2: Estimation Procedures

Procedure Full Name Inventor


LSE Least Squares Estimation Gauss
MoM Method of Moments Pearson
MLE Maximum Likelihood Estimation Fisher

The method of moments develops equations that relate the moments of a distribution to the parameters of the
model, in order to create estimates for the parameters. Least Squares Estimation takes the sum of squared
errors and sets the parameter values to minimize this sum. It has three main varieties: Ordinary Least
Squares (OLS), Weighted Least Squares (WLS), and Generalized Least Squares (GLS). Finally, Maximum
Likelihood Estimation sets the parameter values so that the observed data are as likely as possible. The easiest
way to think about this is to imagine that one wants to create a generative model (a model that generates
data). One would want to set the parameters of the model so it generates data that looks like the given
dataset.
Setting of parameters is done by solving a system of equations for the simpler models, or by using an
optimization algorithm for more complex models.
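For example, for a linear model y = Xb + ε, Least Squares Estimation has the familiar closed-form solution shown below (a standard result stated here only for orientation, assuming XᵀX is invertible; derivations appear in the Prediction chapter):

    \hat{b} \;=\; \arg\min_{b} \, \| y - X b \|^2 \;=\; (X^{\top} X)^{-1} X^{\top} y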

Quality of Fit (QoF)

After a model is trained, its Quality of Fit (QoF) should be evaluated. One way to perform the evaluation
is to train the model on the full dataset and test as well on the full dataset. For complex models with many
parameters, over-fitting will likely occur. Then the excellent evaluation is unlikely to be reproduced when the
model is applied in the real world. To avoid overly optimistic evaluations due to over-fitting, it is common
to divide a dataset (X, y) into a training dataset and a testing dataset, where training is conducted on the
training dataset (Xr, yr) and evaluation is done on the test dataset (Xe, ye). The conventions used in this
book for the full, training and test datasets are shown in Table 3.3.

Table 3.3: Convention for Datasets

Math Symbol Code Description


X x full data/input matrix
X x training data/input matrix (maybe full)
Xe xe test data/input matrix (maybe full)
y y full response/output vector
y y training response/output vector (maybe full)
ye ye test response/output vector (maybe full)

Note, when training and testing on the full dataset, the training and test datasets are actually the same, i.e.,
they are the full dataset. If a model has many parameters, the Quality of Fit (QoF) found from training
and testing on the full dataset should be suspect. See the section on cross-validation for more details.
In ScalaTion, the Model trait serves as the base trait for all the modeling techniques in the modeling
package and its sub-packages classifying, clustering, fda, forecasting, and recommender.

Model Trait

Trait Methods:
1 trait Model :
2
3     def getFname : Array [String]
4     def train (x_ : MatrixD, y_ : VectorD) : Unit
5     def test (x_ : MatrixD, y_ : VectorD) : (VectorD, VectorD)
6     def predict (z : VectorD) : Double | VectorD
7     def hparameter : HyperParameter
8     def parameter : VectorD | MatrixD
9     def report (ftVec : VectorD) : String =
10    def report (ftMat : MatrixD) : String =

The getFname method returns the predictor variable/feature names in the model. The train method
will use a training or full dataset to train the model, i.e., optimize its parameter vector b to minimize a
given loss function. After training, the quality of the model may be assessed using the test method. The
evaluation may be performed on a test or full dataset. Finally, information about the model may be extracted

by the following three methods: (1) hparameter showing the hyper-parameters, (2) parameter showing the
parameters, and (3) report showing the hyper-parameters, the parameters, and the Quality of Fit (QoF) of
the model. Note, hyper-parameters are used by some modeling techniques to influence either the result or
how the result is obtained.
Classes that implement (directly or indirectly) the Model trait should default x and xe to the full
data/input matrix x, and y and ye to the full response/output vector y that are passed into the class
constructor.
Implementations of the train method take a training data/input matrix x and a training response/output
vector y and optimize the parameter vector b to, for example, minimize error or maximize
likelihood. Implementations of the test method take a test data/input matrix xe and the corresponding
test response/output vector ye to compute errors and evaluate the Quality of Fit (QoF). Note that with
cross-validation (to be explained later), there will be multiple training and test datasets created from one
full dataset. Implementations of the hparameter method simply return the hyper-parameter vector hparam,
while implementations of the parameter method simply return the optimized parameter vector b. (The
fname and technique parameters for Regression are the feature names and the solution/optimization
technique used to estimate the parameter vector, respectively.)
Associated with the Model trait is the FitM trait that provides QoF measures common to all types of
models. For prediction, Fit extends FitM with several additional QoF measures that are explained in
the Prediction Chapter. Similarly, FitC extends FitM for classification models.

FitM Trait

Trait Methods:
1 trait FitM :
2
3     def sse_ : Double = sse
4     def rSq_ : Double = rSq                              // using mean
5     def rSq0_ : Double = rSq0                            // using 0
6     def diagnose (y : VectorD, yp : VectorD, w : VectorD = null) : VectorD =
7     def fit : VectorD
8     def help : String
9     def summary (x_ : MatrixD, fname : Array [String], b : VectorD,
10                 vifs : VectorD = null) : String

The diagnose method takes the actual response/output vector y and the predictions from the model yp
and calculates the basic QoF measures.
1 @param y   the actual response/output vector to use (test/full)
2 @param yp  the predicted response/output vector (test/full)
3 @param w   the weights on the instances (defaults to null)
4
5 def diagnose (y : VectorD, yp : VectorD, w : VectorD = null) : VectorD =
6     m = y.dim                                            // size of response vector
7     if m < 2 then flaw ("diagnose", s"requires at least 2 responses to evaluate m = $m")
8     if yp.dim != m then flaw ("diagnose", s"yp.dim = ${yp.dim} != y.dim = $m")
9
10    val mu = y.mean                                      // mean of y (may be zero)
11    val e  = y - yp                                      // residual/error vector
12    sse    = e.normSq                                    // sum of squares for error
13    if w == null then
14        sst = y.cnormSq                                  // sum of squares total
15        ssr = sst - sse                                  // sum of squares model
16    else
17        ssr = (w * (yp - (w * yp / w.sum).sum) ~^ 2).sum // regression sum of squares
18        sst = ssr + sse
19    end if
20
21    mse0 = sse / m                                       // raw/MLE mean squared error
22    rmse = sqrt (mse0)                                   // root mean squared error
23    mae  = e.norm1 / m                                   // mean absolute error
24    rSq  = ssr / sst                                     // R^2 using mean
25    rSq0 = 1 - sse / y.normSq                            // R^2 using 0
26    e                                                    // returns error vector
27 end diagnose

Note, ~^ is the exponentiation operator provided in ScalaTion, where the first character is ~ to give the
operator higher precedence than multiplication (*).
One of the measures is based on absolute errors, Mean Absolute Error (MAE), and is computed as the
ℓ1 norm of the error vector divided by the number of elements in the response vector (m). The rest are
based on squared values. Various squared ℓ2 norms may be taken to compute these quantities, i.e., sst =
y.cnormSq is the centered norm squared of y, while sse = e.normSq is the norm squared of e. Then ssr,
the sum of squares model/regression, is the difference. The idea is that one starts with the variation in
the response, some of which can be accounted for by the model, with the remaining part considered errors. As
models are less than perfect, what remains is better referred to as residuals, part of which a better model
could account for. The fraction of the variation accounted for by the model relative to the total variation is called
the coefficient of determination R² = ssr/sst ≤ 1. A measure that parallels MAE is the Root Mean Squared
Error (RMSE). It is typically higher, as a large squared term has more of an effect. Both are interpretable
as they are in the units of the response variable, e.g., imagine one hits a golf ball at 150 mph with an MAE
of 7 mph and an RMSE of 10 mph. Further explanations are given in the Prediction Chapter.
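In summary, the quantities computed by diagnose correspond to the following formulas (with m the number of responses, e = y − ŷ the error vector and ȳ the mean of y):

    sse = \|e\|^2, \quad sst = \|y - \bar{y}\,\mathbf{1}\|^2, \quad ssr = sst - sse,
    \quad R^2 = \frac{ssr}{sst}, \quad rmse = \sqrt{sse/m}, \quad mae = \|e\|_1 / m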

Chapter 4

Data Management

4.1 Introduction
Data Science relies on having large amounts of quality data. Collecting data and handling data quality issues
are of utmost importance. Without support from a system or framework, this can be very time-consuming
and error-prone. This chapter provides a quick overview of the support provided by ScalaTion for data
management.
In the era of big data, a variety of database management technologies have been proposed, including
those under the umbrella of Not-only-SQL (NoSQL). These technologies include the following:

• Key-value stores (e.g., Memcached). When the purpose of the data store is very rapid lookup rather than
advanced query capability, a key-value store may be ideal. They are often implemented as distributed
hash tables.

• Document-oriented databases (e.g., MongoDB). These databases are intended for storage and retrieval
of unstructured (e.g., text) and semi-structured (e.g., XML or JSON) data.

• Columnar databases (e.g., Vertica). Such databases are intended for structured data like traditional
relational databases, but better facilitate data compression and analytic operations. Data is stored
in columns rather than rows as in traditional relational databases.

• Graph databases (e.g., Neo4j). These make the implicit relationships (via foreign-key, primary-key
pairs) in relational databases explicit. A tuple in a relational database is mapped to a node in a graph
database, while an implicit relationship is mapped to an edge in a graph database. The database then
consists of a collection of directed graphs, each consisting of nodes and edges connecting the nodes. These
databases are particularly suited to social networks.

The purpose of these database technologies is to provide enhanced performance over traditional, row-oriented
relational databases, and each of the above is best suited to particular types of data.
Data management capabilities provided by ScalaTion include Relational Databases, Columnar Databases
and Graph Databases. All include extensions making them suitable as a Time Series DataBase (TSDB).
Graph databases are discussed in the Appendix.
Preprocessing of data should be done before applying analytics techniques to ensure they are working on
quality data. ScalaTion provides a variety of preprocessing techniques, as discussed in the next chapter.

4.1.1 Analytics Databases
In data science, it is convenient to collect data from multiple sources and store the data in a database.
Analytics databases are organized to support efficient data analytics.
A database supporting data science should make it easy and efficient to view and select data to be fed
into models. The structures supported by the database should make it easy to extract data to create vectors,
matrices and tensors that are used by data science tools and packages.
Multiple systems, including ScalaTion’s TSDB, are built on top of columnar, main memory databases
in order to provide high performance. ScalaTion’s TSDB is a Time Series DataBase that has built-in
capabilities for handling time series data. It is able to store non-time series data as well. It provides multiple
Application Programming Interfaces (APIs) for convenient access to the data [?].

4.1.2 The Tabular Trait


A common interface in the form of a Scala trait is provided for both Relational and Columnar Relational
databases. A Tabular database will have a name, a schema or array of attribute/column names, a domain
or array of domains/data types, and a key (primary key).
1 @ param name the name of the table
2 @ param schema the attributes for the table
3 @ param domain the domains / data - types for the attributes ( ’D ’ , ’I ’ , ’L ’ , ’S ’ , ’X ’ , ’T ’)
4 @ param key the attributes forming the primary key
5

6 trait Tabular [ T <: Tabular [ T ]] ( val name : String , val schema : Schema ,
7 val domain : Domain , val key : Schema )
8 extends Serializable :

For convenience, the following two Scala type definitions are utilized.
1 type Schema = Array [ String ]
2 type Domain = Array [ Char ]

Tabular structures are logically linked together via foreign keys. A foreign key is an attribute that
references a primary key in some table (typically another table). In ScalaTion, the foreign key specification
is added via the following method call after the Tabular structure is created.
1 def addForeignKey ( fkey : String , refTab : T ) : Unit

ScalaTion supports the following domains/data-types: ’D’ouble, ’I’nt, ’L’ong, ’S’tring, and ’T’imeNum.
1 ’D ’ - ‘ Double ‘ - ‘ VectorD ‘ - 64 bit double precision floating point number
2 ’I ’ - ‘Int ‘ - ‘ VectorI ‘ - 32 bit integer
3 ’L ’ - ‘ Long ‘ - ‘ VectorL ‘ - 64 bit long integer
4 ’S ’ - ‘ String ‘ - ‘ VectorS ‘ - variable length numeric string
5 ’T ’ - ‘ TimeNum ‘ - ‘ VectorT ‘ - time numbers for date - time

These data types are generalized into a ValueType as a Scala union type.
1 type ValueType = ( Double | Int | Long | String | TimeNum )
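Since ValueType is a Scala 3 union type, code that consumes a cell value can dispatch on its runtime type with a match expression. The helper below is only a sketch (the TimeNum case is left unimplemented, as its conversion API is not shown here):

    def asDouble (v: ValueType): Double = v match
        case d: Double  => d
        case i: Int     => i.toDouble
        case l: Long    => l.toDouble
        case s: String  => s.toDouble       // throws if the string is not numeric
        case t: TimeNum => ???              // conversion depends on TimeNum's API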

4.2 Relational Data Model
A relational database table may be built up as follows: A cell in the table holds an atomic value of type
ValueType. A tuple (or row) in the table is simply an array of ValueType. A relational Table consists of a
bag (or multi-set) of tuples. Each column in the Table is restricted to a particular domain. Note, uniqueness
of primary keys is enforced by creating a primary index.
1 type Tuple = Array [ ValueType ]

4.2.1 Data Definition Language


The Data Definition Language consists of a Table class constructor and associated apply methods in the
companion object. The following example illustrates the creation of four tables based on the example Bank
schema given in [74].
1 val customer = Table ( " customer " , " cname , street , ccity " ,
2 "S , S , S " , " cname " )
3 val branch = Table ( " branch " , " bname , assets , bcity " ,
4 "S , D , S " , " bname " )
5 val deposit = Table ( " deposit " , " accno , balance , cname , bname " ,
6 "I , D , S , S " , " accno " )
7 val loan = Table ( " loan " , " loanno , amount , cname , bname " ,
8 "I , D , S , S " , " loanno " )

4.2.2 Data Manipulation Language


As with many database systems, the Data Manipulation Language consists of methods for insertion, update
and deletion.
1 def add ( t : Tuple ) : Table =
2 def update ( atr : String , newVal : ValueType , matchVal : ValueType ) : Boolean =
3 def update ( atr : String , func : ValueType = > ValueType , matchVal : ValueType ) : Boolean =
4 def delete ( predicate : Predicate ) : Boolean =

Using the operator += as an alias for the add method, the following code may be used to populate the Bank
database.
1 customer += ("Peter", "Oak St", "Bogart")
2          += ("Paul", "Elm St", "Watkinsville")
3          += ("Mary", "Maple St", "Athens")
4 customer.show ()
5
6 branch += ("Alps", 20000000.0, "Athens")
7        += ("Downtown", 30000000.0, "Athens")
8        += ("Lake", 10000000.0, "Bogart")
9 branch.show ()
10
11 deposit += (11, 2000.0, "Peter", "Lake")
12         += (12, 1500.0, "Paul", "Alps")
13         += (13, 2500.0, "Paul", "Downtown")
14         += (14, 2500.0, "Paul", "Lake")
15         += (15, 3000.0, "Mary", "Alps")
16         += (16, 1000.0, "Mary", "Downtown")
17 deposit.show ()
18
19 loan += (21, 2200.0, "Peter", "Alps")
20      += (22, 2100.0, "Peter", "Downtown")
21      += (23, 1500.0, "Paul", "Alps")
22      += (24, 2500.0, "Paul", "Downtown")
23      += (25, 3000.0, "Mary", "Alps")
24      += (26, 1000.0, "Mary", "Lake")
25 loan.show ()

4.2.3 Relational Algebra


Relational Algebra provides a set of operators for writing queries on tables, including extracting columns
(project), rows (select), performing set operations on tables (union, minus, intersect and Cartesian product).
Several forms of join operations for composing a new table from two existing tables are provided as well as
division, group-by and order-by operations.

Table 4.1: Relational Algebra Operators (Tables r and r2)

Operator Unicode Signature


rename ρ def ρ (newName: String): T = rename (newName)
project π def π (x: String): T = project (strim (x))
project π def π (cPos: IndexedSeq [Int]): T = project (cPos)
selproject σπ def σπ (a: String, apred: APredicate): T = selproject (a, apred)
select σ def σ (a: String, apred: APredicate): T = select (a, apred)
select σ def σ (predicate: Predicate): T = select (predicate)
select σ def σ (condition: String): T = select (condition)
select σ def σ (pkey: KeyType): T = select (pkey)
union ∪ def ∪ (r2: T): T = union (r2)
minus − def − (r2: T): T = minus (r2)
intersect ∩ def ∩ (r2: T): T = intersect (r2)
product × def × (r2: T): T = product (r2)
join ./ def ./ (predicate: Predicate2, r2: T): T = join (predicate, r2)
join ./ def ./ (condition: String, r2: T): T = join (condition, r2)
join ./ def ./ (x: String, y: String, r2: T): T = join (strim (x), strim (y), r2)
join ./ def ./ (fkey: (String, T)): T = join (fkey)
join ./ def ./ (r2: T): T = join (r2)
leftJoin n def n (x: Schema, y: Schema, r2: T): T = leftJoin (x, y, r2)
rightJoin o def o (x: Schema, y: Schema, r2: T): T = rightJoin (x, y, r2)
divide / def / (r2: T): T = divide (r2)
groupBy γ def γ (ag: String): T = groupBy (ag)
aggregate F def F (ag: String, fas: (AggFunction, String)*): T = aggregate (ag, fas :_*)
orderBy ↑ def ↑ (x: String*): T = orderBy (x :_*)
orderByDesc ↓ def ↓ (x: String*): T = orderByDesc (x :_*)(true)

Fundamental Relational Algebra Operators

The following six relational algebra operators form the fundamental operators for ScalaTion’s table pack-
age and are shown in Table 4.1. They are fundamental in the sense that the rest of the operators, although
convenient, do not increase the power of the query language.

1. Rename Operator. The rename operator renames table customer to client.

customer ρ (“client”)

1 customer ρ ( " client " )

2. Project Operator. The project operator will return the specified columns in table customer.

πstreet, ccity (customer)

1 customer π ( " street , ccity " )

3. Select Operator. The select operator will return the rows that match the predicate in table customer.

σ_{ccity == ‘Athens’} (customer)

1 customer σ ( " ccity = = ’ Athens ’" )

4. Union Operator. The union operator will return the union of rows from deposit and loan. Duplicate
tuples may be eliminated by creating an index. For this operator the textbook syntax and ScalaTion
syntax are identical.

deposit ∪ loan

1 deposit ∪ loan

5. Minus Operator. The minus operator will return the rows from account (result of the union) that
are not in loan. For this operator the textbook syntax and ScalaTion syntax are identical.

account − loan

1 account - loan

6. Cartesian Product Operator. The product operator will return all combinations of rows in customer
with rows in deposit. For this operator the textbook syntax and ScalaTion syntax are identical.

customer × deposit

1 customer × deposit

Additional Relational Algebra Operators

The next eight operators, although not fundamental, are important operators in ScalaTion's table
package and are shown in Table 4.1.

1. Join Operator. In order to combine information from two tables, join operators are preferred over
products, as they are much more efficient and only combine related rows. ScalaTion’s table package
supports natural-join, equi-join, theta-join, left outer join, and right outer join, as shown below. For
each tuple in the left table, the equi-join pairs it with all tuples in the right table that match it on
the given attributes (in this case customer.bname = deposit.bname). The natural-join is an equi-
join on the common attributes in the two tables, followed by projecting away any duplicate columns.
The theta-join generalizes an equi-join by allowing any comparison operator to be used (in this case
deposit1 .balance < deposit2 .balance). The symbol for semi-join is adopted for outer joins as it is a
Unicode symbol. The left join keeps all tuples from the left (null padding if need be), while the right
join keeps all tuples from the right table.

customer ./ deposit natural − join


customer ./cname == cname deposit equi − join
deposit ./balance < balance deposit theta − join
customer n deposit left outer join
customer o deposit right outer join

1 customer ./ deposit
2 customer ./ ( " cname = = cname " , deposit )
3 deposit ./ ( " balance < balance " , deposit )
4 customer n deposit
5 customer o deposit

Additional forms of joins are also available in the Table class. Join is not fundamental as its result
can be made by combining product and select.

2. Divide Operator. For the query below, the divide operator will return the cnames of the cus-
tomers that have a deposit account at all branches (of course it would make sense to first select on the
branches).

πcname, bname (deposit)/πbname (branch)

1 deposit .π ( " cname , bname " ) / branch .π ( " bname " )

The divide operator requires the other attributes (in this case cname) in the left table to be paired up
with all the attribute values (in this case bname) in the right table.

3. Intersect Operator. The intersect operator will return the rows in account that are also in loan.
For this operator the textbook syntax and ScalaTion syntax are identical.

account ∩ loan

1 account ∩ loan

Intersection is not fundamental as its result can be made by successive minuses.

4. GroupBy Operator. The groupBy operator forms groups among the relation based on the equality
of attributes. The following example groups the tuples in the deposit table based on the value of the
bname attribute.

γbname (deposit)

1 deposit γ " bname "

5. Aggregate Operator. The aggregate operator returns values for the grouped-by attribute (e.g.,
bname) and applies aggregate operators on the specified columns (e.g., avg (balance)). Typically it is
called after the groupBy operator.

Fbname, count(accno), avg(balance) (deposit)

1 deposit F ( " bname " , ( count , " accno " ) , ( avg , " balance " ) )

6. OrderBy Operator. The orderBy operator effectively puts the rows into ascending order based on
the given attributes.

↑bname (deposit)

1 deposit ↑ " bname "

7. OrderByDesc Operator. The orderByDesc operator effectively puts the rows into descending order
based on the given attributes.

↓bname (deposit)

1 deposit ↓ " bname "

8. Select-Project Operator. The selproject is a combination operator added for convenience and
efficiency, especially for columnar relation databases (see the next section). As whole columns are
stored together, this operator only requires one column to be accessed.
1 customer σπ ( " ccity " , _ = = ’ Athens ’)

4.2.4 Example Queries
1. List the names of customers who live in the city of Athens.
1 val liveAthens = customer .σ ( " ccity = = ’ Athens ’" ) .π ( " cname " )
2 liveAthens . show ()

2. List the names of customers who live in Athens or bank (have deposits in branches located) in Athens.
1 val bankAthens = ( deposit ./ branch ) .σ ( " bcity = = ’ Athens ’" ) .π ( " cname " )
2 bankAthens . show ()

3. List the names of customers who live and bank in the same city.
1 val sameCity = ( customer ./ deposit ./ branch ) .σ ( " ccity = = bcity " ) .π ( " cname " )
2 sameCity . create_index ()
3 sameCity . show ()

4. List the names and account numbers of customers with the largest balance.
1 val largest = deposit .π ( " cname , accno " ) - ( deposit ./ ( " balance < balance " ,
deposit ) ) .π ( " cname , accno " )
2 largest . show ()

5. List the names of customers who are silver club members (have loans where they have deposits).
1 val silver = ( loan .π ( " cname , bname " ) ∩ deposit .π ( " cname , bname " ) ) .π ( " cname " )
2 silver . create_index ()
3 silver . show ()

6. List the names of customers who are gold club members (have loans only where they have deposits).
1 val gold = loan .π ( " cname " ) - ( loan .π ( " cname , bname " ) - deposit .π ( " cname , bname " )
) .π ( " cname " )
2 gold . create_index ()
3 gold . show ()

7. List the names of branches located in Athens.


1 val inAthens = branch .σ ( " bcity = = ’ Athens ’" ) .π ( " bname " )
2 inAthens . show ()

8. List the names of customers who have deposits at all branches located in Athens.
1 val allAthens = deposit .π ( " cname , bname " ) / inAthens
2 allAthens . create_index ()
3 allAthens . show ()

9. List the branch names and their average balances.


1 val avgBalance = deposit .γ ( " bname " ) . aggregate ( " bname " , ( count , " accno " ) , ( avg , "
balance " ) )
2 avgBalance . show ()

4.2.5 Persistence
Modern databases do much of the processing in main memory due to its large size and high speed. Although
main memory may be made persistent (e.g., using MRAM), it is typically volatile, meaning that if the power
is lost, so is the data. It is therefore essential to provide efficient mechanisms for making and maintaining the
persistence of data.
Traditional database management systems achieve this by having a persistent data store in non-volatile
storage (e.g., Hard-Disk Drives (HDD) or Solid-State Devices (SSD)) and a large database cache in main-
memory. Complex page management algorithms are used to ensure persistence and transactional correctness
(see the next subsection).
A simple way to provide persistence is to design the database management system to operate in main-
memory and then provide load and save methods that utilize built-in serialization to save to or load from
persistent storage. This is what ScalaTion does.
The load method will read a table with a given name into main-memory using serialization.
1 @ param name the name of the table to load
2

3 def load ( name : String ) : Table =


4 val ois = new O b j e c t I n p u t S t r e a m ( new Fi le In p ut St r ea m ( STORE_DIR + name + SER ) )
5 val tab = ois . readObject . asInstanceOf [ Table ]
6 ois . close ()
7 tab
8 end load

The save method will write the entire contents of this table into a file using serialization.
1 def save () : Unit =
2 val oos = new O b j e c t O u t p u t S t r e a m ( new F i l e O u t p u t S t r e a m ( STORE_DIR + name + SER ) )
3 oos . writeObject ( this )
4 oos . close ()
5 end save
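As a usage sketch (save is the method above; load is assumed here to be reachable as shown, e.g., from a companion object):

    customer.save ()                                // serialize the customer table to persistent storage
    val customer2 = Table.load ("customer")         // later, read it back into main memory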

For small databases, this approach is fine, but as databases become large, greater efficiency must be
sought. One cannot save a whole table every time there is a change. See the exercises for alternatives.

4.2.6 Transactions
The idea of a transaction is to bundle a sequence of operations into a meaningful action that one wants to
succeed, such as transferring money from one bank account to another.
Making the action a transaction has the main benefit of making it atomic, the action either completes
successfully (called a commit) or is completely undone having no effect on the database state (called a
rollback). The third option, a partially completed action in this case would lead to a bank customer losing
their money.
Making a transaction atomic can be achieved by maintaining a log. Operations can be written to the log
and then only saved once the transaction commits. If a transaction cannot commit, it must be rolled back.
There must also be a recovery procedure to handle the situation when volatile storage is lost. For this to
function, committed log records must be flushed to persistent storage.
A second important advantage of making an action a transaction is to protect it from other transactions,
so it can think of itself as if it is running in isolation. Rather than worrying about how other transactions

may corrupt the action, this worry is turned over to the database management system to handle. One
form of potential interference involves two transactions running concurrently and accessing the same bank
accounts. If one transaction accesses all the accounts first, there will be no corruption. Such an execution
of two transactions is called a serial execution (one transaction executes at a time). Unfortunately, modern
high-performance database management systems could not operate at the slow speed this would dictate.
Transactions must be run concurrently, not serially. The correctness condition called serializability allows
transactions to run with their concurrency controlled by a protocol that ensures their effects on the database
are equivalent to one of their slow-running, serially-executing cousin schedules. In other words, the fast
running serializable schedule for a set of transactions must be equivalent to some serial execution of the
same set of transactions. See the exercises for more details on equivalence (e.g., conflict and view equivalence)
and various concurrency control protocols that can be used to ensure correctness with minimal impact on
performance.

4.2.7 Table Class

1 @ param name the name of the table


2 @ param schema the attributes for the table
3 @ param domain the domains / data - types for the attributes ( ’D ’ , ’I ’ , ’L ’ , ’S ’ , ’X ’ , ’T ’)
4 @ param key the attributes forming the primary key
5

6 class Table ( override val name : String , override val schema : Schema ,
7 override val domain : Domain , override val key : Schema )
8 extends Tabular [ Table ] ( name , schema , domain , key )
9 with Serializable :

Internally, the Table class maintains a collection of tuples. Using a Bag allows for duplicates, if wanted.
Creating an index on the primary key will efficiently eliminate any duplicates. Foreign key relationships are
specified in linkTypes. It also provides a groupMap used by the groupBy operator.
The Table class supports three types of indices:
1. Primary Index. A unique index on the primary key (may be composite).
1 private [ table ] val index = IndexMap [ KeyType , Tuple ] ()

2. Secondary Unique Indices. A unique index on a single attribute (other than the primary key). For
example, a student id may be used as the primary key for a Student table, while email may also be required
to be unique. Since there can be multiple such indices a Map is used to name each index.
1 private [ table ] val sindex = Map [ String , IndexMap [ ValueType , Tuple ]] ()

3. Non-Unique Indices. When fast-lookup is required based on an attribute/column that is not required
to be unique (e.g., name) such an index may be used. Again, since there can be multiple such indices
a Map is used to name each index.
1 private [ table ] val mindex = Map [ String , MIndexMap [ ValueType , Tuple ]] ()

The following methods may be used to create the various types of indices: primary unique index, secondary
unique index, or non-unique index, respectively.

1 def create_index ( rebuild : Boolean = false ) : Unit =
2 def create_sindex ( atr : String ) : Unit =
3 def create_mindex ( atr : String ) : Unit =

The following factory method in the companion object provides a more convenient way to create a table.
The strim method splits a string into an array of strings based on a separation character and then trims
away any white-space.
1 def apply ( name : String , schema : String , domain_ : String , key : String ) : Table =
2 new Table ( name , strim ( schema ) , strim ( domain_ ) . map ( _ . head ) , strim ( key ) )
3 end apply

The following two classes extend the Table class in the direction of the Graph Data Model, see Appendix
C.

4.2.8 LTable Class

1 @ param name_ the name of the linkable - table


2 @ param schema_ the attributes for the linkable - table
3 @ param domain_ the domains / data - types for attributes ( ’D ’ , ’I ’ , ’L ’ , ’S ’ , ’X ’ , ’T ’)
4 @ param key_ the attributes forming the primary key
5

6 case class LTable ( name_ : String , schema_ : Schema , domain_ : Domain , key_ : Schema )
7 extends Table ( name_ , schema_ , domain_ , key_ )
8 with Serializable :

The LTable class (for Linked-Table) simply adds an explicit link from the foreign key to the primary key
that it references. For each tuple in a linked-table, add a link to the referenced table, so that the foreign key
is linked to the primary key. Caveat: LTable does not handle composite foreign keys. Although in general
primary keys may be composite, a foreign key is conceptualized as a column value and its associated link.
1 @ param fkey the foreign key column
2 @ param refTab the referenced table being linked to
3

4 def addLinks ( fkey : String , refTab : Table ) : Unit =

The LTable class makes many-to-one relationships/associations explicit and improves the efficiency of
the most common form of join operation which is based on equating a foreign key (fkey) to a primary key
(pkey). Without an index, these are performed using a Nested-Loop Join algorithm. The existence of an index
on the primary key allows a much more efficient Indexed Join algorithm to be utilized. The direct linkage
provides for additional speed up of such join operations (see the exercises for a comparison). Note that the
linkage is only in one direction, so joining from the primary key table to the foreign key table would require
a non-unique index on the foreign key column, or resorting to a slow nested loop join.
Note, the link and foreign key value are in some sense redundant. Removing the foreign key column is
possible, but may force the need for an additional join for some queries, so the database designer may wish
to keep the foreign key column. ScalaTion leaves this issue up to the database designer.
The next class moves further in the direction of the Graph Data Model.
4.2.9 VTable Class

1 @ param name_ the name of the vertex - table


2 @ param schema_ the attributes for the vertex - table

3 @ param domain_ the domains / data - types for attributes ( ’D ’ , ’I ’ , ’L ’ , ’S ’ , ’X ’ , ’T ’)
4 @ param key_ the attributes forming the primary key
5

6 case class VTable ( name_ : String , schema_ : Schema , domain_ : Domain , key_ : Schema )
7 extends Table ( name_ , schema_ , domain_ , key_ )
8 with Serializable :

The VTable class (for Vertex-Table) supports many-to-many relationships with efficient navigation in
both directions. Supporting this is much more involved than what is needed for LTable, but provides for
index-free adjacency, similar to what is provided by Graph Database systems.
The VTable model is graph-like in that it elevates tuples into vertices as first-class citizens of the data
model. However, edges are embedded inside of vertices and are there to establish adjacency. Edges do not
have labels, attributes or properties. Although this simplifies the data model and makes it more relation-like,
it is not set up to naturally support finding, for example, shortest paths.
The Vertex class extends the notion of Tuple into values stored in the tuple part, along with foreign
keys links captured as outgoing edges.
1 @param tuple  the tuple part of a vertex
2
3 case class Vertex (tuple : Tuple) :
4
5     val edge = Map [String, Set [Vertex]] ()
6
7 end Vertex

For data models where edges become first-class citizens, see the Appendix on Graph Data Models.

4.3 Columnar Relational Data Model
Of the NoSQL database management systems, columnar databases are closest to traditional relational
databases. Rather than tuples/rows taking center stage, columns/vectors take center stage.
A columnar database is made up of the following components:

• Element - a value from a given Domain or Datatype (e.g., Int, Long, Double, Rational, Real, Complex,
String, TimeNum)

• Column/Vector - a collection of values from the same Datatype (e.g., forming VectorI, VectorL,
VectorD, VectorQ, VectorR, VectorC, VectorS, VectorT)

• Columnar Relation - a heterogeneous collection of columns/vectors put into a table-like structure.

• Columnar Database - a collection of columnar relations.

Table 4.2 shows the first 10 rows (out of 392) for the well-known Auto MPG dataset (see https://
archive.ics.uci.edu/ml/datasets/Auto+MPG).

Table 4.2: Example Columnar Relation: First 10 Rows of Auto MPG Dataset

mpg cylinders displacement horsepower weight acceleration model year origin car name
Double Int Double Double Double Double Int Int String
18.0 8 307.0 130.0 3504.0 12.0 70 1 ”chevrolet chevelle”
15.0 8 350.0 165.0 3693.0 11.5 70 1 ”buick skylark 320”
18.0 8 318.0 150.0 3436.0 11.0 70 1 ”plymouth satellite”
16.0 8 304.0 150.0 3433.0 12.0 70 1 ”amc rebel sst”
17.0 8 302.0 140.0 3449.0 10.5 70 1 ”ford torino”
15.0 8 429.0 198.0 4341.0 10.0 70 1 ”ford galaxie 500”
14.0 8 454.0 220.0 4354.0 9.0 70 1 ”chevrolet impala”
14.0 8 440.0 215.0 4312.0 8.5 70 1 ”plymouth fury iii”
14.0 8 455.0 225.0 4425.0 10.0 70 1 ”pontiac catalina”
15.0 8 390.0 190.0 3850.0 8.5 70 1 ”amc ambassador dpl”

Since each column is stored as a vector, they can be readily compressed. Due to the high repetition in the
cylinders column it can be effectively compressed using Run Length Encoding (RLE) compression. In
addition, a column can be efficiently extracted since it is already stored as a vector in the database. These
vectors can be used in aggregate operators or passed into analytic models.
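To illustrate the idea behind RLE (this sketch is plain Scala, not ScalaTion's implementation), a run of equal values is replaced by a single (value, runLength) pair:

    def rle (col: Seq [Int]): List [(Int, Int)] =
        col.foldRight (List.empty [(Int, Int)]) { (v, acc) =>
            acc match
                case (w, n) :: rest if v == w => (w, n + 1) :: rest     // extend the current run
                case _                        => (v, 1) :: acc          // start a new run
        }

    println (rle (Seq (8, 8, 8, 8, 8, 8, 8, 8, 8, 8)))     // List((8,10)) -- e.g., the cylinders column above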
Data files in various formats (e.g., comma separated values (csv)) can be loaded into the database.
1 val auto_mpg = Relation ( " auto_mpg " , " auto_mpg . csv " )

It is easy to create a Multiple Linear Regression model for this dataset. Simply pick the response column,
in this case mpg and the predictor columns, in this case all other columns besides car name. The connection
between car name and mpg is coincidental. The response column/variable goes into a vector.
1 val y = auto_mpg . toVectorD (0)

The predictor columns/variables goes into a matrix.

1 val x = auto_mpg . toMatrixD (1 to 7)

Then the matrix x and vector y can be passed into a Regression model constructor.
1 val rg = new Regression (x , y )

See the next chapter for how to train a model, evaluate the quality of fit and make predictions.
The first API is a Columnar Relational Algebra that includes the standard operators of relational algebra
plus those common to column-oriented databases. It consists of the Table trait and two implementing classes:
Relation and MM_Relation. Persistence for Relation is provided by the save method, while MM_Relation
utilizes memory-mapped files.

4.3.1 Data Definition Language


A Relation object is created by invoking a constructor or factory apply function. For example, the following
six Relations may be useful in a traffic forecasting study.
1 val sensor = Relation ( " sensor " ,
2 Seq ( " sensorId " , " model " , " latitude " , " longitude " , " roadId " ) ,
3 Seq () , 0 , " ISDDI " )
4 val road = Relation ( " road " ,
5 Seq ( " roadId " , " rdName " , " lat1 " , " long1 " , " lat2 " , " long2 " ) ,
6 Seq () , 0 , " ISDDDD " )
7 val mroad = Relation ( " mroad " ,
8 Seq ( " roadId " , " rdName " , " lanes " , " lat1 " , " long1 " , " lat2 " , " long2 " ) ,
9 Seq () , 0 , " ISIDDDD " )
10 val traffic = Relation ( " traffic " ,
11 Seq ( " time " , " sensorId " , " count " " speed " ) ,
12 Seq () , Seq (0 , 1) , " TIID " )
13 val wsensor = Relation ( " wsensor " ,
14 Seq ( " sensorId " , " model " , " latitude " , " longitude " ) ,
15 Seq () , 0 , " ISDD " )
16 val weather = Relation ( " weather " ,
17 Seq ( " time " , " sensorId " , " precipitation " " wind " ) ,
18 Seq () , Seq (0 , 1) , " TIDD " )

The name of the first relation is “sensor” and it stores information about traffic sensors.

• The first argument is the name of the relation (name).

• The second argument is the sequence of attribute/column names (colName).

• The third argument is the sequence of data, currently empty (col),

• The fourth argument is the column number for the primary key (key),

• The fifth argument, “ISDDI”, indicates the domains (domain) for the attributes (Integer, String,
Double, Double, Integer).

• The sixth and optional argument can be used to define foreign keys (fKeys).

• The seventh and optional argument indicates whether to enter the relation into the system Catalog.

The second relation road stores the Id, name, beginning and ending latitude-longitude coordinates.
The third relation mroad is for multi-lane roads.
The fourth relation traffic stores the data collected from traffic sensors. The primary key in this case
is composite, Seq (0, 1), as both the time and the sensorId are required for unique identification.
The fifth relation wsensor stores information about weather sensors.
Finally, the sixth relation weather stores data collected from the weather sensors.

4.3.2 Data Manipulation Language


There are several ways to populate the Relations. A row/tuple can be added one at a time using def
add (tuple: Row). Population may also occur during relation construction (via a constructor or apply
method). There are factory apply functions that take a file or URL as input.
For example to populate the sensor relation with information about Austin, Texas’ traffic sensors stored
in the file austin traffic sensors.csv the following line of code may be used.
1 val sensor = Relation ("sensor", "austin_traffic_sensors.csv")

Data files are stored in subdirectories of ScalaTion’s data directory.

4.3.3 Columnar Relational Algebra


Table 4.3 shows the thirteen operators supported (the first six are considered fundamental). Operator
names as well as Unicode symbols may be used interchangeably (e.g., r union s or r ∪ s compute the union
of relations r and s). Note, the extended projection operator eproject (Π) provides a convenient mechanism
for applying aggregate functions. It is often called after the groupBy operator, in which case multiple rows
will be returned. Multiple columns may be specified in eproject as well. There are also several varieties of
join operators. As an alternative to using the Unicode symbol when they are Greek letters, the letter may
be written out in English (pi, sigma, rho, gamma, epi, omega, zeta, unzeta).
The subsections below present the columnar relational algebra operators, first showing the textbook
notation followed by the syntax in ScalaTion's column db package. To make the more complex examples
concise, let r = road, s = sensor, t = traffic, q = mroad, v = wsensor and w = weather.

Select Operator

The select operator will return the rows that match the predicate, in this case rdName == “I285”.

σ_{rdName == “I285”} (r)

r.σ (“rdName”, == “I285”)

Project Operator

The project operator will return the specified columns, in this case rdName, lat1, long1.

π_{rdName, lat1, long1} (r)

r.π (“rdName”, “lat1”, “long1”)

Union Operator

The union operator will return the rows from r and s with no duplicates. For this operator the textbook
syntax and column db syntax are identical.

r∪s

Minus Operator

The minus operator will return the rows from r that are not in s. For this operator the textbook syntax and
column db syntax are identical.

r−s

Cartesian Product Operator

The product operator will return all combinations of rows in r with rows in s. For this operator the textbook
syntax and column db syntax are identical.

r×s

Rename Operator

The rename operator renames relation r’s name to r2.

r.ρ(“r2”)

The above six operators form the fundamental operators for ScalaTion's column db package and are
shown as the first group in Table 4.3.

Table 4.3: Columnar Relational Algebra (r = road, s = sensor, t = traffic, q = mroad, w = weather)

Operator Unicode Example Return


select σ r.σ (“rdName”, == “I285”) rows of r where rdName == “I285”
project π r.π (“rdName”, “lat1”, “long1”) the rdName, lat1, and long1 columns of r
union ∪ r∪q rows that are in r or q
minus - r−q rows that are in r but not q
product × r×t concatenation of each row of r with those of t
rename ρ r.ρ(“r2”) a copy of r with new name r2
join ./ r ./ s rows in natural join of r and s
intersect ∩ r∩q rows that are in r and q
groupBy γ t.γ (“sensorId”) rows of t grouped by sensorId
eproject Π t.Π (avg, “acount”, “count”)(“sensorId”) the average of the count column of t
orderBy ω t.ω (“sensorId”) rows of t ordered by sensorId
compress ζ t.ζ (“count”) compress the count column of t
uncompress Z t.Z (“count”) uncompress the count column of t

The next seven operators, although not fundamental, are important operators in ScalaTion's column db
package and are shown as the second group in Table 4.3.

Join Operators

In order to combine information from two relations, join operators are preferred over products, as they are
much more efficient and only combine related rows. ScalaTion's column db package supports natural-
join, equi-join, general theta join, left outer join, and right outer join, as shown below.

r ./ s natural − join
r ./ (“roadId”, “roadId”, s) equi − join
r ./ [Int](s, (“roadId”, “roadId”, == )) theta join
t n (“time”, “time”, w) left outer join
t o (“time”, “time”, w) right outer join

Intersect Operator

The intersect operator will return the rows in r that are also in s. For this operator the textbook syntax
and column db syntax are identical.

r∩s

GroupBy Operator

The groupBy operator forms groups among the relation based on the equality of attributes. The following
example groups traffic data based in the value of the “sensorId” attribute.

t.γ(“sensorId”)

Extended Projection Operator

The extended projection operator eproject applies aggregate operators on aggregation columns (first argu-
ments) and regular project on the other columns (second arguments). Typically it is called after the groupBy
operator.

t.γ(“sensorId”).Π(avg, “acount”, “count”)(“sensorId”)

OrderBy Operator

The orderBy operator effectively puts the rows into ascending (descending) order based on the given at-
tributes.

t.ω(“sensorId”)

Compress Operator

The compress operator will compress the given columns of the relation.

t.ζ(“count”)

Uncompress Operator

The uncompress operator will uncompress the given columns of the relation.

t.Z(“count”)

4.3.4 Example Queries


Several example queries for the traffic study are given below.

1. Retrieve the automobile mileage data for cars with 8 cylinders.

auto mpg.select (“cylinders”, == 8)

Note, select and σ may be used interchangeably

2. Retrieve the automobile mileage data for cars with 8 cylinders, returning the car name and mpg.

auto mpg.select (“cylinders”, == 8).project (“car name”, “mpg”)

Note, project and π may be used interchangeably

3. Retrieve traffic data within a 100 kilometer-grid from the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).

val austin = latLong2UTMxy (LatitudeLongitude (30.266667, -97.733333))


val alat = (austin._1 - 100000, austin._1 + 100000)
val along = (austin._2 - 100000, austin._2 + 100000)
traffic ./ sensor.σ [Double] (“latitude”, ∈ alat).σ [Double] (“longitude”, ∈ along)

4.3.5 Relation Class

Class Methods:
1 @ param name the name of the relation
2 @ param colName the names of columns
3 @ param col the Scala Vector of columns making up the columnar relation
4 @ param key the column number for the primary key ( < 0 = > no primary key )
5 @ param domain an optional string indicating domains for columns ( e . g . , ’ SD ’ = ’ String ’
, ’ Double ’)
6 @ param fKeys an optional sequence of foreign keys
7 - Seq ( column name , ref table name , ref column position )
8 @ param enter whether to enter the newly created relation into the ‘ Catalog ‘
9

10 class Relation ( val name : String , val colName : Seq [ String ] , var col : Vector [ Vec ] =
null ,
11 val key : Int = 0 , val domain : String = null ,
12 var fKeys : Seq [( String , String , Int ) ] = null , enter : Boolean = true )
13 extends Table with Error with Serializable

4.4 SQL-Like Language
The SQL-Like API in ScalaTion provides many of the language constructs of SQL in a functional style.

4.4.1 Relation Creation


A RelationSQL object is created by invoking a constructor or factory apply function. For example, the
following six RelationSQLs may be useful in a traffic forecasting study.
1 val sensor = RelationSQL ( " sensor " ,
2 Seq ( " sensorId " , " model " , " latitude " , " longitude " , " roadId " ) ,
3 null , 0 , " ISDDI " )
4 val road = RelationSQL ( " road " ,
5 Seq ( " roadId " , " rdName " , " lat1 " , " long1 " , " lat2 " , " long2 " ) ,
6 null , 0 , " ISDDDD " )
7 val mroad = RelationSQL ( " mroad " ,
8 Seq ( " roadId " , " rdName " , " lanes " , " lat1 " , " long1 " , " lat2 " , "
long2 " ) ,
9 null , 0 , " ISIDDDD " )
10 val traffic = RelationSQL ( " traffic " ,
11 Seq ( " time " , " sensorId " , " count " , " speed " ) ,
12 null , 0 , " TIID " )
13 val wsensor = RelationSQL ( " wsensor " ,
14 Seq ( " sensorId " , " model " , " latitude " , " longitude " ) ,
15 null , 0 , " ISDD " )
16 val weather = RelationSQL ( " weather " ,
17 Seq ( " time " , " sensorId " , " precipitation " , " wind " ) ,
18 null , 0 , " TIDD " )

4.4.2 Sample Queries


The ScalaTion columnar database provides a functional SQL-like query language.

1. Retrieve the vehicle traffic counts over time from all sensors on the road with Id = 101.
1 ( traffic join sensor ) . where [ Int ] ( " roadId " , _ = = 101)
2 . select ( " sensorId " , " time " , " count " )

In SQL, this would be written as follows:


1 select sensorId , time , count
2 from traffic natural join sensor
3 where roadId = = 101

2. Retrieve the vehicle traffic counts averaged over time from all sensors on the road with Id = 101.
1 ( traffic join sensor ) . where [ Int ] ( " roadId " , _ = = 101)
2 . groupBy ( " sensorId " )
3 . eselect (( avg , " acount " , " count " ) ) ( " sensorId " )

4.4.3 RelationSQL Class

Class Methods:
1 @ param name the name of the relation
2 @ param colName the names of columns
3 @ param col the Scala Vector of columns making up the columnar relation
4 @ param key the column number for the primary key ( < 0 = > no primary key )
5 @ param domain an optional string indicating domains for columns ( e . g . , ’ SD ’ = ’ String ’
, ’ Double ’)
6 @ param fKeys an optional sequence of foreign keys - Seq ( column name , ref table name ,
ref column position )
7

8 class RelationSQL ( name : String , colName : Seq [ String ] , col : Vector [ Vec ] ,
9 key : Int = 0 , domain : String = null , fKeys : Seq [( String , String , Int
) ] = null )
10 extends Tabular with Serializable
11

12 def repr : Relation = r


13 def this ( r : Relation ) = this ( r . name , r . colName , r . col , r . key , r . domain , r . fKeys )
14 def select ( cName : String *) : RelationSQL =
15 def eselect ( aggCol : AggColumn *) ( cName : String *) : RelationSQL =
16 def join ( r2 : RelationSQL ) : RelationSQL =
17 def join ( cName1 : String , cName2 : String , r2 : RelationSQL ) : RelationSQL =
18 def join ( cName1 : Seq [ String ] , cName2 : Seq [ String ] , r2 : RelationSQL ) : RelationSQL =
19 def where [ T : ClassTag ] ( cName : String , p : T = > Boolean ) : RelationSQL =
20 def where2 [ T : ClassTag ] ( p : Predicate [ T ]*) : RelationSQL =
21 def groupBy ( cName : String *) : RelationSQL =
22 def orderBy ( cName : String *) : RelationSQL =
23 def orderByDesc ( cName : String *) : RelationSQL =
24 def union ( r2 : RelationSQL ) : RelationSQL =
25 def intersect ( r2 : RelationSQL ) : RelationSQL =
26 def intersect2 ( r2 : RelationSQL ) : RelationSQL =
27 def minus ( r2 : RelationSQL ) : RelationSQL =
28 def minus2 ( r2 : RelationSQL ) : RelationSQL =
29 def stack ( cName1 : String , cName2 : String ) : RelationSQL =
30 def insert ( rows : Row *)
31 def materialize ()
32 def exists : Boolean = r . exists

1 def toMatrixD ( colPos : Seq [ Int ] , kind : MatrixKind = DENSE ) : MatrixD =


2 def toMatrixDD ( colPos : Seq [ Int ] , colPosV : Int , kind : MatrixKind = DENSE ) : ( MatrixD ,
VectorD ) =
3 def toMatrixDI ( colPos : Seq [ Int ] , colPosV : Int , kind : MatrixKind = DENSE ) : ( MatrixD ,
VectorI ) =
4 def toMatrixI ( colPos : Seq [ Int ] , kind : MatrixKind = DENSE ) : MatrixI =
5 def toMatrixI2 ( colPos : Seq [ Int ] = null , kind : MatrixKind = DENSE ) : MatrixI =
6 def toMatrixII ( colPos : Seq [ Int ] , colPosV : Int , kind : MatrixKind = DENSE ) : ( MatrixI ,
VectorI ) =
7 def toVectorD ( colPos : Int = 0) : VectorD = r . toVectorD ( colPos )
8 def toVectorD ( colName : String ) : VectorD = r . toVectorD ( colName )
9 def toVectorI ( colPos : Int = 0) : VectorI = r . toVectorI ( colPos )
10 def toVectorI ( colName : String ) : VectorI = r . toVectorI ( colName )
11 def toVectorL ( colPos : Int = 0) : VectorL = r . toVectorL ( colPos )

12 def toVectorL ( colName : String ) : VectorL = r . toVectorL ( colName )
13 def toVectorS ( colPos : Int = 0) : VectorS = r . toVectorS ( colPos )
14 def toVectorS ( colName : String ) : VectorS = r . toVectorS ( colName )
15 def toVectorT ( colPos : Int = 0) : VectorT = r . toVectorT ( colPos )
16 def toVectorT ( colName : String ) : VectorT = r . toVectorT ( colName )
17 def show ( limit : Int = Int . MaxValue ) = r . show ( limit )
18 def save () = r . save ()
19 def generateIndex ( reset : Boolean = false ) = r . generateIndex ( reset )

4.5 Exercises
1. Use Scala 3 to complete the implementation of the following ScalaTion data models: Table, LTable,
and VTable in the scalation.table package. A group will work on one of the data models. See Appendix
C for two more data models: GTable and PGraph.

• Test all the operators.


• Test all types of unique indices (IndexMap). Use the import scheme shown in the beginning of
Table.scala.

Table 4.4: Types of Indices (for Unique, Non-Unique Indices)

IndexMap MIndexMap Description


LinHashMap LinHashMultiMap ScalaTion’s Linear Hash Map
HashMap HashMultiMap Scala’s Hash Map
JHashMap JHashMultiMap Java’s Hash Map
BpTreeMap BpTreeMultiMap ScalaTion’s B+ Tree Map
TreeMap TreeMultiMap Scala’s Tree Map
JTreeMap JTreeMultiMap Java’s Tree Map

• Test all types of non-unique indices (MIndexMap). Use the import scheme shown in the beginning
of Table.scala.
• Add use of indexing to speed up as many operations as possible.
• Speed up joins by using Unique Indices and Non-Unique Indices.
• Use index-free adjacency when possible for further speed-up.
• Make the save operation efficient, by only serializing tuples/vertices that have changed since the
last load. One way to approach this would be to maintain a map in persistent storage,
1 Map [ KeyType , [ TimeNum , Tuple ]]

where the key for a tuple/vertex may be used to check the timestamp of a tuple/vertex. Unless
the timestamp of the volatile tuple/vertex is larger, there is no need to save it. Further speed
improvement may be obtained by switching from Java’s text-based serialization to Kryo’s binary
serialization.

2. Conflict vs. View Equivalence. TBD.

3. Comparison of Concurrency Control Protocols. TBD.

4. Create the sensor schema using the RelationSQL class in the columnar db package.

5. Populate the sensor database with sample data. See


https://data.austintexas.gov/Transportation-and-Mobility/Traffic-Count-Study-Area/cqdh-farx

6. Retrieve the sensors that are on I35.

7. Retrieve traffic data within a 100 kilometer-grid from the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).

8. Consider the following schema:


1 val student = Table ( " student " ," sid , sname , street , city , dept , level " ,
2 "I , S , S , S , S , I " , " sid " )
3 val professor = Table ( " professor " , " pid , pname , street , city , dept " ,
4 "I , S , S , S , S " , " pid " )
5 val course = Table ( " course " , " cid , cname , hours , dept , pid " ,
6 "I , X , I , S , I " , " cid " )
7 val takes = Table ( " takes " , " sid , cid " ,
8 "I , I " , " sid , cid " )

Formulate a relation algebra expression to list the names of the professors of courses taken by Peter.

Chapter 5

Data Preprocessing

5.1 Basic Operations


Using the ScalaTion TSDB, data scientists may write queries that extract data from one or more columnar
relations. These data are used to create vectors and matrices that may be passed to various analytics
techniques. Before the vectors and matrices are created the data need to be preprocessed to improve data
quality and transform the data into a form more suitable for analytics.

5.1.1 Remove Identifiers


Any column that is unique (e.g., a primary key) with arbitrary values should be removed before applying a
modeling/analytics technique. For example, an employee ID in a Neural Network analysis to predict salary
could result in a perfect fit: upon knowing the employee ID, the salary is known. As the ID itself (e.g.,
ID = 1234567) is arbitrary, such a model has little value.

5.1.2 Convert String Columns to Numeric Columns


In ScalaTion, columns with strings (VectorS) should be converted to integers. For displaying final results,
however, it is often useful to convert the integers back to the original strings. These capabilities are provided by
the map2Int function in the VectorS class (see the section on RegressionCat).

5.1.3 Identify Missing Values


Missing values are common in real datasets. For some datasets, a question mark character ‘?’ is used to
indicate that a value is missing. In Comma Separated Value (CSV) files, repeated commas may indicate
missing values, e.g., 10.1, 11.2,,,9.8. If zero or negative numbers are not valid for the application, these may
be used to indicate missing values.

5.1.4 Preliminary Feature Selection


Before selecting a modeling/analytics technique, certain columns may be thrown away. Examples include
columns with too many missing values or columns with near zero variance. Further discussion of this topic can
be found in the section on Exploratory Data Analysis (EDA).

5.2 Methods for Outlier Detection

Data points that are considered outliers may happen because of errors or highly unusual occurrences. For
example, suppose a dataset records the times for members of a football team to run a 100-yard dash and
one of the recorded values is 3.2 seconds. This is an outlier. Some analytics techniques are less sensitive to
outliers, e.g., ℓ1 Regression, while others, e.g., ℓ2 Regression, are more sensitive. Detection of outliers suffers
from the obvious problems of being too strict (in which case good data may be thrown away) or too lenient
(in which case outliers are passed to an analytics technique). One may choose to handle outliers separately,
or turn them into missing values, so that both outliers and missing values may be handled together.

5.2.1 Based on Standard Deviation

If measured values for a random variable xj are approximately Normally distributed and are several standard
deviation units away from the center (µxj), they are rare events. Depending on the situation, this may be
important information to examine, but may often indicate incorrect measurement. Table 5.1 shows how
unlikely it is to obtain data points in the distant tails of a Normal distribution. The standard way to detect
outliers using the standard deviation method is to examine points beyond three standard deviation (σxj)
units as candidate outliers. This is also called the z-score method as xj needs to be transformed to zj, which
follows the Standard Normal distribution.

zj = (xj − µxj) / σxj        (5.1)

Table 5.1: Probabilities/Percentiles for the Standard Normal Distribution

± distance percent inside percent in tails outside per 10,000


0.67448 50.00 50.00 5000
1.00000 68.27 31.73 3173
1.50000 86.64 13.36 1336
1.95996 95.00 5.00 500
2.00000 95.45 4.55 455
2.50000 98.76 1.24 124
2.57500 99.00 1.00 100
2.70000 99.31 0.69 69
3.00000 99.73 0.27 27
3.50000 99.95 0.05 5

[Figure: pdf for the Standard Normal Distribution, fzj(z) plotted against z]
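
A minimal sketch of the z-score transform in Equation 5.1 is given below; it assumes VectorD provides mean and variance methods and element-wise arithmetic with scalars (the data values are hypothetical).

import scala.math.{abs, sqrt}
import scalation.mathstat.VectorD

val xj = VectorD (11.8, 12.1, 11.5, 12.4, 3.2, 11.9)       // hypothetical 100-yard dash times
val z  = (xj - xj.mean) / sqrt (xj.variance)               // standardized values (Eq. 5.1)
for i <- z.indices if abs (z(i)) > 2.7 do                  // default threshold used by DistanceOutlier
    println (s"xj($i) = ${xj(i)} flagged as a potential outlier")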

5.2.2 Based on InterQuartile Range


The InterQuartile Range (IQR) for the Standard Normal distribution is 1.34896 (±0.67448). It spans the
second and third quartiles, i.e., the middle two of the four quartiles, from the .25 quantile Q.25[xj] to the
.75 quantile Q.75[xj]. The IQR gives a basic distance or yardstick for measuring when points are too far
away from the median. A data point xj should be examined as an outlier when the following rule is true.

xj ∉ [ Q.25[xj] − δ · IQR, Q.75[xj] + δ · IQR ]        (5.2)

For the Normal distribution case, when the scale factor δ = 1.5, it corresponds to 2.69792 standard deviation
units and at 2.0 it corresponds to 3.3724 standard deviation units (see the exercises). The advantage of
this method over the previous one is that it can work when the data points are not approximately Normal.
This includes the cases where the distribution is not symmetric (a problematic situation for the previous
method). A weakness of the IQR method occurs when data are concentrated near the median, resulting in
an IQR that is in some sense too small to be useful.
Use of Box-Plots provides visual support for spotting outliers. The IQR is shown as a box, with whiskers
extending δ · IQR units beyond the box in both directions and with markers for extreme data points beyond
the whiskers.
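
The from-scratch sketch below illustrates the rule in Equation 5.2 using plain Scala and a crude nearest-rank quantile; it is only meant to show the rule, while ScalaTion's QuartileXOutlier (listed in the next subsection) provides this detection directly.

// flag values outside [Q.25 - delta*IQR, Q.75 + delta*IQR] (Eq. 5.2)
def iqrFlags (x: Array [Double], delta: Double = 1.5): Array [Boolean] =
    val xs = x.sorted
    def quantile (p: Double): Double = xs (math.round (p * (xs.length - 1)).toInt)   // nearest-rank quantile
    val (q1, q3) = (quantile (0.25), quantile (0.75))
    val iqr = q3 - q1
    x.map (v => v < q1 - delta * iqr || v > q3 + delta * iqr)                        // true => examine as outlier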

5.2.3 Based on Quantiles/Percentiles


A simple method for detecting outliers is to assume that the most extreme 1% of data points are outliers
(they may well not be). This would include the 0.5% smallest and 0.5% largest data points. Under the
Normality assumption this would correspond to 2.575 standard deviation units. Given that this method
does not look for how far points are from a mean or median, it should not be used as the sole evidence that
a data point is an outlier.
In the Outlier.scala file, ScalaTion currently provides the following techniques for outlier detection:

• Standard Deviation Method: data points too many standard deviation units (typically 2.5 to 3.5,
defaults to 2.7) away from the mean, DistanceOutlier;

• InterQuartile Range Method: data points a scale factor/expansion multiplier (typically 1.5 to 2.0,
defaults to 1.5) times the IQR beyond the middle two quartiles, QuartileXOutlier; and

• Quantiles/Percentile Method: data points in the extreme percentages (typically 0.7 to 10 percent,
defaults to 0.7), i.e., having the smallest or largest values, QuantileOutlier.

Note: These defaults put these three outlier detection methods in alignment when data points are approx-
imately Normally distributed.
The following function will turn outliers into missing values, by reassigning the outliers to noDouble,
ScalaTion's indicator of a missing value of type Double.

DistanceOutlier.rmOutlier (traffic.column (“speed”))

An alternative to eliminating outliers during data preprocessing is to eliminate them during modeling
by looking for extreme residuals. In addition to looking at the magnitude of a residual εi, some argue for only
removing data points that also have high influence on the model's parameters/coefficients, using techniques
such as DFFITS, Cook's Distance, or DFBETAS [34].

5.3 Imputation Techniques
The two main ways to handle missing values are (1) throw them away, or (2) use imputation to replace them
with reasonable guesses. When there is a gap in time series data, imputation may be used for short gaps,
but is unlikely to be useful for long gaps. This is especially true when imputation techniques are simple. The
alternative could be to use an advanced modeling technique like SARIMA for imputation, but then results
of a modeling study using SARIMA are likely to be biased. Imputation implementations are based on the
Imputation trait in the scalation.modeling package.

5.3.1 Imputation Trait

Trait Methods:
1 trait Imputation
2

3 def setMissVal ( missVal_ : Double ) { missVal = missVal_ }


4 def setDist ( dist_ : Int ) { dist = dist_ }
5 def imputeAt ( x : VectorD , i : Int ) : Double
6 def impute ( x : VectorD , i : Int = 0) : ( Int , Double ) = findMissing (x , i )
7 def imputeAll ( x : VectorD ) : VectorD =
8 def impute ( x : MatrixD ) : MatrixD =
9 def imputeCol ( c : Vec , i : Int = 0) : ( Int , Any ) =

ScalaTion currently supports the following imputation techniques:

1. object ImputeRegression extends Imputation: Use SimpleRegression on the instance index to


estimate the next missing value.

2. object ImputeForward extends Imputation: Use the previous value and slope to estimate the next
missing value.

3. object ImputeBackward extends Imputation: Use the subsequent value and slope to estimate the
previous missing value.

4. object ImputeMean extends Imputation: Use the filtered mean to estimate the next missing value.

5. object ImputeMovingAvg extends Imputation: Use the moving-average of the last ’dist’ values to
estimate the next missing value.

6. object ImputeNormal extends Imputation: Use the median of three Normally distributed, based
on filtered mean and variance, random values to estimate the next missing value.

7. object ImputeNormalWin extends Imputation: Same as ImputeNormal except mean and variance
are recomputed over a sliding window.
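
A minimal usage sketch is given below; it assumes noDouble (ScalaTion's missing-value marker) is in scope and that the imputation objects live in scalation.modeling as listed above. The data values are hypothetical.

import scalation.mathstat.VectorD
import scalation.modeling.ImputeForward

val speed = VectorD (55.0, 57.0, noDouble, 61.0, noDouble, 58.0)   // hypothetical series with gaps
ImputeForward.setMissVal (noDouble)                                // tell the imputer what marks a gap
val filled = ImputeForward.imputeAll (speed)                       // fill each gap from the previous value and slope
println (filled)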

5.4 Align Multiple Time Series
When the data include multiple time series, there are likely to be time alignment problems. The frequency
and/or phase may not be in agreement. For example, traffic count data may be recorded every 15 minutes
and phased on the hour, while weather precipitation data may be collected every 30 minutes and phased to
10 minutes past the hour.
ScalaTion supports the following alignment techniques: (1) approximate left outer join and (2) dy-
namic time warping. The first operator will perform a left outer join between two relations based on their
time (TimeNum) columns. Rather than the usual matching based on equality, approximately equal times are
considered sufficient for alignment. For example, to align traffic data with the weather data, the following
approximate left outer join may be used.

traffic ⟕ (0.01)(“time”, “time”, weather)        approximate left outer join

The second operator ...

5.5 Creating Vectors and Matrices
Once the data have been preprocessed, columns may be projected out to create a matrix that may be passed
to analytics/modeling techniques.

val mat = π“time”,“count” (traffic).toMatrixD

This matrix may then be passed into multiple modeling techniques: (1) a Multiple Linear Regression model, (2) an
Auto-Regressive, Integrated, Moving-Average (ARIMA) model.

val model1 = Regression (mat)


val model2 = ARIMA (mat)

By default in ScalaTion the rightmost columns are the response/output variables. As many of the
modeling techniques have a single response variable, it will be assumed to be in the last column. There are also
constructors and factory apply functions that take explicit vector and matrix parameters, e.g., a matrix of
predictor variables and a response vector.

5.6 Exercises
1. Assume random variable xj is distributed N (µ, σ).
(a) Show that when the scale factor δ = 1.5, the InterQuartile Range method corresponds to the
Standard Deviation method at 2.69792 standard deviation units.
(b) Show that when the scale factor δ = 2.0, the InterQuartile Range method corresponds to the
Standard Deviation method at 3.3724 standard deviation units.
(c) What should the scale factor δ need to be to correspond to 3 standard deviation units?

2. Randomly generate 10,000 data points from the Standard Normal distribution. Count how many of
these data points are considered as outliers for
(a) the Standard Deviation method set at 3.3724 standard deviation units, and
(b) the InterQuartile Range method with δ = 2.0.
(c) the Quantile/Percentile method set at what? percent.

3. Load the auto_mpg.csv dataset into an auto_mpg relation. Perform the preprocessing steps above to
create a cleaned-up relation auto_mpg2 and produce a data matrix called auto_mat from this relation.
Print out the correlation matrix for auto_mat. Which columns have the highest correlation? To predict
the miles per gallon mpg, which columns are likely to be the best predictors?

4. Find a dataset at the UCI Machine Learning Repository and carry out the same steps
https://archive.ics.uci.edu/ml/index.php.

Part II

Modeling

Chapter 6

Prediction

As the name predictive analytics indicates, the purpose of techniques that fall in this category is to develop
models to predict outcomes. For example, the distance a golf ball travels y when hit by a driver depends
on several factors or inputs x such as club head speed, barometric pressure, and smash factor (how square
the impact is). The models can be developed using a combination of data (e.g., from experiments) and
knowledge (e.g., Newton’s Second Law). The modeling techniques discussed in this technical report tend
to emphasize the use of data more than knowledge, while those in the simulation modeling technical report
emphasize knowledge.
Abstractly, a predictive model can generally be formulated using a prediction function f as follows:

y = f(x, t; b) + ε        (6.1)
where

• y is a response/output scalar,

• x is a predictor/input vector,

• t is a scalar representing time,

• b is the vector of parameters of the function, and

• ε represents the remaining residuals/errors.

Both the response y and residuals/errors ε are treated as random variables, while the predictor/feature
variables x may be treated as either random or deterministic depending on context. Depending on the goals
of the study as well as whether the data are the product of controlled/designed experiments, the random or
deterministic view may be more suitable.
The parameters b can be adjusted so that the predictive model matches the available data. Note,
in the definition of a function, the arguments appear before the “;”, while the parameters appear after.
The residuals/errors are typically additive as shown above, but may also be multiplicative. Of course, the
formulation could be generalized by turning the output/response into a vector y and the parameters into a
matrix B.
When a model is time-independent or time can be treated as just another dimension within the x vectors,
prediction functions can be represented as follows:

y = f(x; b) + ε        (6.2)

Another way to look at such models is that we are trying to estimate the conditional expectation of y given x.

y = E[y|x] + ε
ε = y − f(x; b)

Given a dataset (m instances of data), each instance contributes to an overall residual/error vector ε.
One of the simpler ways to estimate the parameters b is to minimize the size of the residual/error vector,
e.g., its Euclidean norm. The square of this norm is the sum of squared errors (sse).

sse = ‖ε‖² = ε · ε        (6.3)

This corresponds to minimizing the raw mean square error (mse = sse/m). See the section on Generalized
Linear Models for further development along these lines.
In ScalaTion, data are passed to the train function to train the model/fit the parameters b. In the
case of prediction, the predict function is used to predict values for the scalar response y.
A key question to address is the possible functional forms that f may take, such as the importance of
time, the linearity of the function, the domains for y and x, etc. We consider several cases in the subsections
below.

6.1 Predictor
In ScalaTion, the Predictor trait provides a common framework for several predictor classes such as
SimpleRegression or Regression. All of the modeling techniques discussed in this chapter extend the
Predictor trait. They also extend the Fit trait to enable Quality of Fit (QoF) evaluation. (Unlike classes,
traits support multiple inheritance).
Many modeling techniques utilize several predictor/input variables to predict a value for a response/out-
put variable, e.g., given values for [x0 , x1 , x2 ] predict a value for y. The datasets fed into such modeling
techniques will collect multiple instances of the predictor variables into a matrix x and multiple instances of
the response variable into a vector y. The Predictor trait takes datasets of this form.

6.1.1 Predictor Trait

Trait Methods:
1 @ param x the input / data m - by - n matrix
2 ( augment with a first column of ones to include intercept in model )
3 @ param y the response / output m - vector
4 @ param fname the feature / variable names ( if null , use x_j )
5 @ param hparam the hyper - parameters for the model
6

7 trait Predictor ( x : MatrixD , y : VectorD , protected var fname : Array [ String ] ,


8 hparam : HyperParameter )
9 extends Model :
10

11 def getX : MatrixD = x


12 def getY : VectorD = y
13 def getFname : Array [ String ] = fname
14 def numTerms : Int = getX . dim2
15 def train ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit
16 def train2 ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit =
17 def test ( x_ : MatrixD = x , y_ : VectorD = y ) : ( VectorD , VectorD )
18 def trainNtest ( x_ : MatrixD = x , y_ : VectorD = y )
19 ( xx : MatrixD = x , yy : VectorD = y ) : ( VectorD , VectorD ) =
20 def predict ( z : VectorD ) : Double = b dot z
21 def predict ( x_ : MatrixD ) : VectorD =
22 def hparameter : HyperParameter = hparam
23 def parameter : VectorD = b
24 def residual : VectorD = e
25

26 def buildModel ( x_cols : MatrixD ) : Predictor = null


27 def selectFeatures ( tech : SelectionTech , idx_q : Int = QoF . rSqBar . ordinal ,
28 cross : Boolean = true ) : ( LinkedHashSet [ Int ] , MatrixD ) =
29 def forwardSel ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ) : BestStep =
30 def forwardSelAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :
31 ( LinkedHashSet [ Int ] , MatrixD ) =
32 def importance ( cols : Array [ Int ] , rSq : MatrixD ) : Array [( Int , Double ) ] =
33 def backwardElim ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ,
34 first : Int = 1) : BestStep =
35 def backwardElimAll ( idx_q : Int = QoF . rSqBar . ordinal , first : Int = 1 ,
36 cross : Boolean = true ) : ( LinkedHashSet [ Int ] , MatrixD ) =
37 def stepRegressionAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :
38 ( LinkedHashSet [ Int ] , MatrixD ) =
39

40 def vif ( skip : Int = 1) : VectorD =


41 inline def testIndices ( n_test : Int , rando : Boolean ) : IndexedSeq [ Int ] =
42 def validate ( rando : Boolean = true , ratio : Double = 0.2)
43 ( idx : IndexedSeq [ Int ] =
44 testIndices (( ratio * y . dim ) . toInt , rando ) ) : VectorD =
45 def crossValidate ( k : Int = 5 , rando : Boolean = true ) : Array [ Statistic ] =

The Predictor trait extends the Model trait (see the end of the Probability chapter) and has the following
methods:

1. The getX method returns the actual data/input matrix used by the model. Some complex models
expand the columns in an initial data matrix to add for example quadratic or cross terms.

2. The getY method returns the actual response/output vector used by the model. Some complex models
transform the initial response vector.

3. The getFname method returns the names of predictor variable/features, both given and extended.

4. The numTerms method returns the number of terms in the model.

5. The train method takes the dataset passed into the model (either the full dataset or a training-data)
and optimizes the model parameters b.

6. The train2 method takes the dataset passed into the model (either the full dataset or a training
dataset) and optimizes the model parameters b. It also optimizes the hyper-parameters.

7. The test method evaluates the Quality of Fit (QoF) either on the full dataset or a designated test-data
using the diagnose method.

8. The trainNtest method trains on the training-set and evaluates on the test-set.

9. The predict method takes a data vector (e.g., a new data instance) and predicts its response. Another
predict method takes a matrix as input (with each row being an instance) and makes predictions for
each row.

10. The hparameter method returns the hyper-parameters for the model. Many simple models have none,
but more sophisticated modeling techniques such as RidgeRegression and LassoRegression have
them (e.g., a shrinkage hyper-parameter).

11. The parameter method returns the estimated parameters for the model.

12. The residual method returns the difference between the actual and predicted response vectors. The
residual indicates what the model has left to explain/account for (e.g., an ideal model will only leave
the noise in the data unaccounted for).

13. The buildModel method builds a sub-model that is restricted to given columns of the data matrix.
This method is called by the following feature selection methods.

14. The selectFeatures method makes it easy to switch between forward, backward and stepwise feature
selection.

15. The forwardSel method is used for forward selection of variables/features for inclusion into the model.
At each step the variable that increases the predictive power of the model the most is selected. This
method is called repeatedly in forwardSelAll to find the “best” combination of features; it is not
guaranteed to find the optimal combination.

16. The importance method is used to indicate the relative importance of the features/variables.

17. The backwardElim method is used for backward elimination of variables/features from the model. At
each step the variable that contributes the least to the predictive power of the model is eliminated. This
method is called repeatedly in backwardElimAll to find the “best” combination of features; it is not
guaranteed to find the optimal combination.

18. The stepRegressionAll method decides to add or remove a variable/feature based on whichever leads
to the greater improvement. It continues until there is no further improvement. A swap operation may
yield a better combination of features.

19. The vif method returns the Variance Inflation Factors (VIFs) for each of the columns in the data/input
matrix. High VIF scores may indicate multi-collinearity.

20. The testIndices method returns the indices of the test-set.

21. The validate method divides a dataset into a training-set and a test-set, trains on one and tests on
the other to determine out-of-sample Quality of Fit (QoF).

22. The crossValidate method implements k-fold cross-validation, where a dataset is divided into a
training-set and a test-set. The training-set is used by the train method, while the test-set is used by
the test method. The crossValidate method is similar to validate, but more extensive in that it
repeats this process k times and makes sure all the data ends up in one of the k test-sets.
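
A minimal sketch of the common Predictor workflow is shown below, using Regression as the concrete class with a tiny hypothetical dataset (and assuming Regression's optional constructor arguments can be defaulted); the same calls apply to the other predictor classes in this chapter.

import scalation.mathstat.{MatrixD, VectorD}
import scalation.modeling.Regression

val x = MatrixD ((4, 2), 1, 1,                       // data matrix with a leading column of ones
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (1, 3, 3, 4)                         // response vector

val mod = new Regression (x, y)                      // instantiate a Predictor subclass
mod.trainNtest ()()                                  // train, then test/diagnose
println (s"b  = ${mod.parameter}")                   // fitted parameter vector
println (s"yp = ${mod.predict (VectorD (1, 5))}")    // predict the response for a new instance
mod.crossValidate ()                                 // 5-fold cross-validation (out-of-sample QoF)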

6.2 Quality of Fit for Prediction
The related Fit trait provides a common framework for computing Quality of Fit (QoF) measures. The
dataset for many models comes in the form of an m-by-n data matrix X and an m response vector y. After
the parameters b (an n vector) have been fit/estimated, the error vector ε may be calculated. The basic
QoF measures involve taking either ℓ1 (Manhattan) or ℓ2 (Euclidean) norms of the error vector as indicated
in Table 6.1.

Table 6.1: Quality of Fit

error/residual    absolute (ℓ1 norm)                       squared (ℓ2 norm)

sum               sum of absolute errors sae = ‖ε‖1        sum of squared errors sse = ‖ε‖2²
mean              mean absolute error mae0 = sae/m         mean squared error mse0 = sse/m
unbiased mean     mean absolute error mae = sae/df         mean squared error mse = sse/df

Typically, if a model has m instances/rows in the dataset and n parameters to fit, the error vector will live
in an m − n dimensional space (ignoring issues related to the rank of the data matrix). Note, if n = m, there
may be a unique solution for the parameter vector b, in which case ε = 0, i.e., the error vector lives in a
0-dimensional space. The Degrees of Freedom (for error) is the dimensionality of the space that the error
vector lives in, namely, df = m − n.
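
A from-scratch check of the basic measures is sketched below, assuming VectorD supports element-wise subtraction of a scalar; the Fit trait's diagnose method computes these and more.

import scalation.mathstat.VectorD

val y  = VectorD (1, 3, 3, 4)                        // actual responses
val yp = VectorD (1.4, 2.3, 3.2, 4.1)                // predicted responses
val e   = y - yp                                     // error/residual vector
val sse = e dot e                                    // sum of squared errors
val dy  = y - y.mean                                 // deviations from the mean
val sst = dy dot dy                                  // total sum of squares
val rSq = 1.0 - sse / sst                            // coefficient of determination
println (s"sse = $sse, sst = $sst, rSq = $rSq")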

6.2.1 Fit Trait

Trait Methods:
1 @ param dfm the degrees of freedom for model / regression
2 @ param df the degrees of freedom for error
3

4 trait Fit ( private var dfm : Double , private var df : Double )


5 extends FitM :
6

7 def resetDF ( df_update : ( Double , Double ) ) : Unit =


8 def mse_ : Double = mse
9 override def diagnose ( y : VectorD , yp : VectorD , w : VectorD = null ) : VectorD =
10 def ll ( ms : Double = mse0 , s2 : Double = sig2e , m2 : Int = m ) : Double =
11 def fit : VectorD = VectorD ( rSq , rSqBar , sst , sse , mse0 , rmse , mae ,
12 dfm , df , fStat , aic , bic , mape , smape , mase )
13 def help : String = Fit . help
14 def summary ( x_ : MatrixD , fname : Array [ String ] , b : VectorD , vifs : VectorD = null ) :
15 String =

For modeling, a user chooses one of the classes (directly or indirectly) extending the trait Predictor
(e.g., Regression) to instantiate an object. Next the train method would typically be called, followed
by the test method, which computes the residual/error vector and calls the diagnose method. Then the

fitMap method would be called to return quality of fit statistics computed by the diagnose method. The
quality of fit measures computed by the diagnose method in the Fit class are shown below.
1 @ param y the actual response / output vector to use ( test / full )
2 @ param yp the predicted response / output vector ( test / full )
3 @ param w the weights on the instances ( defaults to null )
4

5 override def diagnose ( y : VectorD , yp : VectorD , w : VectorD = null ) : VectorD =


6 val e = super . diagnose (y , yp , w )
7

8 if dfm <= 0 || df <= 0 then


9 flaw ( " diagnose " , s " degrees of freedom dfm = df manddf =df must be > 0 " )
10 if dfm = = 0 then dfm = 1 // must have at least 1 DoF
11 // b_0 or b_0 + b_1x_1 or b_1x_1
12 msr = ssr / dfm // mean of squares for model
13 mse = sse / df // mean of squares for error
14

15 rse = sqrt ( mse ) // residual standard error


16 rSqBar = 1 - (1 - rSq ) * r_df // adjusted R - squared
17 fStat = msr / mse // F statistic ( quality of fit )
18 p_fS = 1.0 - fisherCDF ( fStat , dfm . toInt , df . toInt ) // p - value for fStat
19 if p_fS . isNaN then p_fS = 0.0 // NaN = > error by fisherCDF
20 if sig2e = = -1.0 then sig2e = e . variance_
21

22 val ln_m = log ( m ) // natural log of m ( ln ( m ) )


23 aic = ll () + 2 * ( dfm + 1) // Akaike Information Criterion
24 // +1 on dfm accounts for sig2e
25 bic = aic + ( dfm + 1) * ( ln_m - 2) // Bayesian Info . Criterion
26 mape = 100 * ( e . abs / y . abs ) . sum / m // mean abs . percentage error
27 smape = 200 * ( e . abs / ( y . abs + yp . abs ) ) . sum / m // symmetric MAPE
28 mase = Fit . mase (y , yp ) // mean absolute scaled error
29 fit
30 end diagnose

One may look at the sum of squared errors (sse) as an indicator of model quality.

sse = ε · ε        (6.4)

In particular, sse can be compared to the sum of squares total (sst), which measures the total variability of
the response y,

sst = ‖y − µy‖² = y · y − m µy² = y · y − (1/m) (Σ yi)²        (6.5)
while the sum of squares regression (ssr = sst − sse) measures the variability captured by the model, so the
coefficient of determination measures the fraction of the variability captured by the model.

R² = ssr / sst = 1 − sse / sst ≤ 1        (6.6)
Values for R2 would be non-negative, unless the proposed model is so bad (worse than the Null Model that
simply predicts the mean) that the proposed model actually adds variability.

6.3 Null Model
The NullModel class implements the simplest type of predictive modeling technique. If all else fails it may
be reasonable to simply guess that y will take on its expected value or mean.

y = E[y] + ε        (6.7)

This could happen if the predictors x are not relevant, not collected in a useful range or the relationship is
too complex for the modeling techniques you have applied.

6.3.1 Model Equation


Ignoring the predictor variables x gives the following simple model equation.

y = b0 + ε        (6.8)

This intercept-only model is just a constant term plus the error/residual term.

6.3.2 Training
The training dataset in this case only consists of a response vector y. The error vector in this case is

ε = y − ŷ = y − b0 1        (6.9)

For Least Squares Estimation (LSE), the loss function L(b) can be set to half the sum of squared errors.

L(b) = ½ sse = ½ ‖ε‖² = ½ ε · ε        (6.10)
Substituting for ε gives

L(b) = ½ (y − b0 1) · (y − b0 1)        (6.11)

6.3.3 Optimization - Derivative


A function can be optimized using Calculus by taking the first derivative and setting it equal to zero. If the
second derivative is positive (negative) it will be minimal (maximal).
In particular, the derivative product rule (for dot products) may be used.

(f · g)′ = f′ · g + f · g′
(f · f)′ = 2 f′ · f

Dividing by 2 gives

½ (f · f)′ = f′ · f        (6.12)
2
Taking the derivative w.r.t. b0, dL/db0, using the derivative product rule and setting it equal to zero yields
the following equation.

dL/db0 = −1 · (y − b0 1) = 0

Therefore, the optimal value for the parameter b0 is

b0 = (1 · y) / (1 · 1) = (1 · y) / m = µy        (6.13)
This shows that the optimal value for the parameter is the mean of the response vector.
In ScalaTion this requires just one line of code inside the train method.
1 def train ( x_null : MatrixD = null , y_ : VectorD = y ) : Unit =
2 b = VectorD ( y_ . mean ) // parameter vector [ b0 ]
3 end train

After values for the model parameters are determined, it is important to assess the Quality of Fit (QoF).
The test method will compute the residual/error vector ε and then call the diagnose method.
1 def test ( x_null : MatrixD = null , y_ : VectorD = y ) : ( VectorD , VectorD ) =
2 val yp = VectorD . fill ( y_ . dim ) ( b (0) ) // y predicted for ( test / full )
3 ( yp , diagnose ( y_ , yp ) ) // return predictions and QoF
4 end test

The coefficient of determination R2 for the null regression model is always 0, i.e., none of variance in the
random variable y is explained by the model. A more sophisticated model should only be used if it is better
than the null model, that is when its R2 is strictly greater than zero. Also, a model can have a negative R2
if its predictions are worse than guessing the mean.
Finally, the predict method is simply:
1 def predict ( z : VectorD ) : Double = b (0)

6.3.4 Example Calculation


For the training data shown below, the optimal value for the intercept parameter is b0 = µy = 11/4 = 2.75. The
table below shows the values of x, y, ŷ, ε, and ε² for the Null Model,

y = 2.75 + ε        (6.14)

Table 6.2: Null Model: Example Training Data

x     y     ŷ       ε        ε²

1     1     11/4    −7/4     49/16
2     3     11/4    1/4      1/16
3     3     11/4    1/4      1/16
4     4     11/4    5/4      25/16

10    11    11      0        19/4 = 4.75

The sum of squared errors (sse) is given in the lower right corner of the table. The sum of squares total for
this dataset is 4.75, so

R² = 1 − sse/sst = 1 − 4.75/4.75 = 0
The plot below illustrates how the Null Model attempts to fit the four given data points.

[Figure: Null Model line vs. the four data points, y plotted against x]

6.3.5 NullModel Class

Class Methods:
1 @ param y the response / output vector
2

3 class NullModel ( y : VectorD )


4 extends Predictor ( MatrixD . one ( y . dim ) , y , Array ( " one " ) , null )
5 with Fit ( dfm = 1 , df = y . dim )
6 with NoSubModels :
7

8 def train ( x_null : MatrixD = null , y_ : VectorD = y ) : Unit =


9 def test ( x_null : MatrixD = null , y_ : VectorD = y ) : ( VectorD , VectorD ) =
10 override def predict ( z : VectorD ) : Double = b (0)
11 override def predict ( x_ : MatrixD ) : VectorD = VectorD . fill ( x_ . dim ) ( b (0) )
12 override def summary ( x_ : MatrixD = getX , fname_ : Array [ String ] = fname ,
13 b_ : VectorD = b , vifs : VectorD = vif () ) : String =

6.3.6 Exercises
1. Determine the value of the second derivative of the loss function

d²L/db0² = ?
at the critical point b0 = µy . What kind of critical point is this?

2. Let the response vector y be
1 val y = VectorD (1 , 3 , 3 , 4)

and execute the NullModel.


For context, assume the corresponding predictor vector x is
1 val x = VectorD (1 , 2 , 3 , 4)

Draw an xy plot of the data points. Give the value for the parameter vector b. Show the error distance
for each point in the plot. Compare the sum of squared errors sse with the sum of squares total sst.
What is the value for the coefficient of determination R2 ?

3. Using ScalaTion, analyze the NullModel for the following response vector y.
1 val y = VectorD (2.0 , 3.0 , 5.0 , 4.0 , 6.0) // response vector y
2 println ( s " y = $y " )
3

4 val mod = new NullModel ( y ) // create a null model


5 mod . trainNtest () () // train and test the model
6

7 val z = VectorD (5.0) // predict y for one point


8 val yp = mod . predict ( z ) // yp (y - predicted or y - hat )
9 println ( s " predict (z) =\ yp " )

4. Execute the NullModel on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG. What
is the quality of the fit (e.g., R2 or rSq)? Is this value expected? Is it possible for a model to perform
worse than this?

6.4 Simpler Regression
The SimplerRegression class supports simpler linear regression. In this case, the predictor vector x consists
of a single variable x0 , i.e., x = [x0 ] and there is only a single parameter that is the coefficient for x0 in the
model.

6.4.1 Model Equation


The goal is to fit the parameter vector b = [b0 ] in the following model/regression equation,

y = b · x + ε = b0 x0 + ε        (6.15)

where ε represents the residuals/errors (the part not explained by the model).

6.4.2 Training
A dataset may be collected for providing an estimate for parameter b0 . Given m data points, stored in an
m-dimensional vector x0 and m response values, stored in an m-dimensional vector y, we may obtain the
following vector equation.

y = b0 x0 + ε        (6.16)
One way to find a value for parameter b0 is to minimize the norm of the residual/error vector ε.

minb0 ‖ε‖        (6.17)


Since ε = y − b0 x0, we may solve the following optimization problem:

minb0 ‖y − b0 x0‖        (6.18)

This is equivalent to minimizing half the dot product (½ ‖ε‖² = ½ ε · ε = ½ sse). Thus the loss function is

L(b) = ½ (y − b0 x0) · (y − b0 x0)        (6.19)

6.4.3 Optimization - Derivative


Again, a function can be optimized using Calculus by taking the first derivative and setting it equal to zero.
If the second derivative is positive (negative) it will be minimal (maximal). Taking the derivative w.r.t. b0,
dL/db0, using the derivative product rule (for dot products) gives

½ (f · f)′ = f′ · f

and setting it equal to zero yields the following equation.

dL/db0 = −x0 · (y − b0 x0) = 0        (6.20)

Therefore, the optimal value for the parameter b0 is

b0 = (x0 · y) / (x0 · x0)        (6.21)
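
A quick numeric check of Equation 6.21 with ScalaTion vectors (using the same data as the example calculation below):

import scalation.mathstat.VectorD

val x0 = VectorD (1, 2, 3, 4)
val y  = VectorD (1, 3, 3, 4)
val b0 = (x0 dot y) / (x0 dot x0)                    // = 32/30 = 16/15 ≈ 1.067
println (s"b0 = $b0")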

6.4.4 Example Calculation

Consider the following data points {(1, 1), (2, 3), (3, 3), (4, 4)} and solve for the parameter (slope) b0 .

b0 = ([1, 2, 3, 4] · [1, 3, 3, 4]) / ([1, 2, 3, 4] · [1, 2, 3, 4]) = 32/30 = 16/15

Using this optimal value for the parameter b0 = 16/15, we may obtain predicted values for each of the x-values.

ŷ = predict(x0) = b0 x0 = [1.067, 2.133, 3.200, 4.267]

Therefore, the error/residual vector is

ε = y − ŷ = [1, 3, 3, 4] − [1.067, 2.133, 3.200, 4.267] = [−0.067, 0.867, −0.2, −0.267]

The table below shows the values of x, y, ŷ, ε, and ε² for the Simpler Regression Model,

y = (16/15) · [x] + ε = (16/15) x + ε

Table 6.3: Simpler Regression Model: Example Training Data

x     y     ŷ        ε        ε²

1     1     16/15    −1/15    1/225
2     3     32/15    13/15    169/225
3     3     48/15    −3/15    9/225
4     4     64/15    −4/15    16/225

10    11    160/15   5/15     13/15 = 0.867

The sum of squared errors (sse) is given in the lower right corner of the table. The sum of squares total for
this dataset is 4.75, so

R² = 1 − sse/sst = 1 − 0.867/4.75 = 0.817

The plot below illustrates how the Simpler Regression Model attempts to fit the four given data points.

[Figure: Simpler Regression Model line vs. the four data points, y plotted against x]

Note, this model has no intercept. This makes the solution for the parameter very easy, but may
make the model less accurate. This is remedied in the next section. Since no intercept really means the
intercept is zero, the regression line will go through the origin. This is referred to as Regression Through
the Origin (RTO) and should only be applied when the data scientist has reason to believe it makes sense.

6.4.5 SimplerRegression Class

Class Methods:
1 @ param x the data / input matrix ( only use the first column )
2 @ param y the response / output vector
3 @ param fname_ the feature / variable names ( only use the first name )
4

5 class SimplerRegression ( x : MatrixD , y : VectorD , fname_ : Array [ String ] = null )


6 extends Predictor (x , y , if fname_ = = null then null else fname_ . slice (0 , 1) ,
7 null )
8 with Fit ( dfm = 1 , df = x . dim - 1)
9 with NoSubModels :
10

11 def train ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit =


12 def test ( x_ : MatrixD = x , y_ : VectorD = y ) : ( VectorD , VectorD ) =
13 override def summary ( x_ : MatrixD = getX , fname_ : Array [ String ] = fname ,
14 b_ : VectorD = b , vifs : VectorD = vif () ) : String =

6.4.6 Exercises
1. For x0 = [1, 2, 3, 4] and y = [1, 3, 3, 4], try various values for the parameter b0 . Plot the sum of squared
errors (sse) vs. b0 . Note, the code must be completed before it is compiled and run.

1 import scalation . mathstat . _
2

3 @ main def simplerRegression_exer_1 () : Unit =


4

5 val x0 = VectorD (1 , 2 , 3 , 4)
6 val y = VectorD (1 , 3 , 3 , 4)
7 val b0 = VectorD . range (0 , 50) / 25.0
8 val sse = new VectorD ( b0 . dim )
9 for i <- b0 . indices do
10 val e = ?
11 sse ( i ) = e dot e
12 end for
13 new Plot ( b0 , sse , lines = true )
14

15 end simplerRegression_exer_1

Where do you think the minimum occurs?


Note, to run your code you may use my_scalation outside of ScalaTion. Make sure its lib directory
has ScalaTion's jar file. Create a file called SimplerRegression_exer_1.scala in your
src/main/scala directory. In your project's base directory, type sbt. Within sbt type compile and
then run.

2. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points. What is the slope of this line? Pass the X matrix and y vector as
arguments to the SimplerRegression class to obtain the b = [b0 ] vector.
1 // 4 data points : x0
2 val x = MatrixD ((4 , 1) , 1 , // x 4 - by -1 matrix
3 2,
4 3,
5 4)
6 val y = VectorD (1 , 3 , 3 , 4) // y vector
7

8 val mod = new SimplerRegression (x , y ) // create a simpler regression


9 mod . trainNtest () () // train and test the model
10

11 val yp = mod . predict ( x )


12 new Plot ( x (? , 0) , y , yp , lines = true ) // black for y and red for yp

An alternative to using the above constructor new SimplerRegression is to use a factory method
SimplerRegression. Substitute in the following lines of code to do this.
1 val x = VectorD (1 , 2 , 3 , 4)
2 val rg = SimplerRegression (x , y , null )
3 new Plot (x , y , yp , lines = true )

3. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points and intersects the origin [0, 0]. What is the slope of this line? Pass the
X matrix and y vector as arguments to the SimplerRegression class to obtain the b = [b0 ] vector.
1 // 5 data points : x0
2 val x = MatrixD ((5 , 1) , 0 , // x 5 - by -1 matrix
3 1,

4 2,
5 3,
6 4)
7 val y = VectorD (2 , 3 , 5 , 4 , 6) // y vector
8

9 val mod = new S i m p l e r R e g r e s s i o n (x , y ) // create a simpler regression


model
10 mod . trainNtest () () // train and test the model
11

12 val z = VectorD (5) // predict y for one point


13 val yp = mod . predict ( z ) // y - predicted
14 println ( s " predict (z) = $yp " )

4. Execute the SimplerRegression on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG.
What is the quality of the fit (e.g., R2 or rSq)? Is this value expected? What does it say about this
model? Try using different columns for the predictor variable.
5. Compute the second derivative of the loss function w.r.t. b0, d²L/db0². Under what conditions will it be
positive?

6.5 Simple Regression
The SimpleRegression class supports simple linear regression. It combines the benefits of the last two mod-
eling techniques: the intercept model NullModel and the slope model SimplerRegression. It is guaranteed
to be at least as good as the better of these two modeling techniques. In this case, the predictor vector
x ∈ R2 consists of the constant one and a single variable x1 , i.e., [1, x1 ], so there are now two parameters
b = [b0 , b1 ] ∈ R2 in the model.

6.5.1 Model Equation


The goal is to fit the parameter vector b in the model/regression equation,

y = b · x + ε = [b0, b1] · [1, x1] + ε = b0 + b1 x1 + ε        (6.22)

where ε represents the residuals (the part not explained by the model).

6.5.2 Training
The model is trained on a dataset consisting of m data points/vectors, stored row-wise in an m-by-2 matrix
X ∈ Rm×2 and m response values, stored in an m dimensional vector y ∈ Rm .

y = Xb + ε        (6.23)

The parameter vector b may be determined by solving the following optimization problem:

minb ‖ε‖        (6.24)

Substituting ε = y − ŷ = y − Xb yields

minb ‖y − Xb‖

Using the fact that the matrix X consists of two column vectors 1 and x1, it can be rewritten,

min[b0,b1] ‖y − [1 x1] [b0, b1]ᵀ‖

min[b0,b1] ‖y − (b0 1 + b1 x1)‖        (6.25)

This is equivalent to minimizing the dot product (‖ε‖² = ε · ε = sse)

(y − (b0 1 + b1 x1)) · (y − (b0 1 + b1 x1))        (6.26)

Since x0 is just 1, for simplicity we drop the subscript on x1. Thus the loss function ½ sse is

L(b) = ½ (y − (b0 1 + b1 x)) · (y − (b0 1 + b1 x))        (6.27)

6.5.3 Optimization - Gradient
A function of several variables can be optimized using Vector Calculus by setting its gradient (see the Linear
Algebra Chapter) equal to zero and solving the resulting system of equations. When the system of equations
is linear, matrix factorization may be used, otherwise techniques from Nonlinear Optimization may be
needed.
Taking the gradient of the loss function L gives
 
∇L = [∂L/∂b0, ∂L/∂b1]        (6.28)

The goal is to find the value of the parameter vector b that yields a zero gradient (flat response surface).
Setting the gradient equal to zero (0 = [0, 0]) yields two equations.

∇L(b) = [∂L/∂b0(b), ∂L/∂b1(b)] = 0        (6.29)
The gradient (the two partial derivatives) may be determined using the derivative product rule for dot
products.

½ (f · f)′ = f′ · f

Partial Derivative w.r.t. b0



The first equation results from setting ∂L/∂b0 to zero.

−1 · (y − (b0 1 + b1 x)) = 0
1 · y − 1 · (b0 1 + b1 x) = 0
b0 1 · 1 = 1 · y − b1 1 · x

Since 1 · 1 = m, b0 may be expressed as

b0 = (1 · y − b1 1 · x) / m        (6.30)

Partial Derivative w.r.t. b1



Similarly, the second equation results from setting ∂L/∂b1 to zero.

−x · (y − (b0 1 + b1 x)) = 0
x · y − x · (b0 1 + b1 x) = 0
b0 1 · x + b1 x · x = x · y

Multiplying both sides by m produces

m b0 1 · x + m b1 x · x = m x · y        (6.31)

Substituting for m b0 = 1 · y − b1 1 · x yields

[1 · y − b1 1 · x] 1 · x + m b1 x · x = m x · y
b1 [m x · x − (1 · x)²] = m x · y − (1 · x)(1 · y)

Solving for b1 gives

b1 = [m x · y − (1 · x)(1 · y)] / [m x · x − (1 · x)²]        (6.32)

The b0 parameter gives the intercept, while the b1 parameter gives the slope of the line that best fits the data
points.
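
A quick numeric check of Equations 6.30 and 6.32 with ScalaTion vectors (the same data as in the example calculation below):

import scalation.mathstat.VectorD

val x = VectorD (1, 2, 3, 4)
val y = VectorD (1, 3, 3, 4)
val m = x.dim
val b1 = (m * (x dot y) - x.sum * y.sum) / (m * (x dot x) - x.sum * x.sum)   // slope     = 0.9
val b0 = (y.sum - b1 * x.sum) / m                                            // intercept = 0.5
println (s"b = [$b0, $b1]")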

6.5.4 Example Calculation

Consider again the problem from the last section where the data points are {(1, 1), (2, 3), (3, 3), (4, 4)} and
solve for the two parameters, (intercept) b0 and (slope) b1 .

b1 = [4 ([1, 2, 3, 4] · [1, 3, 3, 4]) − (1 · [1, 2, 3, 4])(1 · [1, 3, 3, 4])] / [4 ([1, 2, 3, 4] · [1, 2, 3, 4]) − (1 · [1, 2, 3, 4])²]
   = (128 − 110) / (120 − 100) = 18/20 = 0.9

b0 = [1 · [1, 3, 3, 4] − 0.9 (1 · [1, 2, 3, 4])] / 4 = (11 − 0.9 · 10) / 4 = 0.5

Table 6.4 below shows the values of x, y, ŷ, ε, and ε² for the Simple Regression Model,

y = [0.5, 0.9] · [1, x] + ε = 0.5 + 0.9x + ε

Table 6.4: Simple Regression Model: Example Training Data

x y ŷ  2
1 1 1.4 -0.4 0.16
2 3 2.3 0.7 0.49
3 3 3.2 -0.2 0.04
4 4 4.1 -0.1 0.01
10 11 11 0 0.7

For which models (NullModel, SimplerRegression and SimpleRegression) did the residual/error vector
ε sum to zero?
The sum of squared errors (sse) is given in the lower right corner of the table. The sum of squares total
for this dataset is 4.75, so the Coefficient of Determination is

R² = 1 − sse/sst = 1 − 0.7/4.75 = 0.853

The plot below illustrates how the Simple Regression Model (SimpleRegression) attempts to fit the
four given data points.

[Figure: Simple Regression Model line vs. the four data points, y plotted against x]

Concise Formulas for the Parameters

More concise and intuitive formulas for the parameters b0 and b1 may be derived.

• Using the definition for mean from Chapter 3 for µx and µy , it can be shown that the expression for
b0 shortens to

b0 = µy − b1 µx (6.33)

Draw a line through the following two points [0, b0 ] (the intercept) and [µx , µy ] (the center of mass).
How does this line compare to the regression line?

• Now, using the definitions for covariance σx,y and variance σx2 from Chapter 3, it can be shown that
the expression for b1 shortens to

b1 = σx,y / σx²        (6.34)

If the slope of the regression line is simply the ratio of the covariance to the variance, what would the
slope be if y = x? It may also be written as follows:

b1 = Sxy / Sxx        (6.35)

where Sxy = Σi (xi − µx)(yi − µy) and Sxx = Σi (xi − µx)².

Table 6.5 extends the previous table to facilitate computing the parameter vector b using the concise
formulas.

Table 6.5: Simple Regression Model: Expanded Table with Centering µx = 2.5, µy = 2.75

x    x − µx    y    y − µy    ŷ      ε      ε²
1 -1.5 1 -1.75 1.4 -0.4 0.16
2 -0.5 3 0.25 2.3 0.7 0.49
3 0.5 3 0.25 3.2 -0.2 0.04
4 1.5 4 1.25 4.1 -0.1 0.01
10 0 11 0 11 0 0.7

Sxx = Σi (xi − µx)² = 1.5² + 0.5² + 0.5² + 1.5² = 5

Syy = Σi (yi − µy)² = 1.75² + 0.25² + 0.25² + 1.25² = 4.75

Sxy = Σi (xi − µx)(yi − µy) = (−1.5 · −1.75) + (−0.5 · 0.25) + (0.5 · 0.25) + (1.5 · 1.25) = 4.5

Therefore,

b1 = Sxy / Sxx = 4.5 / 5 = 0.9

b0 = µy − b1 µx = 2.75 − 0.9 · 2.5 = 2.75 − 2.25 = 0.5 (6.36)

Furthermore, it facilitates computing sst = Syy = 4.75.

6.5.5 Exploratory Data Analysis


As discussed in Chapter 1, Exploratory Data Analysis (EDA) should be performed after preprocessing the
dataset. Once the response variable y is selected, a null model should be created to see in a plot where the
data points lie compared to the mean. The code below shows how to do this for the AutoMPG dataset.
1 import Example_AutoMPG .{ xy , x , y , x_fname }
2

3 banner ( " Null Model " )


4 val nm = new NullModel ( y )
5 nm . trainNtest () () // train and test the model
6 val yp = nm . predict ( x )
7 new Plot ( null , y , yp , " EDA : y and yp ( red ) vs . t " , lines = true )

Next the relationships between the predictor variables xj (the columns in the input/data matrix X) should
be compared. If two of the predictor variables are highly correlated, their individual effects on the response
variable y may be indistinguishable. The correlations between the predictor variables may be seen by
examining the correlation matrix. Including the response variable in a combined data matrix xy allows one
to see how each predictor variable is correlated with the response.
1 banner ( " Correlation Matrix for Columns of xy " )
2 println ( s " x_fname = ${ stringOf ( x_fname ) } " )

3 println ( s " y_name = MPG " )
4 println ( s " xy . corr = ${ xy . corr } " )

Although Simple Regression may be too simple for many problems/datasets, it should be used in Ex-
ploratory Data Analysis (EDA). A simple regression model should be created for each predictor variable xj .
The data points and the best fitting line should be plotted with y on the vertical axis and xj on the hori-
zontal axis. The data scientist should look for patterns/tendencies of y versus xj , such as linear, quadratic,
logarithmic, or exponential patterns. When there is no relationship, the points will appear to be randomly
and uniformly positioned in the plane.
1 for j <- x . indices2 do
2 banner ( s " Plot response y vs . predictor variable ${ x_fname ( j ) } " )
3 val xj = x (? , j )
4 val mod = SimpleRegression ( xj , y , Array ( " one " , x_fname ( j ) ) )
5 mod . trainNtest () () // train and test model
6 val yp = mod . predict ( mod . getX )
7 new Plot ( xj , y , yp , s " EDA : y and yp ( red ) vs . ${ x_fname ( j ) } " , lines = true )
8 end for

The Figure below shows four possible patterns: Linear (blue), Quadratic (purple), Inverse (green), Inverse-
Square (black). Each curve depicts a function 1 + xp , for p = −2, −1, 1, 2.

[Figure: Finding a Pattern: Linear (blue), Quadratic (purple), Inverse (green), Inverse-Square (black); y plotted against x]

To look for quadratic patterns, the following code regresses on the square of each predictor variable (i.e., xj²).
1 for j <- x . indices2 do
2 banner ( s " Plot response y vs . predictor variable ${ x_fname ( j ) } " )
3 val xj = x (? , j )
4 val mod = SimpleRegression . quadratic ( xj , y , Array ( " one " , x_fname ( j ) + " ^2 " ) )
5 mod . trainNtest () () // train and test model
6 val yp = mod . predict ( mod . getX )
7 new Plot ( xj , y , yp , s " EDA : y and yp ( red ) vs . ${ x_fname ( j ) } " , lines = false )
8 end for

To determine the effect of having linear and quadratic terms (both xj and xj²) the Regression class that
supports Multiple Linear Regression or the SymbolicRegression object may be used. Generally, one could
include both terms if there is sufficient improvement over just using one term. If one term is chosen, use
the linear term unless the quadratic term is sufficiently better (see the section on Symbolic Regression for a
more detailed discussion).

Plotting

The Plot and PlotM classes in the mathstat package can be used for plotting data and results. Both use
ZoomablePanel in the scala2d package to support zooming and dragging. The mouse wheel controls the
amount of zooming (scroll value where up is negative and down is positive), while mouse dragging repositions
the objects in the panel (drawing canvas).
1 @ param x the x vector of data values ( horizontal ) , use null to use y's index
2 @ param y the y vector of data values ( primary vertical , black )
3 @ param z the z vector of data values ( secondary vertical , red ) to compare with y
4 @ param _title the title of the plot
5 @ param lines flag for generating a line plot
6

7 class Plot ( x : VectorD , y : VectorD , z : VectorD = null , _title : String = " Plot y vs . x " ,
8 lines : Boolean = false )
9 extends VizFrame ( _title , null ) :

1 @ param x_ the x vector of data values ( horizontal )


2 @ param y_ the y matrix of data values where y ( i ) is the i - th vector ( vertical )
3 @ param label the label / legend / key for each curve in the plot
4 @ param _title the title of the plot
5 @ param lines flag for generating a line plot
6

7 class PlotM ( x_ : VectorD , y_ : MatrixD , var label : Array [ String ] = null ,


8 _title : String = " PlotM y_i vs . x for each i " , lines : Boolean = false )
9 extends VizFrame ( _title , null ) :
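
A minimal usage sketch for Plot is shown below (it requires a display); PlotM works analogously with a matrix whose rows are the curves to draw.

import scalation.mathstat.{VectorD, Plot}

val x  = VectorD (1, 2, 3, 4)
val y  = VectorD (1, 3, 3, 4)                        // actual values (black)
val yp = VectorD (1.4, 2.3, 3.2, 4.1)                // predicted values (red)
new Plot (x, y, yp, "y and yp (red) vs. x", lines = true)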

6.5.6 SimpleRegression Class

Class Methods:
1 @ param x the data / input matrix augmented with a first column of ones
2 ( only use the first two columns [1 , x1 ])
3 @ param y the response / output vector
4 @ param fname_ the feature / variable names ( only use the first two names )
5

6 class SimpleRegression ( x : MatrixD , y : VectorD , fname_ : Array [ String ] = null )


7 extends Predictor (x , y , if fname_ = = null then null else fname_ . slice (0 , 2) ,
8 null )
9 with Fit ( dfm = 1 , df = x . dim - 2)
10 with NoSubModels :
11

12 def train ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit =


13 def test ( x_ : MatrixD = x , y_ : VectorD = y ) : ( VectorD , VectorD ) =
14 override def summary ( x_ : MatrixD = getX , fname_ : Array [ String ] = fname ,

15 b_ : VectorD = b , vifs : VectorD = vif () ) : String =
16 def confInterval ( x_ : MatrixD = getX ) : VectorD =

6.5.7 Exercises
1. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points (i.e., that minimizes ‖ε‖). Using the formulas developed in this section,
what are the intercept and slope [b0, b1] of this line?
Also, pass the X matrix and y vector as arguments to the SimpleRegression class to obtain the b
vector.
1 // 4 data points : one x1
2 val x = MatrixD ((4 , 2) , 1 , 1 , // x 4 - by -2 matrix
3 1, 2,
4 1, 3,
5 1 , 4)
6 val y = VectorD (1 , 3 , 3 , 4) // y vector
7

8 val mod = new SimpleRegression (x , y ) // create a simple regression model
9 mod . trainNtest () () // train and test the model
10

11 val yp = mod . predict ( x )


12 new Plot ( x (? , 1) , y , yp , " plot y and yp vs . x " , lines = true )

2. For more complex models, setting the gradient to zero and solving a system of simultaneous equations
may not work, in which case more general optimization techniques may be applied. Two simple
optimization techniques are grid search and gradient descent.
For grid search, in a spreadsheet set up a 5-by-5 grid around the optimal point for b, found in the
previous problem. Compute values for the loss function L = ½ sse for each point in the grid. Plot L
versus b0 across the optimal point. Do the same for b1. Make a 3D plot of the surface L as a function
of b0 and b1.
For gradient descent, pick a starting point for b, compute the gradient ∇L and move by −η∇L,
where η is the learning rate (e.g., 0.1). Repeat for a few iterations. What is happening to the value of
the loss function L = ½ sse?

∇L = [−1 · (y − (b0 1 + b1 x)), −x · (y − (b0 1 + b1 x))]

Substituting ε = y − (b0 1 + b1 x), ∇L may be written as

[−1 · ε, −x · ε]

3. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points. What are the intercept and slope of this line? Pass the X matrix and
y vector as arguments to the SimpleRegression class to obtain the b vector.

1 // 5 data points : one x1
2 val x = MatrixD ((5 , 2) , 1 , 0 , // x 5 - by -2 matrix
3 1, 1,
4 1, 2,
5 1, 3,
6 1 , 4)
7 val y = VectorD (2 , 3 , 5 , 4 , 6) // y vector
8

9 val mod = new SimpleRegression (x , y ) // create a simple regression


10 mod . trainNtest () () // train and test the model
11

12 val yp = mod . predict ( x )


13 new Plot ( x (? , 1) , y , yp , " plot y and yp vs . x " , lines = true )

4. Execute SimpleRegression on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG.


What is the quality of the fit (e.g., R2 or rSq)? Is this value expected? Try using different columns for
the predictor variable. Plot y and yp vs. xj for each feature/predictor variable. How do the results
relate to information given in the correlation matrix?

5. Let the errors εi have E[εi] = 0 and V[εi] = σ², and be independent of each other. Show that the variances
for the parameters b0 and b1 are as follows:

V[b1] = σ² / Sxx

Hint: V[b1] = V[Sxy / Sxx] = Sxx⁻² V[Sxy].

V[b0] = (1/m + µx² / Sxx) σ²

Hint: V[b0] = V[µy] + µx² V[b1]

6. Further assume that εi ∼ N(0, σ²). Show that the confidence intervals for the parameters b0 and b1
are as follows:

b1 ± t* (s / √Sxx)

Hint: Let the error variance estimator be s² = sse / (m − 2) = mse.

b0 ± t* s √(1/m + µx² / Sxx)

7. For the following simple dataset,


1 val x = VectorD (1 , 2 , 3 , 4 , 5)
2 val y = VectorD (1 , 3 , 3 , 5 , 4)
3 val ox = MatrixD . one ( x . dim ) :^+ x

estimate the error variance s² = sse / (m − 2) = mse. Take its square root to obtain the residual standard
error s. Use these to compute 95% confidence intervals for the parameters b0 and b1.

8. Consider the above simple dataset, but where the y values are reversed so the slope is negative and
the fit line is away from the origin,
1 val x = VectorD (1 , 2 , 3 , 4 , 5)
2 val y = VectorD (4 , 5 , 3 , 3 , 1)
3 val ox = MatrixD . one ( x . dim ) :^+ x

Compare the SimplerRegression model with the SimpleRegression model. Examine the QoF mea-
sures for each model and make an argument for which model to pick. Also compute R0² (R² relative
to 0)

R0² = 1 − ‖y − ŷ‖² / ‖y‖²        (6.37)

Recall the previous definition for R2 .

ky − ŷk2
R2 = 1 − (6.38)
ky − µy k2

For Regression Through the Origin (RTO) some software packages use R02 in place of R2 . See
[43] for a deeper discussion of the issues involved, including when it is appropriate to not include an
intercept b0 in the model. ScalaTion provides functions for both in the FitM trait: def rSq (the
default) and def rSq0 .

6.6 Regression
The Regression class supports multiple linear regression where multiple input/predictor variables are used
to predict a value for the response/output variable. When the response variable has non-zero correlations
with multiple predictor variables, this technique tends to be effective, efficient and leads to explainable
models. It should typically be applied in combination with more complex modeling techniques. In this case,
the predictor vector x is multi-dimensional [1, x1 , . . . , xk ] ∈ Rn , so the parameter vector b = [b0 , b1 , . . . , bk ] has
the same dimension as x, while the response y is a scalar.

[Figure 6.1 shows a network-style diagram: input nodes x0 , x1 , x2 connect to the output node y via edge weights b0 , b1 , b2 , with bias β.]

Figure 6.1: Multiple Linear Regression

The intercept can be provided by fixing x0 to one, making b0 the intercept. Alternatively, x0 can be used
as a regular input variable by introducing another parameter β for the intercept. In Neural Networks, β is
referred to as bias and bj is referred to as the edge weight connecting input vertex/node j to the output
node as shown in Figure 6.1. Note, if an activation function fa is added to the model, the Multiple Linear
Regression model becomes a Perceptron model.

6.6.1 Model Equation


The goal is to fit the parameter vector b in the model/regression equation,

y = b · x + ε = b0 + b1 x1 + ... + bk xk + ε (6.39)
where ε represents the residuals (the part not explained by the model).

6.6.2 Training
Using several data samples as a training set (X, y), the Regression class in ScalaTion can be used to
estimate the parameter vector b. Each sample pairs an x input vector with a y response value. The x vectors
are placed into a data/input matrix X ∈ Rm×n row-by-row with a column of ones as the first column in X.
The individual response values taken together form the response vector y ∈ Rm .
The training diagram shown in Figure 6.2 illustrates how the ith instance/row flows through the diagram
computing the predicted response ŷ = b · x and the error ε = y − ŷ.

[Figure 6.2 shows the training diagram: each instance (xi0 , xi1 , xi2 , yi ) from the dataset (X, y) is fed through the weighted sum b · x to produce the prediction ŷ, which is compared with y to form the error ε = y − ŷ.]

Figure 6.2: Training Diagram for Regression

The matrix-vector product Xb provides an estimate for the response vector ŷ.

y = Xb + ε (6.40)

The goal is to minimize the distance between y and its estimate ŷ, i.e., to minimize the norm of the
residual/error vector.

$$\min_b \|\epsilon\| \qquad (6.41)$$

Substituting ε = y − ŷ = y − Xb yields

$$\min_b \|y - Xb\| \qquad (6.42)$$

This is equivalent to minimizing half the dot product of the error vector with itself (½‖ε‖² = ½ ε · ε = ½ sse).
Thus, the loss function is

$$L(b) = \frac{1}{2}\, (y - Xb) \cdot (y - Xb) \qquad (6.43)$$

6.6.3 Optimization - Gradient


The gradient of the loss function ∇L with respect to the parameter vector b is the vector of partial derivatives.
 
$$\nabla L = \left[ \frac{\partial L}{\partial b_0}, \frac{\partial L}{\partial b_1}, \ldots, \frac{\partial L}{\partial b_k} \right] \qquad (6.44)$$

Again using the product rule for dot products

$$\frac{1}{2}(f \cdot f)' = f' \cdot f \qquad (6.45)$$

yields the j-th partial derivative.

$$\frac{\partial L}{\partial b_j} = -x_{:j} \cdot (y - Xb) = -x_{:j}^{\top}(y - Xb) \qquad (6.46)$$

Notice that the parameter bj is only multiplied by column x:j in the matrix-vector product Xb. The dot
product is equivalent to a transpose operation followed by matrix multiplication. The gradient is formed by
collecting all these partial derivatives together.

$$\nabla L = -X^{\top}(y - Xb) \qquad (6.47)$$
Now, setting the gradient equal to the zero vector 0 ∈ Rn yields

$$-X^{\top}(y - Xb) = 0$$

$$-X^{\top}y + (X^{\top}X)b = 0$$

A more detailed derivation of this equation is given in section 3.4 of “Matrix Calculus: Derivation and Simple
Application” [82]. Moving the term involving y to the right side results in the Normal Equations.

$$(X^{\top}X)b = X^{\top}y \qquad (6.48)$$
Note: minimizing the distance between y and Xb is equivalent to minimizing the sum of the squared
residuals/errors (the Least Squares method).
ScalaTion provides five techniques for solving for the parameter vector b based on the Normal Equa-
tions: Matrix Inversion, LU Factorization, Cholesky Factorization, QR Factorization and SVD Factorization.

6.6.4 Matrix Inversion Technique


Starting with the Normal Equations

$$(X^{\top}X)b = X^{\top}y$$

a simple technique is Matrix Inversion, which involves computing the inverse of $X^{\top}X$ and using it to multiply
both sides of the Normal Equations.

$$b = (X^{\top}X)^{-1} X^{\top}y \qquad (6.49)$$

where $(X^{\top}X)^{-1}$ is an n-by-n matrix, $X^{\top}$ is an n-by-m matrix and y is an m-vector. When X is full rank,
the expression above involving the X matrix may be referred to as the pseudo-inverse $X^+$.

$$X^+ = (X^{\top}X)^{-1} X^{\top}$$

When X is not full rank, Singular Value Decomposition may be applied to compute $X^+$. Using the pseudo-
inverse, the parameter vector b may be solved for as follows:

$$b = X^+ y \qquad (6.50)$$
The pseudo-inverse can be computed by first forming the product $X^{\top}X$. Gaussian Elimination can be
used to compute its inverse, which can then be multiplied by the transpose of X. In ScalaTion,
the computation for the pseudo-inverse (x_pinv) looks similar to the math.

1 val x_pinv = ( x .T * x ) . inverse * x .T

Most of the factorization classes/objects implement matrix inversion, including Fac_Inv, Fac_LU, Fac_Cholesky,
and Fac_QR. The default Fac_LU combines reasonable speed and robustness.
1 def inverse : MatrixD = Fac_LU . inverse ( this ) ()

For efficiency, the code in Regression does not calculate x_pinv; rather it directly solves for the parameters
b.
1 val b = fac . solve ( x .T * y )

The Hat Matrix

Starting with the solution to the Normal Equations that uses the inverse to determine the optimal parameter
vector b,

$$b = (X^{\top}X)^{-1} X^{\top}y \qquad (6.51)$$

one can substitute the right-hand side into the prediction equation ŷ = Xb

$$\hat{y} = X(X^{\top}X)^{-1} X^{\top}y = Hy \qquad (6.52)$$

where $H = X(X^{\top}X)^{-1} X^{\top}$ is the hat matrix (it puts a hat on y). The hat matrix may be viewed as a projection
matrix.
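As a quick illustration, the hat matrix and the resulting predictions can be computed directly with matrix operations; this is a sketch only, assuming a data matrix x (MatrixD) and response vector y (VectorD) are already in scope.

val h  = x * (x.T * x).inverse * x.T        // hat matrix H = X (X^T X)^{-1} X^T
val yp = h * y                              // predictions yp = H y (projection of y)
val e  = y - yp                             // residuals
println (s"yp = $yp")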

6.6.5 LU Factorization Technique


Lower-Upper Factorization (Decomposition) works like Matrix Inversion, except that it just reduces the
matrix to zeroes below the diagonal, so it tends to be faster and less prone to numerical instability. First
the product $X^{\top}X$, an n-by-n matrix, is factored

$$X^{\top}X = LU$$

where L is a lower left triangular n-by-n matrix and U is an upper right triangular n-by-n matrix. Then the
Normal Equations may be rewritten

$$LUb = X^{\top}y$$

Letting w = Ub allows the problem to be solved in two steps. The first is solved by forward substitution to
determine the vector w.

$$Lw = X^{\top}y$$

Finally, the parameter vector b is determined by backward substitution.

Ub = w

Example Calculation

Consider the example where the input/data matrix X and output/response vector y are as follows:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 4 \end{bmatrix}$$

Putting these values into the Normal Equations $(X^{\top}X)b = X^{\top}y$ yields the augmented matrix

$$\left[\begin{array}{cc|c} 4 & 10 & 11 \\ 10 & 30 & 32 \end{array}\right]$$

Multiply the first row by -2.5 and add it to the second row,

$$\left[\begin{array}{cc|c} 4 & 10 & 11 \\ 0 & 5 & 4.5 \end{array}\right]$$

Backward substitution then gives the optimal parameter vector b = [.5, .9]. Note, the product of L and U gives $X^{\top}X$.

$$\begin{bmatrix} 1 & 0 \\ 2.5 & 1 \end{bmatrix} \begin{bmatrix} 4 & 10 \\ 0 & 5 \end{bmatrix} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$$
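The numbers above can be checked with a short sketch that forms the Normal Equations for this 4-by-2 example and solves them with Fac_LU, following the factor/solve pattern of the train method shown later in this section (the import location of the factorization classes is an assumption):

import scalation.mathstat.{Fac_LU, MatrixD, VectorD}

val x = MatrixD ((4, 2), 1, 1,                      // the example data matrix (intercept column first)
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (1, 3, 3, 4)                        // the example response vector

val fac = new Fac_LU (x.T * x)                      // factor X^T X = L U
fac.factor ()
val b = fac.solve (x.T * y)                         // solve (X^T X) b = X^T y by substitution
println (s"b = $b")                                 // expect b = [.5, .9]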

6.6.6 Cholesky Factorization Technique


A faster and slightly more stable technique is to use Cholesky Factorization. Since the product $X^{\top}X$ is a
positive definite, symmetric matrix, it may be factored using Cholesky Factorization into

$$X^{\top}X = LL^{\top}$$

where L is a lower triangular n-by-n matrix. Then the Normal Equations may be rewritten

$$LL^{\top}b = X^{\top}y$$

Letting $w = L^{\top}b$, we may solve for w using forward substitution

$$Lw = X^{\top}y$$

and then solve for b using backward substitution.

$$L^{\top}b = w$$

As an example, the product of L and its transpose $L^{\top}$ gives $X^{\top}X$.

$$\begin{bmatrix} 2 & 0 \\ 5 & \sqrt{5} \end{bmatrix} \begin{bmatrix} 2 & 5 \\ 0 & \sqrt{5} \end{bmatrix} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$$

Therefore, w can be determined by forward substitution and b by backward substitution.

$$\begin{bmatrix} 2 & 0 \\ 5 & \sqrt{5} \end{bmatrix} w = \begin{bmatrix} 11 \\ 32 \end{bmatrix}, \qquad \begin{bmatrix} 2 & 5 \\ 0 & \sqrt{5} \end{bmatrix} b = w$$
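A corresponding sketch with Fac_Cholesky, reusing the x and y from the LU sketch above (again, the factor/solve usage mirrors the train method and the import location is an assumption):

import scalation.mathstat.Fac_Cholesky

val fac = new Fac_Cholesky (x.T * x)                // factor X^T X = L L^T
fac.factor ()
val b = fac.solve (x.T * y)                         // forward then backward substitution
println (s"b = $b")                                 // expect b = [.5, .9]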

6.6.7 QR Factorization Technique


A slightly slower, but even more robust technique is to use QR Factorization. Using this technique, the
m-by-n X matrix can be factored directly, which increases the stability of the technique.

X = QR

where Q is an orthogonal m-by-n matrix and R is a right upper triangular n-by-n matrix. Starting
again with the Normal Equations,

$$(X^{\top}X)b = X^{\top}y$$

simply substitute QR for X.

$$(QR)^{\top}QRb = (QR)^{\top}y$$

Taking the transpose gives

$$R^{\top}Q^{\top}QRb = R^{\top}Q^{\top}y$$

and using the fact that $Q^{\top}Q = I$, we obtain the following:

$$R^{\top}Rb = R^{\top}Q^{\top}y$$

Multiplying both sides by $(R^{\top})^{-1}$ yields

$$Rb = Q^{\top}y$$

Since R is an upper triangular matrix, the parameter vector b can be determined by backward substitution.
Alternatively, the pseudo-inverse may be computed as follows:

$$X^+ = R^{-1}Q^{\top}$$

ScalaTion uses Householder Orthogonalization (alternatively, Modified Gram-Schmidt Orthogonalization)
to factor X into the product of Q and R.
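And with Fac_QR, which factors the data matrix X directly rather than X^T X; note that solve is called with y itself, as in the train method shown in the next subsection (the import location is an assumption):

import scalation.mathstat.Fac_QR

val fac = new Fac_QR (x)                            // factor the data matrix X = Q R directly
fac.factor ()
val b = fac.solve (y)                               // backward substitution on R b = Q^T y
println (s"b = $b")                                 // same parameters as the LU/Cholesky sketches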

6.6.8 Use of Factorization in Regression


By default, ScalaTion uses QR Factorization for matrix factorization. The other techniques may be
selected by changing the hyper-parameter (algorithm), setting it to Cholesky, SVD, LU or Inverse. For
more information see http://see.stanford.edu/materials/lsoeldsee263/05-ls.pdf.
Based on the selected algorithm, the appropriate type of matrix factorization is performed. The first
part of the code below constructs and returns a factorization object.

1 private def solver ( x_ : MatrixD ) : Factorization =
2 algorithm match
3 case " Fac_Cholesky " = > new Fac_Cholesky ( x_ .T * x_ ) // Cholesky Factorization
4 case " Fac_LU " = > new Fac_LU ( x_ .T * x_ ) // LU Factorization
5 case " Fac_Inverse " = > new Fac_Inverse ( x_ .T * x_ ) // Inverse Factorization
6 case " Fac_SVD " = > new Fac_SVD ( x_ ) // Singular Value Decomp .
7 case _ = > new Fac_QR ( x_ ) // QR Factorization
8 end match
9 end solver

The train method below computes parameter/coefficient vector b by calling the solve method provided
by the factorization classes.
1 def train ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit =
2 val fac = solver ( x_ )
3 fac . factor () // factor the matrix
4

5 b = fac match // RECORD the parameters


6 case fac : Fac_QR => fac . solve ( y_ )
7 case fac : Fac_SVD => fac . solve ( y_ )
8 case _ => fac . solve ( x_ .T * y_ )
9

10 if b (0) . isNaN then flaw ( " train " , s " parameter b = $b " )
11 debug ( " train " , s "$fac estimates parameter b = $b " )
12 end train

After training, the test method does two things: First, the residual/error vector ε is computed. Second,
several quality of fit measures are computed by calling the diagnose method.
1 def test ( x_ : MatrixD = x , y_ : VectorD = y ) : ( VectorD , VectorD ) =
2 val yp = predict ( x_ ) // make predictions
3 e = y_ - yp // RECORD the residuals / errors
4 ( yp , diagnose ( y_ , yp ) ) // return predictions and QoF
5 end test

To see how the train and test methods work in a Regression model see the Collinearity Test and Texas
Temperatures examples in subsequent subsections.

6.6.9 Model Assessment


The quality of fit measures includes the coefficient of determination R2 as well as several others.

Degrees of Freedom

Given m instances, k variables and n parameters in a regression model,

Table 6.6: Degrees of Freedom: Part 1

Instances m number of data points


Variables k number of non-redundant predictor variables
Parameters n number of parameters

the prediction vector ŷ is a projection of the response vector y ∈ Rm onto Rk , the space (hyperplane)
spanned by the vectors x1 , . . . , xk . Since ε = y − ŷ, one might think that the residual/error ε ∈ Rm−k . As
$\sum_i \epsilon_i = 0$ when an intercept parameter b0 is included in the model (n = k + 1), this constraint reduces the
dimensionality of the space by one, so ε ∈ Rm−n .
Therefore, the Degrees of Freedom (DoF) captured by the regression model, dfr , and those left for error, df ,
are indicated in the table below.

Table 6.7: Degrees of Freedom: Part 2

dfr k degrees of freedom regression/model


df m−n degrees of freedom residuals/error

As an example, the equation ŷ = 2x1 + x2 + .5 defines a dfr = 2 dimensional hyperplane (or ordinary plane)
as shown in Figure 6.3.

[Figure 6.3 shows a 3D plot of this plane, with y plotted over the (x1 , x2 ) domain.]

Figure 6.3: Hyperplane: ŷ = 2x1 + x2 + .5

It is important to remember that if the model has an intercept, k = n − 1, otherwise k = n.


Note, for more complex or regularized models, effective Degrees of Freedom (eDoF) may be used; see the
exercises in the section on Ridge Regression.

Adjusted Coefficient of Determination R̄2

The ratio of total Degrees of Freedom to Degrees of Freedom for error is

$$r_{df} = \frac{df_r + df}{df}$$
SimplerRegression is at one extreme of model complexity, where df = m−1 and dfr = 1, so rdf = m/(m−1)
is close to one. For a more complicated model, say with n = m/2, rdf will be close to 2. This ratio can be
used to adjust the Coefficient of Determination R2 to reduce it with increasing number of parameters. This
is called the Adjusted Coefficient of Determination R̄2

R̄2 = 1 − rdf (1 − R2 )

Suppose m = 121, n = 21 and R2 = 0.9, as an exercise, show that rdf = 1.2 and R̄2 = 0.88.
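A quick numeric check of this example (plain Scala arithmetic):

val (m, n)    = (121, 21)
val (dfr, df) = (n - 1, m - n)                      // dfr = 20, df = 100
val rdf       = (dfr + df) / df.toDouble            // (20 + 100) / 100 = 1.2
val rSqBar    = 1.0 - rdf * (1.0 - 0.9)             // 1 - 1.2 * 0.1 = 0.88
println (s"rdf = $rdf, adjusted R^2 = $rSqBar")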
Dividing sse and ssr by their respective Degrees of Freedom gives the mean square error and regression,
respectively

mse = sse / df
msr = ssr / dfr

The mean square error mse follows a Chi-square distribution with df Degrees of Freedom, while the mean
square regression msr follows a Chi-square distribution with dfr Degrees of Freedom. Consequently, the
ratio

$$\frac{msr}{mse} \sim F_{df_r,\, df} \qquad (6.53)$$
that is, it follows an F -distribution with (dfr , df ) Degrees of Freedom. If this number exceeds the critical
value, one can claim that the parameter vector b is not zero, implying the model is useful. More general
quality of fit measures useful for comparing models are the Akaike Information Criterion (AIC) and Bayesian
Information Criterion (BIC).
In ScalaTion the several Quality of Fit (QoF) measures are computed by the diagnose method in the
Fit class, as described in section 1 of this chapter.
1 def diagnose ( y : VectorD , yp : VectorD , w : VectorD = null )

It looks at different ways to measure the difference between the actual y and predicted yp values for the
response. The differences are optionally weighted by the vector w. Weighting is not applied when w is null.

6.6.10 Model Validation


Data are needed for two purposes: First, the characteristics or patterns of the data need to be investigated
to select an appropriate modeling technique, features for a model and finally to estimate the parameters
and probabilities used by the model. Data Scientists assisted by tools do the first part of this process, while
the latter part is called training. Hence the train method is part of all modeling techniques provided by
ScalaTion. Second, data are needed to test the quality of the trained model.
One approach would be to train the model using all the available data. This makes sense, since the more
data used for training, the better the model. In this case, the testing data would need to be the same as the
training data, leading to whole-dataset evaluation (in-sample).

Now the difficult issue is how to guard against over-fitting. With enough flexibility and parameters to
fit, modeling techniques can push quality measures like R2 to perfection (R2 = 1) by fitting the signal and
the noise in the data. Doing so tends to make a model worse in practice than a simple model that just
captures the signal. That is where quality measures like R̄2 (or AIC) come into play, but computations of
R̄2 require determination of Degrees of Freedom (df ), which may be difficult for some modeling techniques.
Furthermore, the amount of penalty introduced by such quality measures is somewhat arbitrary.
Would it not be better to measure quality in a way in which models fitting noise are downgraded because
they perform more poorly on data they have not seen? Is it really a test, if the model has already seen
the data? The answers to these questions are obvious, but the solution of the underlying problem is a bit
tricky. The first thought would be to divide a dataset in half, but then only half of the data are available
for training. Also, picking a different half may result in substantially different quality measures.
This leads to two guiding principles: First, the majority of the data should be used for training. Second,
multiple testing should be done. In general, conducting real-world tests of a model can be difficult. There
are, however, strategies that attempt to approximate such testing. Two simple and commonly used strategies
are the following: Leave-One-Out and Cross-Validation. In both cases, a dataset is divided into a training
set and a test set.

Leave-One-Out

When fitting the parameters b, the more data available in the training set, in all likelihood, the better the
fit. The Leave-One-Out strategy takes this to the extreme, by splitting the dataset into a training set of
size m − 1 and a test set of size 1 (e.g., row t in data matrix X). From this, a test error can be computed
as yt − b · xt . This can be repeated by iteratively letting t range from the first to the last row of data matrix
X. For certain predictive analytics techniques such as Multiple Linear Regression, there are efficient ways
to compute the test sse based on the leverage each point in the training set has [85].
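A minimal sketch of the Leave-One-Out loop for a Regression model is shown below; it assumes x and y hold the full dataset, that x(t) returns row t, and that not excludes the given row range (not is used this way for matrices in the cross-validation snippet below; its availability on vectors is an assumption).

import scalation.mathstat.{MatrixD, VectorD}
import scalation.modeling.Regression

// a sketch of Leave-One-Out: train on all rows except t, test on the held-out row t
var sseTest = 0.0
for t <- 0 until y.dim do
    val idx = t until t + 1                         // the single held-out test row
    val mod = new Regression (x.not (idx), y.not (idx))
    mod.train ()
    val err = y(t) - mod.predict (x(t))             // test error on the held-out row
    sseTest += err * err
end for
println (s"Leave-One-Out test sse = $sseTest")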

k-Fold Cross-Validation

A more generally applicable strategy is called cross-validation, where a dataset is divided into k test sets.
For each test set, the corresponding training set is all the instances not chosen for that test set. A simple
way to do this is to let the first test dataset be the first m/k rows of matrix X, the second be the second m/k
rows, etc.
1 val tsize = m / k // test set size
2 for l <- 0 until k do
3 x_e = x ( l * tsize until (( l +1) * tsize ) ) // l - th test set
4 x_ = x . not ( l * tsize until (( l +1) * tsize ) ) // l - th training set
5 end for

The model is trained k times using each of the training sets. The corresponding test set is then used to
estimate the test sse (or other quality measure such as mse). These are more meaningful out-of-sample
results. From each of these samples, a mean, standard deviation and confidence interval may be computed
for the test sse.
Due to patterns that may exist in the dataset, it is more robust to randomly select each of the test sets.
The row indices may be permuted for random selection, which ensures that all data instances show up in
exactly one test set.
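One way to obtain randomized folds is to shuffle the row indices once and slice the permutation, as in this plain Scala sketch:

import scala.util.Random

val k     = 5
val perm  = Random.shuffle ((0 until y.dim).toVector)   // random permutation of row indices
val folds = perm.grouped (y.dim / k).toVector           // k groups of test-set indices
for l <- folds.indices do                               // (a small remainder group may appear if m is not divisible by k)
    println (s"fold $l: test rows = ${folds(l)}")
end for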

Typically, training QoF (in-sample) measures such as R2 will be better than testing QoF (out-of-sample)
measures such as $R^2_{cv}$. Adjusted measures such as R̄2 are intended to more closely follow $R^2_{cv}$ than R2 .
ScalaTion supports cross-validation via its crossValidate method.
1 @ param k the number of cross - validation iterations / folds ( defaults to 5 x ) .
2 @ param rando flag indicating whether to use randomized or simple cross - validation
3

4 def crossValidate ( k : Int = 5 , rando : Boolean = true ) : Array [ Statistic ] =

It also supports a simpler strategy that only tests once, via its validate method defined in the Predictor
trait. It utilizes the Test-n-Train Split TnT_Split from the mathstat package.
1 @ param rando flag indicating whether to use randomized or simple validation
2 @ param ratio the ratio of the TESTING set to the full dataset ( e . g . , 70 -30 , 80 -20)
3 @ param idx the prescribed TESTING set indices
4

5 def validate ( rando : Boolean = true , ratio : Double = 0.2)


6 ( idx : IndexedSeq [ Int ] =
7 testIndices ( rando , ( ratio * y . dim ) . toInt ) ) : VectorD =
8 val ( x_e , x_ , y_e , y_ ) = TnT_Split (x , y , idx ) // Test -n - Train Split
9

10 train ( x_ , y_ )
11 val qof = test ( x_e , y_e ) . _2
12 if qof ( QoF . sst . ordinal ) <= 0.0 then
13 flaw ( " validate " , " chosen testing set has no variability " )
14 end if
15 println ( FitM . fitMap ( qof , QoF . values . map ( _ . toString ) ) )
16 qof
17 end validate

6.6.11 Collinearity
Consider the matrix-vector equation used for estimating the parameters b via the minimization of ‖ε‖.

y = Xb + ε
The parameter/coefficient vector b = [b0 , b1 , . . . , bk ] may be viewed as weights on the column vectors in the
data/predictor matrix X.

y = b0 1 + b1 x:1 + . . . + bk x:k + ε
A question arises when two of these column vectors are nearly the same (or more generally nearly parallel
or anti-parallel). They will affect and may obfuscate each other's parameter values.
First, we will examine ways of detecting such problems and then give some remedies. A simple check is to
compute the correlation matrix for the column vectors in matrix X. High (positive or negative) correlation
indicates collinearity.

Example Problem

Consider the following data/input matrix X and response vector y. This is the same example used for
SimpleRegression with new variable x2 added (i.e., y = b0 + b1 x1 + b2 x2 + ). The collinearityTest
main function allows one to see the effects of increasing the collinearity of features/variables x1 and x2 .

1 package < your - package >
2

3 import scalation . modeling . Regression


4 import scalation . mathstat .{ MatrixD , VectorD }
5

6 @ main def collinearityTest () : Unit =


7

8 // one x1 x2
9 val x = MatrixD ((4 , 3) , 1 , 1 , 1 , // input / data matrix
10 1, 2, 2,
11 1, 3, 3,
12 1 , 4 , 0) // change 0 by adding .5 until it ’s 4
13

14 val y = VectorD (1 , 3 , 3 , 4) // output / response vector


15

16 val v = x (? , 0 until 2)
17 banner ( s " Test without column x2 " )
18 println ( s " v = $v " )
19 var mod = new Regression (v , y )
20 mod . trainNtest () ()
21 println ( mod . summary () )
22

23 for i <- 0 to 8 do
24 banner ( s " Test Increasing Collinearity : x_32 = ${ x (3 , 2) } " )
25 println ( s " x = $x " )
26 println ( s " x . corr = ${ x . corr } " )
27 mod = new Regression (x , y )
28 mod . trainNtest () ()
29 println ( mod . summary () )
30 x (3 , 2) + = 0.5
31 end for
32

33 end collinearityTest

Try changing the value of element x32 from 0 to 4 by .5 and observe what happens to the correlation
matrix. What effect do these changes have on the parameter vector b = [b0 , b1 , b2 ], and how do the first two
parameters compare to the regression where the last column of X is removed, giving the parameter vector
b = [b0 , b1 ]?
The corr method is provided by the scalation.mathstat.MatrixD class. If either column vector has zero
variance, this method returns 1.0 when the column vectors are the same and -0.0 (indicating undefined)
otherwise.
Note, perfect collinearity produces a singular matrix, in which case many factorization algorithms will
give NaN (Not-a-Number) for much of their output. In this case, Fac SVD (Singular Value Decomposition)
should be used. This can be done by changing the following hyper-parameter provided by the Regression
object, before instantiating the Regression class.

1 Regression . hp ( " factorization " ) = " Fac_SVD "


2 val mod = new Regression (x , y )

Multi-Collinearity

Even if no particular entry in the correlation matrix is high, a column in the matrix may still be nearly
a linear combination of other columns. This is the problem of multi-collinearity. This can be checked by
computing the Variance Inflation Factor (VIF) function (or vif in ScalaTion). For a particular parameter
bj for the variable/predictor xj , the function is evaluated as follows:

$$\text{vif}(b_j) = \frac{1}{1 - R^2(x_j)} \qquad (6.54)$$

where R2 (xj ) is the R2 for the regression of variable xj onto the rest of the predictors. It measures how well the
variable xj (or its column vector x:j ) can be predicted by all xl for l ≠ j. Values above 20 (R2 (xj ) = 0.95)
are considered by some to be problematic. In particular, the value for parameter bj may be suspect, since
its variance is inflated by vif(bj ).

$$\hat{\sigma}^2(b_j) = \frac{mse}{k\, \hat{\sigma}^2(x_j)} \cdot \text{vif}(b_j) \qquad (6.55)$$
See the exercises for details. Both corr and vif may be tested in ScalaTion using RegressionTest4.
One remedy to reduce collinearity/multi-collinearity is to eliminate the variable with the highest corr/vif
value. Another is to use regularized regression such as RidgeRegression or LassoRegression.
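The sketch below computes a VIF by hand for one variable, regressing its column onto the remaining columns; it assumes x is a three-column data matrix [1, x1, x2] as in the collinearity example above, and that QoF.rSq.ordinal indexes R² in the returned QoF vector (an assumption, by analogy with QoF.sse.ordinal and QoF.rSqBar.ordinal used elsewhere in this chapter).

import scalation.mathstat.MatrixD
import scalation.modeling.*

val xj    = x(?, 2)                                 // column vector for x2
val xRest = x(?, 0 until 2)                         // remaining columns [1, x1]
val reg   = new Regression (xRest, xj)              // regress x2 onto the other predictors
val (_, qof) = reg.trainNtest () ()
val rSq   = qof (QoF.rSq.ordinal)                   // R^2 of this regression (index name assumed)
val vif   = 1.0 / (1.0 - rSq)
println (s"vif for x2 = $vif")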

6.6.12 Feature Selection


There may be predictor variables (features) in the model that contribute little to the model's ability to
make predictions. The improvement to R2 may be small and may make R̄2 or other
quality of fit measures worse. An easy way to get a basic understanding is to compute the correlation of
each predictor variable x:j (j th column of matrix X) with the response vector y. A more intuitive way to
do this would be to plot the response vector y versus each predictor variable x:j . See the exercises for an
example.
Ideally, one would like to pick a subset of the k variables that would optimize a selected quality measure.
Unfortunately, there are $2^k$ possible subsets to test. Two simple techniques (greedy algorithms) for selecting
features are forward selection and backward elimination. A combination of these two is provided by stepwise
regression.

Forward Selection

The forwardSel method, coded in the Predictor trait, performs forward selection by adding the most
predictive variable to the existing model, returning the variable to be added and a reference to the new
model with the added variable/feature.
1 @ param cols the columns of matrix x currently included in the existing model
2 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
3

4 def forwardSel ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ) : BestStep =

The BestStep is used to record the best improvement step found so far.
1 @ param col the column / variable to ADD / REMOVE for this step
2 @ param qof the Quality of Fit ( QoF ) for this step
3 @ param mod the model including selected features / variables for this step

4

5 case class BestStep ( col : Int = -1 , qof : VectorD = null , mod : Predictor = null )

Selecting the most predictive variable to add boils down to comparing on the basis of a Quality of Fit
(QoF) measure. The default is the Adjusted Coefficient of Determination R̄2 . The optional argument idx_q
indicates which QoF measure to use (defaults to QoF.rSqBar.ordinal). To start with a minimal model, set
cols = Set (0) for an intercept-only model. The method will consider every variable/column in x.indices2
not already in cols and pick the best one for inclusion.
1 for j <- x . indices2 if ! ( cols contains j ) do

To find the best model, the forwardSel method should be called repeatedly while the quality of fit measure
is sufficiently improving. This process is automated in the forwardSelAll method.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param cross whether to include the cross - validation QoF measure
3

4 def forwardSelAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :
5 ( LinkedHashSet [ Int ] , MatrixD ) =

The forwardSelAll method takes the QoF measure to use as the selection criterion and whether to apply
cross-validation as inputs and returns the best collection of features/columns to include in the model as well
as the QoF measures for all steps.
To see how R2 , R̄2 , sMAPE, and $R^2_{cv}$ change with the number of features/parameters added to the model
by the forwardSelAll method, run the following test code from the scalation.modeling module.

sbt> runMain scalation.modeling.regressionTest5

sMAPE, symmetric Mean Absolute Percentage Error, is explained in detail in the Time Series/Temporal
Models Chapter.
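A minimal usage sketch for an existing Regression model mod: the first returned value is the set of selected column indices and the second holds the QoF measures recorded at each step.

val (cols, qofSteps) = mod.forwardSelAll (cross = false)   // add features one at a time
println (s"selected columns = $cols")
println (s"QoF measures for each step = $qofSteps")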

Backward Elimination

The backwardElim method, coded in the Predictor trait, performs backward elimination by removing the
least predictive variable from the existing model, returning the variable to eliminate, the new parameter
vector and a reference to the new model with the removed variable/feature.
1 @ param cols the columns of matrix x currently included in the existing model
2 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
3 @ param first first variable to consider for elimination
4 ( default (1) assume intercept x_0 will be in any model )
5

6 def backwardElim ( cols : LinkedHashSet [ Int ] , idx_q : Int = QoF . rSqBar . ordinal ,
7 first : Int = 1) : BestStep =

To start with a maximal model, set cols = Set (0, 1, ..., k) for a full model. As with forwardSel,
the idx_q optional argument allows one to choose from among the QoF measures. The last parameter first
provides immunity from elimination for any variable/parameter whose index is less than first (e.g., to ensure that
models include an intercept b0 , set first to one). The method will consider every variable/column from
first until x.dim2 in cols and pick the worst one for elimination.
1 for j <- first until x . dim2 if cols contains j do

To find the best model, the backwardElim method should be called repeatedly until the quality of fit measure
sufficiently decreases. This process is automated in the backwardElimAll method.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param first first variable to consider for elimination
3 @ param cross whether to include the cross - validation QoF measure
4

5 def backwardElimAll ( idx_q : Int = QoF . rSqBar . ordinal , first : Int = 1 ,


6 cross : Boolean = true ) :
7 ( LinkedHashSet [ Int ] , MatrixD ) =

The backwardElimAll method takes the QoF measure to use as the selection criterion, the index of the first
variable to consider for elimination, and whether to apply cross-validation as inputs and returns the best
collection of features/columns to include in the model as well as the QoF measures for all steps.
Some studies have indicated that backward elimination can outperform forward selection, but it is difficult
to say in general.
More advanced feature selection techniques include using genetic algorithms to find near optimal subsets
of variables as well as techniques that select variables as part of the parameter estimation process, e.g.,
LassoRegression.

Stepwise Regression

An improvement over Forward Selection and Backward Elimination is possible with Stepwise Regression.
It starts with either no variables or the intercept in the model and adds one variable that improves the
selection criterion the most. It then adds the second best variable for step two. After the second step, it
determines whether it is better to add or remove a variable. It continues in this fashion until no improvement
in the selection criterion is found at which point it terminates. Note, for Forward Selection and Backward
Elimination it may be instructive to continue all the way to the end (all variables for forward/no variables for
backward).
Stepwise regression may lead to coincidental relationships being included in the model, particularly if a
t-test is the basis of inclusion or a penalty-free QoF measure such as R2 is used. Typically, this approach is used
when there is a penalty for having extra variables/parameters, e.g., adjusted R2 (R̄2 ), cross-validation R2 ($R^2_{cv}$), or the
Akaike Information Criterion (AIC). See the section on Maximum Likelihood Estimation for a definition of
AIC. Alternatives to Stepwise Regression include Lasso Regression (ℓ1 regularization) and, to a lesser extent,
Ridge Regression (ℓ2 regularization).
ScalaTion provides the stepRegressionAll method for Stepwise Regression. At each step it calls
forwardSel and backwardElim and chooses the one yielding better improvement.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param cross whether to include the cross - validation QoF measure
3

4 def stepRegressionAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :


5 ( LinkedHashSet [ Int ] , MatrixD ) =

An option for further improvement is to add a swapping operation, which finds the best variable to remove
and replace with a variable not in the model. Unfortunately, this may lead to a quadratic number of steps in
the worst-case (as opposed to linear for forward, backward and stepwise without swapping). See the exercises
for more details.

Categorical Variables/Features

For Regression, the variables/features have so far been treated as continuous or ordinal. However, some
variables may be categorical in nature, where there is no ordering of the values for a categorical variable.
Although one can encode “English”, “French”, “Spanish” as 0, 1, and 2, it may lead to problems such
as concluding the average of “English” and “Spanish” is “French”.
In such cases, it may be useful to replace a categorical variable with multiple dummy variables. Typically,
a categorical variable (column in the data matrix) taking on k distinct values is replaced with k − 1
dummy variables (columns in the data matrix). For details on how to do this effectively, see the section on
RegressionCat.
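As a small illustration (plain Scala, not the RegressionCat machinery), a categorical variable with k = 3 levels can be mapped to k − 1 = 2 dummy columns, with one level serving as the baseline:

val lang = Array ("English", "French", "Spanish", "French", "English")

val dummies = lang.map { v =>
    val dFrench  = if v == "French"  then 1.0 else 0.0      // dummy for "French"
    val dSpanish = if v == "Spanish" then 1.0 else 0.0      // dummy for "Spanish"
    Array (dFrench, dSpanish)                                // "English" is the baseline (0, 0)
}
dummies.foreach (row => println (row.mkString (", ")))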

6.6.13 Regression Problem: Texas Temperatures


Solving a regression problem in ScalaTion simply involves creating a data/input matrix X ∈ Rm×n and a
response/output vector y ∈ Rm and then creating a Regression object upon which the trainNtest method
is called. The trainNtest method conveniently calls the train, test and report methods internally.
The Texas Temperature dataset below from http://www.stat.ufl.edu/~winner/cases/txtemp.ppt is
used to illustrate how to use ScalaTion for a regression problem. The purpose of the model is to predict
average January high temperatures for 16 Texas county weather stations based on their Latitude, Elevation
and Longitude.
1 // 16 data points : one x1 x2 x3 y
2 // Lat Elev Long Temp County
3 val xy = MatrixD ((16 , 5) , 1.0 , 29.767 , 41.0 , 95.367 , 56.0 , // Harris
4 1.0 , 32.850 , 440.0 , 96.850 , 48.0 , // Dallas
5 1.0 , 26.933 , 25.0 , 97.800 , 60.0 , // Kennedy
6 1.0 , 31.950 , 2851.0 , 102.183 , 46.0 , // Midland
7 1.0 , 34.800 , 3840.0 , 102.467 , 38.0 , // Deaf Smith
8 1.0 , 33.450 , 1461.0 , 99.633 , 46.0 , // Knox
9 1.0 , 28.700 , 815.0 , 100.483 , 53.0 , // Maverick
10 1.0 , 32.450 , 2380.0 , 100.533 , 46.0 , // Nolan
11 1.0 , 31.800 , 3918.0 , 106.400 , 44.0 , // El Paso
12 1.0 , 34.850 , 2040.0 , 100.217 , 41.0 , // Collington
13 1.0 , 30.867 , 3000.0 , 102.900 , 47.0 , // Pecos
14 1.0 , 36.350 , 3693.0 , 102.083 , 36.0 , // Sherman
15 1.0 , 30.300 , 597.0 , 97.700 , 52.0 , // Travis
16 1.0 , 26.900 , 315.0 , 99.283 , 60.0 , // Zapata
17 1.0 , 28.450 , 459.0 , 99.217 , 56.0 , // Lasalle
18 1.0 , 25.900 , 19.0 , 97.433 , 62.0) // Cameron
19

20 banner ( " Texas Temperatures Regression " )


21 val mod = Regression ( xy ) () // create a regression model
22 mod . trainNtest () () // train and test the model
23 println ( mod . summary () ) // parameter / coefficient statistics

The trainNtest Method

The trainNtest method defined in the Predictor trait does several things: trains the model on x and y ,
tests the model on xx and yy, produces a report about training and testing, and optionally plots y-actual
and y-predicted.

1 @ param x_ the training / full data / input matrix ( defaults to full x )
2 @ param y_ the training / full response / output vector ( defaults to full y )
3 @ param xx the testing / full data / input matrix ( defaults to full x )
4 @ param yy the testing / full response / output vector ( defaults to full y )
5

6 def trainNtest ( x_ : MatrixD = x , y_ : VectorD = y )


7 ( xx : MatrixD = x , yy : VectorD = y ) : ( VectorD , VectorD ) =
8 train ( x_ , y_ )
9 debug ( " trainNTest " , s " b = $b " )
10 val ( yp , qof ) = test ( xx , yy )
11 println ( report ( qof ) )
12 if DO_PLOT then
13 val lim = min ( yy . dim , LIMIT )
14 val ( qyy , qyp ) = ( yy (0 until lim ) , yp (0 until lim ) ) // slice to LIMIT
15 val ( ryy , ryp ) = orderByY ( qyy , qyp ) // order by yy
16 new Plot ( null , ryy , ryp , s "$modelName : y actual , predicted " )
17 end if
18 ( yp , qof )
19 end trainNtest

The report Method

The report method returns the following basic information: (1) the name of the modeling technique nm, (2)
the values of the hyper-parameters hp (used for controlling the model/optimizer), (3) the feature/predictor
variable names fn, (4) the values of the parameters b, and (5) several Quality of Fit measures qof.

REPORT
----------------------------------------------------------------------------
modelName mn = Regression
----------------------------------------------------------------------------
hparameter hp = HyperParameter(factorization -> (Fac_QR,Fac_QR))
----------------------------------------------------------------------------
features fn = Array(x0, x1, x2, x3)
----------------------------------------------------------------------------
parameter b = VectorD(151.298,-1.99323,-0.000955478,-0.384710)
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(
rSq -> 0.991921, rSqBar -> 0.989902, sst -> 941.937500, sse -> 7.609494,
mse0 -> 0.475593, rmse -> 0.689633, mae -> 0.531353, dfm -> 3.000000,
df -> 12.000000, fStat -> 491.138015, aic -> -8.757481, bic -> -5.667126,
mape -> 1.095990, smape -> 1.094779, mase -> 0.066419)

The plot below shows the results from running the ScalaTion Regression Model in terms of actual (y) vs.
predicted (yp) response vectors.

[Plot: Regression Model: y (*) vs. yp (+), with the y and yp values on the vertical axis and the instance index on the horizontal axis.]

The summary Method

More details about the parameters/coefficients including standard errors, t-values, p-values, and Variance
Inflation Factors (VIFs) are shown by the summary method.
1 println ( mod . summary () )

For the Texas Temperatures dataset it provides the following information: The Estimate is the value assigned
to the parameter for the given Var. The Std. Error, t-value, p-value and VIF are also given.

fname = Array(x0 = intercept, x1 = Lat, x2 = Elev, x3 = Long)


SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 151.297616 25.133361 6.019792 0.000060 NA
x1 -1.993228 0.136390 -14.614194 0.000000 4.228079
x2 -0.000955 0.000568 -1.683440 0.118102 16.481808
x3 -0.384710 0.228584 -1.683018 0.118185 9.432463

Residual standard error: 0.796319 on 12.0 degrees of freedom


Multiple R-squared: 0.991921, Adjusted R-squared: 0.989902
F-statistic: 491.1380151109529 on 3.0 and 12.0 DF, p-value: 8.089084957418891E-13
----------------------------------------------------------------------------------

Given the following assumptions: (1) ε ∼ D(0, σI) for some distribution D and (2) for each column j, ε and
xj are independent, the covariance matrix of the parameter vector b is

$$C[b] = \sigma^2 (X^{\top}X)^{-1} \qquad (6.56)$$

See [159] for a derivation. Using σ̂ 2 as an estimate for σ 2 ,

$$\hat{\sigma}^2 = \frac{\epsilon \cdot \epsilon}{df} = mse \qquad (6.57)$$

the standard deviation (or standard error) of the j-th parameter/coefficient may be given as the square root
of the j-th diagonal element of the covariance matrix.

$$\hat{\sigma}_{b_j} = \hat{\sigma}\, \sqrt{\left[(X^{\top}X)^{-1}\right]_{jj}} \qquad (6.58)$$
The corresponding t-value is simply the parameter value divided by its standard error, which indicates how
many standard deviation units it is away from zero. The farther away from zero, the more significant (or more
important to the model) the parameter is.

$$t(b_j) = \frac{b_j}{\hat{\sigma}_{b_j}} \qquad (6.59)$$
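These quantities can be computed directly from a fitted model. The sketch below assumes the data matrix x, the fitted parameter vector b and the mean squared error mse are already in scope; it simply evaluates equations 6.56 to 6.59.

val cInv = (x.T * x).inverse                        // (X^T X)^{-1}
for j <- 0 until b.dim do
    val se = math.sqrt (mse * cInv(j, j))           // standard error of b_j (equation 6.58)
    val t  = b(j) / se                              // t-value (equation 6.59)
    println (s"b_$j = ${b(j)}, se = $se, t = $t")
end for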
When the error distribution is Normal, then t(bj ) follows the Student’s t Distribution. For example, the pdf
for the Student’s t Distribution with df = ν = 2 Degrees of Freedom is shown in the figure below (the t
Distribution approaches the Normal Distribution as ν increases).

$$f_y(y) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})}\left(1 + \frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \frac{\Gamma(3/2)}{\sqrt{2\pi}\,\Gamma(1)}\left(1 + \frac{y^2}{2}\right)^{-3/2} = \frac{1}{2\sqrt{2}}\left(1 + \frac{y^2}{2}\right)^{-3/2} \qquad (6.60)$$

[Plot: pdf for Student's t Distribution (blue) vs. Normal (green), showing fy (y) for y ∈ [−3, 3].]

The corresponding p-value P (|y| > |t|) measures how significant the t-value is, e.g., for t = −1.683018,

Fy (−1.683018) = 0.0590926
P (|y| > 1.683018) = 2 Fy (−1.683018) = 0.118185 for ν = df = 12

Typically, the t-value is only considered significant if it is in the tails of the Student's t distribution. The
farther out in the tails, the less likely it is for the parameter to be non-zero (and hence be part of the model)
simply by chance. The p-value measures the risk (chance of being wrong) in including parameter bj and
therefore variable xj in the model.

The predict Method

Finally, given a new data vector z, the predict method may be used to predict its response value.
1 val z = VectorD (1.0 , 30.0 , 1000.0 , 100.0)
2 println ( s " predict (z) = ${ mod . predict ( z ) } " )

Feature Selection

Feature selection (or Variable Selection) may be carried out by using either forwardSel or backwardElim.
These methods add or remove one variable at a time. To iteratively add or remove, the following methods
may be called.
1 mod . forwardSelAll ( cross = false )
2 mod . backwardElimAll ( cross = false )
3 mod . stepRegressionAll ( cross = false )

The default criterion for choosing which variable to add/remove is Adjusted R2 . It may be changed via the
idx_q parameter to the methods (see the Fit trait for the possible values for this parameter). Note: The
cross-validation is turned off (cross = false) due to the small size of the dataset.
The source code for the Texas Temperatures example is a test case in Regression.scala.

6.6.14 Regression Class

Class Methods:
1 @ param x the data / input m - by - n matrix
2 ( augment with a first column of ones to include intercept in model )
3 @ param y the response / output m - vector
4 @ param fname_ the feature / variable names ( defaults to null )
5 @ param hparam the hyper - parameters ( defaults to Regression . hp )
6

7 class Regression ( x : MatrixD , y : VectorD , fname_ : Array [ String ] = null ,


8 hparam : Hyp erParame ter = Regression . hp )
9 extends Predictor (x , y , fname_ , hparam )
10 with Fit ( dfm = x . dim2 - 1 , df = x . dim - x . dim2 ) :
11

12 def train ( x_ : MatrixD = x , y_ : VectorD = y ) : Unit =


13 def test ( x_ : MatrixD = x , y_ : VectorD = y ) : ( VectorD , VectorD ) =
14 override def summary ( x_ : MatrixD = getX , fname_ : Array [ String ] = fname ,
15 b_ : VectorD = b , vifs : VectorD = vif () ) : String =
16 override def predict ( x_ : MatrixD ) : VectorD = x_ * b
17 override def buildModel ( x_cols : MatrixD ) : Regression =

6.6.15 Exercises
1. For Exercise 1 from the last section, compute $A = X^{\top}X$ and $z = X^{\top}y$. Now solve the following linear
system of equations for b.

Ab = z

2. Gradient descent can be used for Multiple Linear Regression as well. For gradient descent, pick a
starting point b0 , compute the gradient of the loss function ∇L and move −η∇L from b0 where η is
the learning rate. Write a Scala program that repeats this for several iterations for the above data.
What is happening to the value of the loss function L?

$$\nabla L = -X^{\top}(y - Xb)$$

Substituting ε = y − Xb allows ∇L to be written as

$$-X^{\top}\epsilon$$

Starting with data matrix x, response vector y and parameter vector b, in ScalaTion, the calculations
become
1 val yp = x * b // y predicted
2 val e = y - yp // error
3 val g = x .T * e // - gradient
4 b += g * eta // update parameter b
5 val h = 0.5 * ( e dot e ) // half the sum of squared errors

Unless the dataset is normalized, finding an appropriate learning rate eta may be difficult. See the
MatrixTransform object for details. Do this for the Blood Pressure Example_BPressure dataset. Try
using another dataset.

3. Consider the relationships between the predictor variables and the response variable in the AutoMPG
dataset. This is a well-known dataset that is available at multiple websites including the UCI Machine
Learning Repository http://archive.ics.uci.edu/ml/datasets/Auto+MPG. The response variable
is the miles per gallon (mpg: continuous) while the predictor variables are cylinders: multi-valued
discrete, displacement: continuous, horsepower: continuous, weight: continuous, acceleration:
continuous, model year: multi-valued discrete, origin: multi-valued discrete, and car name: string
(unique for each instance). Since the car name is unique and obviously not causal, this variable is
eliminated, leaving seven predictor variables. First compute the correlations between mpg (vector y)
and the seven predictor variables (each column vector x:j in matrix X).
1 val correlation = y corr x_j

and then plot mpg versus each of the predictor variables. The source code for this example is at
http://www.cs.uga.edu/~jam/scalation_2.0/src/main/scala/scalation/modeling/Example_AutoMPG.
scala .
Alternatively, a .csv file containing the AutoMPG dataset may be read into a relation called auto_tab
from which data matrix x and response vector y may be produced. If the dataset has missing values,
they may be replaced using a spreadsheet or using the techniques discussed in the Data Preprocessing
Chapter.

1 val auto_tab = Relation ( BASE_DIR + " auto - mpg . csv " , " auto_mpg " , null , -1)
2 val (x , y ) = auto_tab . toMatrixDD (1 to 6 , 0)
3 println ( s " x = x”)println(s”y =y "

4. Apply Regression analysis on the AutoMPG dataset. Compare with results of applying the NullModel,
SimplerRegression and SimpleRegression. Try using SimplerRegression and SimpleRegression
with different predictor variables for these models. How do their R2 values compare to the correlation
analysis done in the previous exercise?

5. Examine the collinearity and multi-collinearity of the column vectors in the AutoMPG dataset.

6. For the AutoMPG dataset, repeatedly call the backwardElim method to remove the predictor variable
that contributes the least to the model. Show how the various quality of fit (QoF) measures change as
variables are eliminated. Do the same for the forwardSel method. Using R̄2 , select the best models
from the forward and backward approaches. Are they the same?

7. Compare model assessment and model validation. Compute sse, mse and R2 for the full and best
AutoMPG models trained on the entire data set. Compare this with the results of Leave-One-Out,
5-fold Cross-Validation and 10-fold Cross-Validation.

8. The variance of the estimate of parameter bj may be estimated as follows:

$$\hat{\sigma}^2(b_j) = \frac{mse}{k\, \hat{\sigma}^2(x_j)} \cdot \text{vif}(b_j)$$

Derive this formula. The standard error is the square root of this value. Use the estimate for bj and
its standard error to compute a t-value and p-value for the estimate. Run the AutoMPG model and
explain these values produced by the summary method.

9. Singular Value Decomposition Technique. In cases where the rank of the data/input matrix X
is not full or its multi-collinearity is high, a useful technique to solve for the parameters of the model is
Singular Value Decomposition (SVD). Based on the derivation given in http://www.ime.unicamp.br/
~marianar/MI602/material%20extra/svd-regression-analysis.pdf, we start with the equation
estimating y as the product of the data matrix X and the parameter vector b.

y = Xb

We then perform a singular value decomposition on the m-by-n matrix X

$$X = U\Sigma V^{\top}$$

where in the full-rank case, U is an m-by-n orthogonal matrix, Σ is an n-by-n diagonal matrix of singular
values, and $V^{\top}$ is an n-by-n orthogonal matrix. The rank r = rank(X) equals the number of nonzero singular
values in Σ, so in general, U is m-by-r, Σ is r-by-r, and $V^{\top}$ is r-by-n. The singular values are the
square roots of the nonzero eigenvalues of $X^{\top}X$. Substituting for X yields

$$y = U\Sigma V^{\top}b$$

Defining $d = \Sigma V^{\top}b$, we may write

$$y = Ud$$

This can be viewed as an estimating equation where X is replaced with U and b is replaced with d.
Consequently, a least squares solution for the alternate parameter vector d is given by

$$d = (U^{\top}U)^{-1}U^{\top}y$$

Since $U^{\top}U = I$, this reduces to

$$d = U^{\top}y$$

If rank(X) = n (full rank), then the conventional parameters b may be obtained as follows:

$$b = V\Sigma^{-1}d$$

where $\Sigma^{-1}$ is a diagonal matrix whose elements on the main diagonal are the reciprocals of the singular
values.

10. Improve Stepwise Regression. Write ScalaTion code to improve the stepRegressionAll method
by implementing the swapping operation. Then redo exercise 6 using all three: Forward Selection,
Backward Elimination, and Stepwise Regression with all four criteria: R2 , R̄2 , $R^2_{cv}$, and AIC. Plot the
curve for each criterion, determine the best number of variables and what these variables are. Compare
the four criteria.
As part of a larger project compare this form of feature selection with that provided by Ridge Regression
and Lasso Regression. See the next two sections.
Now add features including quadratic terms, cubic terms, and dummy variables to the model using
SymbolicRegression.quadratic, SymbolicRegression.cubic, and RegressionCat. See the subse-
quent sections.
In addition to the AutoMPG dataset, use the Concrete dataset and three more datasets from UCI
Machine Learning Repository. The UCI datasets should have more instances (m) and variables (n)
than the first two datasets. The testing should also be done in R or Python.

11. Regression as Projection. Consider the following six vectors/points in 3D space where the response
variable y is modeled as a linear function of predictor variables x1 and x2 .
1 // x1 x2 y
2 val xy = MatrixD ((6 , 3) , 1 , 1, 2.8 ,
3 1, 2, 4.2 ,
4 1, 3, 4.8 ,
5 2, 1, 5.3 ,
6 2, 2, 5.5 ,
7 2, 3, 6.5)

[Plot: the six (x1 , x2 , y) data points shown in 3D space.]

Consider a regression model equation with no intercept.

y = b0 x1 + b1 x2 + ε

Determine the plane (response surface) that these six points are projected onto.

ŷ = b0 x1 + b1 x2

For this problem, the number of instances m = 6 and the number of parameters/predictor variables
n = 2. Determine the number of Degrees of Freedom for the model dfm and the number of Degrees of
Freedom for the residuals/errors df .
12. Given a data matrix X ∈ Rm×2 and response vector y ∈ Rm where X = [1, x], compute $X^{\top}X$ and
$X^{\top}y$. Use these to set up an augmented matrix and then apply LU Factorization to make it upper
triangular. Solve for the parameters b0 and b1 symbolically. Simplify to reproduce the formulas for b0 and
b1 for Simple Regression.
13. Recall that ŷ = Hy where the hat matrix is $H = X(X^{\top}X)^{-1}X^{\top}$. The leverage of point i is defined to be
hii .

$$h_{ii} = x_i^{\top}(X^{\top}X)^{-1}x_i \qquad (6.61)$$

The main diagonal of the hat matrix gives the leverage for each of the points. Points with high leverage
are those above a threshold such as

$$h_{ii} \ge \frac{2\, tr(H)}{m} \qquad (6.62)$$

Note that the trace tr(H) = rank(H) = rank(X) will equal n when X has full rank. List the high
leverage points for the Example_AutoMPG dataset.

14. Points that are influential in determining values for model coefficients/parameters combine high lever-
age with large residuals. Measures of influence include Cook’s Distance, DFFITS, and DFBETAS
[34] and see http://home.iitk.ac.in/~shalab/regression/Chapter6-Regression-Diagnostic%
20for%20Leverage%20and%20Influence.pdf. These measures can also be useful in detecting po-
tential outliers. Compute these measures for the Example_AutoMPG dataset.

15. The best two predictor variables for AutoMPG are weight and modelyear and with the weight given
in units of 1000 pounds, the prediction equation for the Regression model (with intercept) is

ŷ = − 14.3473 − 6.63208x1 + 0.757318x2

The corresponding hyperplane is shown in Figure 6.4.

[Figure 6.4 shows a 3D plot of this plane, with y plotted over the (x1 , x2 ) domain.]

Figure 6.4: Hyperplane: ŷ = −14.3473 − 6.63208x1 + 0.757318x2

Make a plot of the hyperplane for the second best combination of features. Compare the QoF of these
two models and explain how the feature combinations affect the response variable (mpg).

16. State and explain the conditions required for the Ordinary Least Squares (OLS) estimate of parameter
vector b for multiple linear regression to be B.L.U.E. See the Gauss-Markov Theorem. B.L.U.E. stands
for Best Linear Unbiased Estimator.

6.6.16 Further Reading


1. Introduction to Linear Regression Analysis, 5th Edition [127]

2. Regression: Linear Models in Statistics [18]

6.7 Ridge Regression
The RidgeRegression class supports multiple linear ridge regression. As with Regression, the predictor
variables x are multi-dimensional [x1 , . . . , xk ], as are the parameters b = [b1 , . . . , bk ]. Ridge regression adds
a penalty based on the ℓ2 norm of the parameters b to reduce the chance of them taking on large values
that may lead to less robust models.
The penalty holds down the values of the parameters and this may result in several advantages: (1)
better out-of-sample (e.g., cross-validation) quality of fit, (2) reduced impact of multi-collinearity, (3) turning
singular matrices non-singular, and, to a limited extent, (4) eliminating features/predictor variables from the
model.
The penalty is not to be included on the intercept parameter b0 , as this would shift predictions in a way
that would adversely affect the quality of the model. See the exercise on scale invariance.

6.7.1 Model Equation


Centering of the data allows the intercept to be removed from the model. The combined centering on both
the predictor variables and the response variable takes care of the intercept, so it is not included in the
model. Thus, the goal is to fit the parameter vector b in the model/regression equation,

y = b · x + ε = b1 x1 + · · · + bk xk + ε (6.63)

where ε represents the residuals (the part not explained by the model).

6.7.2 Training
Centering the dataset (X, y) has the following effects: First, when the X matrix is centered, the intercept
b0 = µy . Second, when y is centered, µy becomes zero, implying b0 = 0. To rescale back to the original
response values, µy can be added back during prediction. Therefore, both the data/input matrix X and the
response/output vector y should be centered (zero mean).

$$X^{(c)} = X - \mu_x \qquad \text{subtract predictor column means} \qquad (6.64)$$

$$y^{(c)} = y - \mu_y \qquad \text{subtract response mean} \qquad (6.65)$$

The regularization of the model adds an ℓ2-penalty on the parameters b. The objective function to
minimize is now the loss function L(b) = ½ sse plus the ℓ2-penalty.

$$f_{obj} = L(b) + \frac{1}{2}\lambda\, \|b\|^2 = \frac{1}{2}\, \epsilon \cdot \epsilon + \frac{1}{2}\lambda\, b \cdot b \qquad (6.66)$$

where λ is the shrinkage parameter. A large value for λ will drive the parameters b toward zero, while a
small value can help stabilize the model (e.g., for nearly singular matrices or high multi-collinearity).

$$f_{obj} = \frac{1}{2}\, (y - Xb) \cdot (y - Xb) + \frac{1}{2}\lambda\, b \cdot b \qquad (6.67)$$

6.7.3 Optimization
Fortunately, the quadratic nature of the penalty function allows it to be combined easily with the quadratic
error terms, so that matrix factorization can still be used for finding optimal values for parameters.
Taking the gradient of the objective function fobj with respect to b and then setting it equal to zero
yields

$$-X^{\top}(y - Xb) + \lambda b = 0 \qquad (6.68)$$

Recall the first term of the gradient was derived in the Regression section. See the exercises below for
deriving the last term of the gradient. Multiplying out gives

$$-X^{\top}y + (X^{\top}X)b + \lambda b = 0$$
$$(X^{\top}X)b + \lambda b = X^{\top}y$$

Since λb = λIb where I is the n-by-n identity matrix, we may write

$$(X^{\top}X + \lambda I)b = X^{\top}y \qquad (6.69)$$

Matrix factorization may now be used to solve for the parameters b in the modified Normal Equations. For
example, use of matrix inversion yields

$$b = (X^{\top}X + \lambda I)^{-1} X^{\top}y \qquad (6.70)$$

For Cholesky factorization, one may compute $X^{\top}X$ and simply add λ to each of the diagonal elements (i.e.,
along the ridge). QR and SVD factorizations require similar, but slightly more complicated, modifications.
Note, use of SVD can improve the efficiency of searching for an optimal value for λ [71, 196].
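A minimal sketch of the closed-form solve in equation 6.70, assuming centered x_c and y_c (see the next subsection); the identity-matrix constructor MatrixD.eye and the matrix-plus-matrix and matrix-times-scalar operators are assumed to exist with the obvious meanings.

import scalation.mathstat.MatrixD

val lambda = 0.01                                   // shrinkage hyper-parameter (assumed value)
val n      = x_c.dim2                               // number of (centered) predictor columns
val ident  = MatrixD.eye (n, n)                     // identity matrix (constructor name assumed)
val b      = (x_c.T * x_c + ident * lambda).inverse * (x_c.T * y_c)
println (s"ridge parameters b = $b")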

6.7.4 Centering
Before creating a RidgeRegression model, the X data matrix and the y response vector should be centered.
This is accomplished by subtracting the means (vector of column means for X and a mean value for y).
1 val mu_x = x . mean // column - wise mean of x
2 val mu_y = y . mean // mean of y
3 val x_c = x - mu_x // centered x ( column - wise )
4 val y_c = y - mu_y // centered y

The centered matrix x c and center vector y c are then passed into the RidgeRegression constructor.
1 val mod = new RidgeRegression ( x_c , y_c )
2 mod . trainNtest () ()

Now, when making predictions, the new data vector z needs to be centered by subtracting mu_x. Then the predict method is called, after which the mean of y is added back.
1 val z_c = z - mu_x                                       // center z first
2 val yp  = mod.predict (z_c) + mu_y                       // predict z_c and add y's mean
3 println (s"predict (z) = $yp")

6.7.5 The λ Hyper-parameter
The value for λ can be user specified (typically a small value) or chosen by a method like findLambda. It
finds a roughly optimal value for the shrinkage parameter λ based on the cross-validated sum of squared
errors sse cv. The search starts with the low default value for λ and then doubles it with each iteration,
returning the minimizing λ and its corresponding cross-validated sse. A more precise search could be used
to provide a better value for λ.
1 def findLambda : (Double, Double) =
2     var l      = lambda                                  // start with a small default value
3     var l_best = l
4     var sse    = Double.MaxValue
5     for i <- 0 to 20 do
6         RidgeRegression.hp ("lambda") = l
7         val mod   = new RidgeRegression (x, y)
8         val stats = mod.crossValidate ()
9         val sse2  = stats (QoF.sse.ordinal).mean
10        banner (s"RidgeRegression with lambda = ${mod.lambda_} has sse = $sse2")
11        if sse2 < sse then { sse = sse2; l_best = l }
12        l *= 2
13    end for
14    (l_best, sse)                                        // best lambda and its sse_cv
15 end findLambda

6.7.6 Comparing RidgeRegression with Regression


This subsection compares the results of RidgeRegression with those of Regression by examining the estimated parameter vectors, the quality of fit, the predictions made, and the summary information.
1 // 5 data points :        x_0    x_1
2 val x = MatrixD ((5, 2), 36.0,  66.0,                    // 5-by-2 data matrix
3                          37.0,  68.0,
4                          47.0,  64.0,
5                          32.0,  53.0,
6                           1.0, 101.0)
7 val y = VectorD (745.0, 895.0, 442.0, 440.0, 1598.0)     // 5-dim response vector

First, create a Regression model with an intercept and produce a summary.


1 banner ( " Regression " )
2 val ox = VectorD . one ( y . dim ) + ˆ : x // prepend column of all 1 ’ s
3 val rg = new Regression ( ox , y ) // create a Regression model
4 rg . trainNtest () () // train and test the model

Second, create a RidgeRegression model using the centered data.

1 banner ("RidgeRegression")
2 val mu_x = x.mean                                        // column-wise mean of x
3 val mu_y = y.mean                                        // mean of y
4 val x_c  = x - mu_x                                      // centered x (column-wise)
5 val y_c  = y - mu_y                                      // centered y
6 val mod  = new RidgeRegression (x_c, y_c)                // create a Ridge Regression model
7 mod.trainNtest ()()                                      // train and test the model

Third, predict a value for new input vector z using each model.

1 banner ( " Make Predictions " )
2 val z = VectorD (20.0 , 80.0) // new instance to predict
3 val _1z = VectorD .++ (1.0 , z ) // prepend 1 to z
4 val z_c = z - mu_x // center z
5 println ( s " rg . predict (z) ={ rg . predict ( _1z ) } " ) // predict using _1z
6 println ( s " mod . predict (z) ={ mod . predict ( z_c ) + mu_y } " ) // predict using z_c and add
y ’s mean

The summary information for Regression is shown below.


1 println ( rg . summary () )

SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -281.426985 835.349154 -0.336897 0.768262 NA
x1 -7.611030 8.722908 -0.872534 0.474922 3.653976
x2 19.010291 8.423716 2.256758 0.152633 3.653976

Residual standard error: 159.206002 on 2.0 degrees of freedom


Multiple R-squared: 0.943907, Adjusted R-squared: 0.887815
F-statistic: 16.827632701228243 on 2.0 and 2.0 DF, p-value: 0.05609269703717268
----------------------------------------------------------------------------------

The summary information for RidgeRegression is shown below.


1 println ( mod . summary () )

SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -7.611271 8.722908 -0.872561 0.474910 NA
x1 19.009947 8.423716 2.256717 0.152638 3.653976

Residual standard error: 159.206002 on 2.0 degrees of freedom


Multiple R-squared: 0.943907, Adjusted R-squared: 0.887815
F-statistic: 16.827632684676626 on 2.0 and 2.0 DF, p-value: 0.056092697089250576
----------------------------------------------------------------------------------

Notice there is very little difference between the two models. Try increasing the value of the shrinkage
hyper-parameter λ beyond its default value of 0.01. This example can be run as follows:

$ sbt
sbt> runMain scalation.modeling.ridgeRegressionTest

Automatic Centering

ScalaTion provides factory methods, apply and center, in the RidgeRegression companion object that center the data for the user.

1 // val mod = RidgeRegression (xy, fname)                 // apply takes a combined matrix xy
2 val mod = RidgeRegression.center (x, y, fname)           // center takes a matrix x and vector y
3 mod.trainNtest ()()
4 val yp = mod.predict (z - x.mean) + y.mean

The user must still center any vectors passed into the predict method and add back the response mean at
the end, e.g., pass z - x.mean and add back y.mean.
Note, care should be taken regarding x.mean and y.mean when performing validation or crossValidation. The means for the full, training and testing sets may differ.
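For instance, the following minimal sketch centers both partitions using only the training-set statistics, so that no information from the test set leaks into the fitted model. The index-based 80/20 split and the range-slicing syntax are assumptions made for illustration.

// Sketch: center train/test partitions with TRAINING means only (slicing API assumed).
val (xTrain, yTrain) = (x(0 until 80), y(0 until 80))          // assumed training partition
val (xTest,  yTest)  = (x(80 until x.dim), y(80 until y.dim))  // assumed testing partition

val mu_xTr = xTrain.mean                                       // training column means
val mu_yTr = yTrain.mean                                       // training response mean

val mod = new RidgeRegression (xTrain - mu_xTr, yTrain - mu_yTr)
mod.train ()
val yp = mod.predict (xTest - mu_xTr) + mu_yTr                 // center test x by TRAINING means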

6.7.7 RidgeRegression Class

Class Methods:
1 @param x       the centered data/input m-by-n matrix NOT augmented with a column of 1s
2 @param y       the centered response/output m-vector
3 @param fname_  the feature/variable names (defaults to null)
4 @param hparam  the shrinkage hyper-parameter, lambda (0 => OLS) in the penalty term
5                'lambda * b dot b'
6
7 class RidgeRegression (x: MatrixD, y: VectorD, fname_ : Array [String] = null,
8                        hparam: HyperParameter = RidgeRegression.hp)
9       extends Predictor (x, y, fname_, hparam)
10      with Fit (dfm = x.dim2, df = x.dim - x.dim2 - 1):
11
12    def lambda_ : Double = lambda
13    def train (x_ : MatrixD = x, y_ : VectorD = y): Unit =
14    def test (x_ : MatrixD = x, y_ : VectorD = y): (VectorD, VectorD) =
15    def findLambda: (Double, Double) =
16    override def predict (x_ : MatrixD): VectorD = x_ * b
17    override def summary (x_ : MatrixD = getX, fname_ : Array [String] = fname,
18                          b_ : VectorD = b, vifs: VectorD = vif ()): String =
19    override def buildModel (x_cols: MatrixD): RidgeRegression =

6.7.8 Exercises
1. Based on the example given in this section, try increasing the value of the hyper-parameter λ and
examine its effect on the parameter vector b, the quality of fit and predictions made.
1 import RidgeRegression.hp
2
3 println (s"hp = $hp")
4 val hp2 = hp.updateReturn ("lambda", 1.0)
5 println (s"hp2 = $hp2")

Alternatively,
1 hp ( " lambda " ) = 1.0

See the HyperParameter class in the scalation.modeling package for details.

2. For the AutoMPG dataset, use the findLambda method to find a value for λ that roughly minimizes the out-of-sample sse_cv based on the crossValidate method. Plot sse_cv vs. λ.

3. Why is it important to center (zero mean) both the data matrix X and the response vector y? What
is scale invariance and how does it relate to centering the data?

4. The Degrees of Freedom (DoF) used in ScalaTion's RidgeRegression class is approximate. As the shrinkage parameter λ increases, the effective DoF (eDoF) should be used instead. A general definition of the effective DoF is the trace tr of the hat matrix H = X(X⊤X + λI)⁻¹X⊤

   dfm_eff = tr(H)

   Read [86] and explain the difference between DoF and effective DoF (eDoF) for Ridge Regression.

5. A matrix that is close to singularity is said to be ill-conditioned. The condition number κ of a matrix A (e.g., A = X⊤X) is defined as follows:

   κ = ‖A‖ ‖A⁻¹‖ ≥ 1

   When κ becomes large, the matrix is considered to be ill-conditioned. In such cases, it is recommended to use QR or SVD Factorization for Least-Squares Regression [39]. Compute the condition number of X⊤X for various datasets.

6. For the last term of the gradient of the objective function, show that

   ∂/∂b_j [ (λ/2) b · b ] = (λ/2) ∂/∂b_j Σ_i b_i² = λ b_j

   Put these together to show that ∇ [ (λ/2) b · b ] = λb
7. For over-parameterized (or under-determined) regression where n > m (number of parameters > number of instances), it is common to seek a min-norm solution.

   min_b { ‖b‖₂² | y = Xb }

   Use http://people.csail.mit.edu/bkph/articles/Pseudo_Inverse.pdf to derive a solution for the parameters b.

   b = X⊤(XX⊤)⁻¹ y

   Compare the use of regression and ridge regression on such problems.

8. Compare different algorithms for finding a suitable value for the shrinkage parameter λ.
Hint: see Lecture Notes on Ridge Regression - https://arxiv.org/pdf/1509.09169.pdf - [196].

6.8 Lasso Regression
The LassoRegression class supports multiple linear regression using the Least absolute shrinkage and selection operator (Lasso) that constrains the values of the b parameters and effectively sets those with low impact to zero (thereby deselecting such variables/features). Rather than using an ℓ2-penalty (Euclidean norm) like RidgeRegression, it uses an ℓ1-penalty (Manhattan norm). In RidgeRegression, when b_j approaches zero, b_j² becomes very small and has little effect on the penalty. For LassoRegression, the effect based on |b_j| will be larger, so it is more likely to set parameters to zero. See section 6.2.2 in [85] for a more detailed explanation of how LassoRegression can eliminate a variable/feature by setting its parameter/coefficient to zero.

6.8.1 Model Equation


As with Regression, the goal is to fit the parameter vector b ∈ ℝⁿ (k = n − 1) in the model/regression equation,

y = b · x + ε = b_0 + b_1 x_1 + ... + b_k x_k + ε    (6.71)

where ε represents the residuals (the part not explained by the model). See the exercise that considers whether to include the intercept b_0 in the shrinkage.

6.8.2 Training
The regularization of the model adds an ℓ1-penalty on the parameters b. The objective function to minimize is now the loss function L(b) = ½ sse plus the penalty.

f_obj = ½ sse + λ ‖b‖₁ = ½ ‖ε‖₂² + λ ‖b‖₁    (6.72)
where λ is the shrinkage parameter. Substituting ε = y − Xb yields,

f_obj = ½ ‖y − Xb‖₂² + λ ‖b‖₁    (6.73)
Replacing the norms with dot products gives,

f_obj = ½ (y − Xb) · (y − Xb) + λ 1 · |b|    (6.74)
Although similar to the ℓ2-penalty used in Ridge Regression, it may often be more effective. Still, the ℓ1-penalty for Lasso has a disadvantage: the absolute values in the ℓ1 norm make the objective function non-differentiable.

λ 1 · |b| = λ Σ_{j=0}^{k} |b_j|    (6.75)

Therefore, the straightforward strategy of setting the gradient equal to zero to develop appropriate modified Normal Equations that allow the parameters to be determined by matrix factorization no longer works. Instead, the objective function needs to be minimized using a search-based optimization algorithm.

6.8.3 Optimization Strategies
There are multiple optimization algorithms that can be applied for parameter estimation in Lasso Regression.

Coordinate Descent

Coordinate Descent attempts to optimize one variable/feature at a time (repeated one-dimensional optimization). For normalized data the following algorithm has been shown to work: https://xavierbourretsicotte.github.io/lasso_implementation.html.

Alternating Direction Method of Multipliers

ScalaTion uses the Alternating Direction Method of Multipliers (ADMM) [22] algorithm to optimize the b
parameter vector. The algorithm for using ADMM for Lasso Regression is outlined in section 6.4 of Boyd [22]
(https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). Optimization problems in ADMM
form separate the objective function into two parts f and g.

min f (b) + g(z) subject to b − z = 0 (6.76)

For Lasso Regression, the f function will capture the loss function (½ sse), while the g function will capture the ℓ1 regularization, i.e.,

f(b) = ½ ‖y − Xb‖₂²,    g(z) = λ ‖z‖₁    (6.77)
Introducing z allows the functions to be separated, while the constraint keeps z and b close. Therefore, the
iterative step in the ADMM optimization algorithm becomes
b = (X⊤X + ρI)⁻¹ (X⊤y + ρ(z − u))
z = S_{λ/ρ} (b + u)
u = u + b − z

where u is the vector of Lagrange multipliers and Sλ is the soft thresholding function.

Sλ (ζ) = sign(ζ) (|ζ| − λ)+ (6.78)

Note (a)+ = max(a, 0).


The details of ADMM for Lasso Regression are left as an exercise. In addition to Boyd, see [73] and scalation.optimization.LassoAdmm for coding details.
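To make the iteration above concrete, the following is a minimal sketch of soft thresholding and the ADMM loop for Lasso. It is not ScalaTion's actual LassoAdmm implementation; the helper names softThreshold and admmLasso, as well as the MatrixD methods transpose, eye and inverse, are assumptions made for illustration.

import scalation.mathstat.{MatrixD, VectorD}

// soft thresholding applied element-wise: S_lam (zeta) = sign(zeta) * max(|zeta| - lam, 0)
def softThreshold (v: VectorD, lam: Double): VectorD =
    v.map (zeta => math.signum (zeta) * math.max (math.abs (zeta) - lam, 0.0))

// Sketch of the ADMM iteration for Lasso (fixed rho, fixed number of iterations).
def admmLasso (x: MatrixD, y: VectorD, lam: Double, rho: Double = 1.0, iter: Int = 100): VectorD =
    val inv = (x.transpose * x + MatrixD.eye (x.dim2, x.dim2) * rho).inverse   // factor once, reuse
    val xty = x.transpose * y
    var b   = new VectorD (x.dim2)
    var z   = new VectorD (x.dim2)
    var u   = new VectorD (x.dim2)
    for k <- 1 to iter do
        b = inv * (xty + (z - u) * rho)                        // b-update (ridge-like solve)
        z = softThreshold (b + u, lam / rho)                   // z-update (soft thresholding)
        u = u + b - z                                          // dual update
    end for
    z                                                          // sparse solution
end admmLasso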

6.8.4 The λ Hyper-parameter


The shrinkage parameter λ can be tuned to control feature selection. The larger the value of λ, the more features (predictor variables) whose parameters/coefficients will be set to zero. The findLambda method may be used to find a value for λ that improves the cross-validated sum of squared errors sse_cv.
1 def findLambda : (Double, Double) =
2     var l      = lambda
3     var l_best = l
4     var sse    = Double.MaxValue
5     for i <- 0 to 20 do
6         LassoRegression.hp ("lambda") = l
7         val mod   = new LassoRegression (x, y)
8         val stats = mod.crossValidate ()
9         val sse2  = stats (QoF.sse.ordinal).mean
10        banner (s"LassoRegression with lambda = ${mod.lambda_} has sse = $sse2")
11        if sse2 < sse then
12            sse = sse2; l_best = l
13        end if
14        Fit.showQofStatTable (stats)
15        l *= 2
16    end for
17    (l_best, sse)
18 end findLambda

As the default value for the shrinkage/penalty parameter λ is very small, the optimal solution will be
close to the Ordinary Least Squares (OLS) solution shown in green at b = [b1 , b2 ] = [3, 1] in Figure 6.5.
Increasing the penalty parameter will pull the optimal b towards the origin. At any given point in the plane,
the objective function is the sum of the loss function L(b) and the penalty function p(b). The contours in
blue show points of equal height for the penalty function, while those in black show the same for the loss
function. Suppose for some λ the point [2, 0] is this penalized optimum. This would mean that moving toward the origin would be non-productive, as the increase in the loss would exceed the drop in the penalty. On the other hand, moving toward [3, 1] would be non-productive, as the increase in the penalty would exceed the drop in the loss. Notice in this case that the penalty has pulled the b_2 parameter to zero (an example of feature selection). Ridge regression will be less likely to pull a parameter to zero, as its contours are circles rather than diamonds. Lasso regression's contours have sharp points on the axes, which increases the chance of intersecting a loss contour on an axis.
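The tendency of Lasso to zero out coefficients can also be seen algebraically. For the special case of an orthonormal design matrix (X⊤X = I), a simplifying assumption made only for this illustration, the penalized solutions have well-known closed forms:

b̂_j^ridge = b̂_j^ols / (1 + λ)        b̂_j^lasso = S_λ(b̂_j^ols) = sign(b̂_j^ols) (|b̂_j^ols| − λ)₊

Ridge merely rescales every coefficient toward zero, while soft thresholding sets any coefficient with |b̂_j^ols| ≤ λ exactly to zero, which is why Lasso performs feature selection.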

Figure 6.5: Contour Curves for Lasso Regression (b_1 on the horizontal axis, b_2 on the vertical axis)

6.8.5 Regularized and Robust Regression
Regularized and Robust Regression are useful in many cases including high-dimensional data, correlated data, non-normal data and data with outliers [70]. These techniques work by adding ℓ1- and/or ℓ2-penalty terms to shrink the parameters and/or by changing from an ℓ2 to an ℓ1 loss function. Modeling techniques include Ridge, Lasso, Elastic Nets, Least Absolute Deviation (LAD) and Adaptive LAD [70].

6.8.6 LassoRegression Class

Class Methods:
1 @param x       the data/input m-by-n matrix
2 @param y       the response/output m-vector
3 @param fname_  the feature/variable names (defaults to null)
4 @param hparam  the shrinkage hyper-parameter, lambda (0 => OLS) in the penalty term
5                'lambda * b dot b'
6
7 class LassoRegression (x: MatrixD, y: VectorD, fname_ : Array [String] = null,
8                        hparam: HyperParameter = LassoRegression.hp)
9       extends Predictor (x, y, fname_, hparam)
10      with Fit (dfm = x.dim2 - 1, df = x.dim - x.dim2):
11
12    def lambda_ : Double = lambda
13    def findLambda: (Double, Double) =
14    def train (x_ : MatrixD = x, y_ : VectorD = y): Unit =
15    def test (x_ : MatrixD = x, y_ : VectorD = y): (VectorD, VectorD) =
16    override def summary (x_ : MatrixD = getX, fname_ : Array [String] = fname,
17                          b_ : VectorD = b, vifs: VectorD = vif ()): String =
18    override def buildModel (x_cols: MatrixD): LassoRegression =

6.8.7 Exercises
1. Compare the results of LassoRegression with those of Regression and RidgeRegression. Examine
the parameter vectors, quality of fit and predictions made.
1 // 5 data points :       one   x_0    x_1
2 val x = MatrixD ((5, 3), 1.0, 36.0,  66.0,               // 5-by-3 matrix
3                          1.0, 37.0,  68.0,
4                          1.0, 47.0,  64.0,
5                          1.0, 32.0,  53.0,
6                          1.0,  1.0, 101.0)
7 val y = VectorD (745.0, 895.0, 442.0, 440.0, 1598.0)
8 val z = VectorD (1.0, 20.0, 80.0)
9
10 // Create a LassoRegression model
11 val lrg = new LassoRegression (x, y)
12
13 // Predict a value for new input vector z using each model.

2. Based on the last exercise, try increasing the value of the hyper-parameter λ and examine its effect on
the parameter vector b, the quality of fit and predictions made.
1 import LassoRegression.hp
2
3 println (s"hp = $hp")
4 val hp2 = hp.updateReturn ("lambda", 1.0)
5 println (s"hp2 = $hp2")

3. Using the above dataset and the AutoMPG dataset, determine the effects of (a) centering the data
(µ = 0), (b) standardizing the data (µ = 0, σ = 1).
1 import MatrixTransforms._
2
3 val x_n = normalize (x, (mu_x, sig_x))
4 val y_n = y.standardize

4. Explain how the Coordinate Descent Optimization Algorithm works for Lasso Regression. See
https://xavierbourretsicotte.github.io/lasso_implementation.html.

5. Explain how the ADMM Optimization Algorithm works for Lasso Regression. See
https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf.

6. Compare LassoRegression with Regression that uses forward selection or backward elimination for feature selection. What are the advantages and disadvantages of each for feature selection?

7. Compare LassoRegression with Regression on the AutoMPG dataset. Specifically, compare the quality of fit measures as well as how well feature selection works.

8. Show that the contour curves for the Simple Regression loss function L(b_0, b_1) are ellipses. The general equation of an ellipse centered at (h, k) is

   A(x − h)² + B(x − h)(y − k) + C(y − k)² = 1

   where A, C > 0 and B² < 4AC.

9. Elastic Nets combine both ℓ2 and ℓ1 penalties to try to combine the best features of both RidgeRegression and LassoRegression. Elastic Nets naturally include two shrinkage parameters, λ1 and λ2. Is the additional complexity worth the benefits?

10. Regularization using Lasso has the nice property of being able to force parameters/coefficients to zero, but this may require a large shrinkage hyper-parameter λ that shrinks non-zero coefficients more than desired. Newer regularization techniques reduce the shrinkage effect compared to Lasso by having a penalty profile that matches Lasso for small coefficients, but is below Lasso for large coefficient values. Make a plot of the penalty profiles for Lasso, Smoothly Clipped Absolute Deviation (SCAD) and Minimax Concave Penalty (MCP).

6.8.8 Further Reading
1. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
[22]

2. The Application of Alternating Direction Method of Multipliers on ℓ1-norms Problems [73]

3. Feature Selection Using LASSO [49]

6.9 Quadratic Regression
The quadratic method in the SymbolicRegression object adds quadratic terms into the model. It can often be the case that the response variable y will have a nonlinear relationship with one or more of the predictor variables x_j. The simplest such nonlinear relationship is a quadratic relationship. Looking at a plot of y vs. x_j, it may be evident that a bending curve will fit the data much better than a straight line. For example, a particle under constant acceleration will have a position that changes quadratically with time.
When there is only one predictor variable x, the response y is modeled as a quadratic function of x
(forming a parabola).

y = b_0 + b_1 x + b_2 x² + ε    (6.79)

The quadratic method achieves this simply by expanding the data matrix. For every column of the dataset (initial data matrix), another column is added that contains the values of the original column squared. It is important that the initial data matrix has no intercept. The expansion will optionally add an intercept column (column of all ones). Since 1² = 1, the ones column and its square would be perfectly collinear and make the matrix singular if the user includes a ones column.

6.9.1 Model Equation


In two dimensions (2D) where x = [x_1, x_2], the quadratic model/regression equation is the following:

y = b · x′ + ε = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_2² + ε    (6.80)

where x′ = [1, x_1, x_2, x_1², x_2²], b = [b_0, b_1, b_2, b_3, b_4], and ε represents the residuals (the part not explained by the model).
The number of terms (nt) in the model increases linearly with the dimensionality of the space (n)
according to the following formula:

nt = 2n + 1 e.g., nt = 5 for n = 2 (6.81)

Each column in the initial data matrix is expanded into two in the expanded data matrix and an intercept
column is optionally added.
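As a sketch of what this expansion amounts to, using the MatrixD operators ++^ (column-wise concatenation) and ~^ (element-wise powers) described later in the Symbolic Regression section, the expanded matrix can be built roughly as follows. The helper name quadExpand is made up for illustration.

import scalation.mathstat.{MatrixD, VectorD}

// Sketch: expand an initial data matrix with squared columns and an optional intercept.
def quadExpand (x: MatrixD, intercept: Boolean = true): MatrixD =
    val xx = x ++^ x~^2.0                                      // [ x | x^2 ]  column-wise concatenation
    if intercept then VectorD.one (x.dim) +^: xx               // prepend a column of all ones
    else xx
end quadExpand

// e.g., for an m-by-2 matrix x, quadExpand (x) is m-by-5 with columns [1, x1, x2, x1^2, x2^2]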

6.9.2 Comparison of quadratic and Regression


This subsection compares the results and Quality of Fit of the quadratic method in the SymbolicRegression object to those of the Regression class. Factors in choosing between the two include the accuracy of the model and the information provided by the summary method (e.g., p-values and VIF).
1 // 8 data points : x y
2 val xy = MatrixD ((8 , 2) , 1 , 2, // 8 - by -2 combined matrix
3 2, 5,
4 3, 10 ,
5 4, 15 ,
6 5, 20 ,
7 6, 30 ,
8 7, 50 ,
9 8, 60)

10 val (x, y) = (xy.not (?, 1), xy (?, 1))                 // x is first column, y is last column
11 val ox = VectorD.one (xy.dim) +^: x                     // prepend a column of all ones
12
13 val rg = new Regression (ox, y)                         // create a regression model
14 rg.trainNtest ()()                                      // train and test the model
15 println (rg.summary ())                                 // show summary
16
17 val qrg = SymbolicRegression.quadratic (x, y)           // create a quadratic regression model
18 qrg.trainNtest ()()                                     // train and test the model
19 println (qrg.summary ())                                // show summary

Now compare their summary results. The summary results for the Regression model are shown below:

SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -13.285714 5.154583 -2.577457 0.041913 NA
x1 8.285714 1.020760 8.117205 0.000188 1.000000

Residual standard error: 6.615278 on 6.0 degrees of freedom


Multiple R-squared: 0.916538, Adjusted R-squared: 0.902628
F-statistic: 65.88900979325355 on 1.0 and 6.0 DF, p-value: 1.8767258045970792E-4
----------------------------------------------------------------------------------

The summary results for the SymbolicRegression.quadratic model are given here:

SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 4.035714 3.873763 1.041807 0.345231 NA
x1 -2.107143 1.975007 -1.066904 0.334798 21.250000
x2 1.154762 0.214220 5.390553 0.002965 21.250000

Residual standard error: 2.776603 on 5.0 degrees of freedom


Multiple R-squared: 0.987747, Adjusted R-squared: 0.982846
F-statistic: 201.53335392217406 on 2.0 and 5.0 DF, p-value: 4.872837579850131E-5
----------------------------------------------------------------------------------

The summary results for the SymbolicRegression.quadratic model highlight a couple of important
issues:

1. moderately high Pr(>|t|) (p-values) and

2. borderline high VIF (Variance Inflation Factor) values.

Try eliminating x1 to see if these two improve without much of a drop in Adjusted R-squared R̄2 . Note,
eliminating x1 makes the model non-hierarchical (see the exercises). Figure 6.6 shows the predictions (yp)
of the Regression and quadratic models.

Figure 6.6: Actual y (red) vs. Regression yp (green) vs. quadratic yp (blue)

The quadratic method in the SymbolicRegression object creates a Regression object that uses multiple regression to fit a quadratic surface to the data.

6.9.3 SymbolicRegression.quadratic Method

Method:
1 @param x          the initial data/input m-by-n matrix (before quadratic term expansion)
2                   must not include an intercept column of all ones
3 @param y          the response/output m-vector
4 @param fname      the feature/variable names (defaults to null)
5 @param intercept  whether to include the intercept term (column of ones) _1
6                   (defaults to true)
7 @param cross      whether to include 2-way cross/interaction terms x_i x_j
8                   (defaults to false)
9 @param hparam     the hyper-parameters (defaults to Regression.hp)
10
11 def quadratic (x: MatrixD, y: VectorD, fname: Array [String] = null,
12                intercept: Boolean = true, cross: Boolean = false,
13                hparam: HyperParameter = Regression.hp): Regression =
14     val mod = apply (x, y, fname, Set (1, 2), intercept, cross, false, hparam)
15     mod.modelName = "SymbolicRegression.quadratic" + (if cross then "X" else "")
16     mod
17 end quadratic

The apply method is defined in the SymbolicRegression object. The Set (1, 2) specifies that first
(Linear) and second (Quadratic) order terms will be included in the model. The intercept flag indicates
whether a column of ones will be added to the input/data matrix.

The next few modeling techniques described in subsequent sections support the development of low-order
multi-dimensional polynomial regression models. Higher order polynomial regression models are typically
restricted to one-dimensional problems (see the PolyRegression class).

6.9.4 Quadratic Regression with Cross Terms


The quadratic method provides the option of adding cross/interaction terms in addition to the quadratic
terms. The cross flag indicates whether cross terms will be added to the model. A cross term such as the
one based on the product x1 x2 indicates a combined effect of two predictor variables on the response variable
y.
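For example, based on the quadratic method signature shown in the previous subsection, the cross terms can be requested with a named argument (a short usage sketch):

val qrgX = SymbolicRegression.quadratic (x, y, cross = true)   // adds cross terms x_i x_j (i != j)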

Model Equation

In two dimensions (2D) where x = [x_1, x_2], the quadratic cross model/regression equation is the following:

y = b · x′ + ε = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_2² + b_5 x_1 x_2 + ε    (6.82)

where the components of the model equation are defined as follows:

x′ = [1, x_1, x_2, x_1², x_2², x_1 x_2]    expanded input vector
b  = [b_0, b_1, b_2, b_3, b_4, b_5]        parameter/coefficient vector
ε  = y − b · x′                            error/residual

The number of terms (nt) in the model increases quadratically with the dimensionality of the space (n) according to the formula for triangular numbers shifted by (n → n + 1).

nt = (n+2 choose 2) = (n + 2)(n + 1) / 2    e.g., nt = 6 for n = 2    (6.83)

This result may be derived by summing the number of constant terms (1), linear terms (n), quadratic terms (n), and cross terms (n choose 2).


Such models generalize quadratic by introducing cross terms, e.g., x_1 x_2. Adding cross terms makes the number of terms increase quadratically rather than linearly with the dimensionality. Consequently, multi-collinearity problems (check VIF scores) may be intensified, and the need for feature selection therefore increases.
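For concreteness, the count for n = 2 can be verified term by term:

nt = 1 (constant) + n (linear) + n (quadratic) + n(n − 1)/2 (cross) = 1 + 2 + 2 + 1 = 6 = (n + 2)(n + 1)/2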

6.9.5 Response Surface


One may think of a quadratic model as well as more complex models as approximating a response surface
in multiple dimensions.

y = f (x1 , x2 ) +  (6.84)
For example, a model with two predictor variables and one response variable may be displayed in three
dimensions. Such a response surface can also be shown in two dimensions using contour plots where a
contour/curve shows points of equal height. Figure 6.7 shows three types of contours that represent the types of terms in quadratic regression: (1) linear terms, (2) quadratic terms, and (3) cross terms. In the figure, the first green line is for x_1 + x_2 = 4, the first blue curve is for x_1² + x_2² = 16, and the first red curve is for x_1 x_2 = 4.

Figure 6.7: quadratic Contours: x_1 + x_2 (green), x_1² + x_2² (blue), x_1 x_2 (red)

A constant term simply moves the whole response surface up or down. The coefficients for each of the terms can rotate and stretch these curves. The response surface for Quadratic Regression on AutoMPG based on the best combination of features, weight and modelyear, is shown in Figure 6.8.

Figure 6.8: Response Surface: ŷ = 355.139 − 21.1463 x_1 − 8.50562 x_2 + 2.29950 x_1² + 0.0614339 x_2²

6.9.6 Exercises
1. Enter the x, y dataset from the example given in this section and use it to create a quadratic model.
Show the expanded input/data matrix and the response vector using the following two print statements.
1 val qrg = SymbolicRegression.quadratic (x, y)
2 println (s"expanded x = ${qrg.getX}")
3 println (s"y = ${qrg.getY}")

2. Perform Quadratic Regression on the Example BPressure dataset using the first two columns of its
data matrix x.
1 import Example_BPressure.{x01 => x, y}

3. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?

4. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * m * m).
1 for i <- x . indices do
2 x (i , 0) = i
3 y ( i ) = i * i + i + noise . gen
4 end for

Compare the results of Regression vs. quadratic. Compare the Quality of Fit and the parameter
values. What correspondence do the parameters have with the coefficients used to generate the data?
Plot y vs. x, yp and y vs. t for both Regression and quadratic. Also plot the residuals e vs. x for
both. Note, t is the index vector VectorD.range (0, m).

5. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid; j <- grid do
3     x(k) = VectorD (i, j)
4     y(k) = x(k, 0)~^2 + 2 * x(k, 1) + noise.gen
5     k += 1
6 end for

Compare the results of Regression vs. quadratic. Try modifying the equation for the response and
see how the Quality of Fit changes.

6. The quadratic model as well as its more complex cousin cubic may have issues with high multi-collinearity or high VIF values. Although high VIF values may not be a problem for prediction accuracy, they can make interpretation and inferencing difficult. For the problem given in this section, rather than adding x² to the existing Regression model, find a second order polynomial that could be added without causing high VIF values. VIF values are the lowest when column vectors are orthogonal. See the section on Polynomial Regression for more details.

7. Extrapolation far from the training data can be risky for many types of models. Show how having
higher order polynomial terms in the model can increase this risk.

8. A polynomial regression model is said to be hierarchical [143, 167, 127] if it contains all terms up to x^k, e.g., a model with x, x², x³ is hierarchical, while a model with x, x³ is not. Show that hierarchical models are invariant under linear transformations.
   Hint: Consider the following two models where x is the distance on I-70 West in miles from the center of Denver (junction with I-25) and y is the elevation in miles above sea level.

   ŷ = b_0 + b_1 x + b_2 x²
   ŷ = b_0 + b_2 x²

   The first model is hierarchical, while the second is not. A second study is conducted, but now the distance z is from the junction of I-70 and I-76. A linear transformation can be used to resolve the problem.

   x = z + 7

   Putting z into the second model (assuming the first study indicated a linear term is not needed) gives,

   ŷ = b_0 + b_2 (z + 7)² = (b_0 + 49 b_2) + 14 b_2 z + b_2 z²

   but now the linear term is back in the model.

9. Perform quadratic and quadratic (with cross terms) regression on the Example BPressure dataset
using the first two columns of its data matrix x.
1 import Example_BPressure.{x01 => x, y}

10. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?

11. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid; j <- grid do
3     x(k) = VectorD (i, j)
4     y(k) = x(k, 0)~^2 + 2 * x(k, 1) + x(k, 0) * x(k, 1) + noise.gen
5     k += 1
6 end for

Compare the results of Regression, quadratic with cross = false, and quadratic with cross =
true.

12. Prove that the number of terms for a quadratic function f(x) in n dimensions is (n+2 choose 2), by decomposing the function into its quadratic (both squared and cross), linear and constant terms,

   f(x) = x⊤Ax + b⊤x + c

   where A is an n-by-n matrix, b is an n-dimensional column vector and c is a scalar. Hint: A is symmetric, but the main diagonal is not repeated, and we are looking for unique terms (e.g., x_1 x_2 and x_2 x_1 are treated as the same). Note, when n = 1, A and b become scalars, yielding the usual quadratic function ax² + bx + c.

6.10 Cubic Regression
The cubic method in the SymbolicRegression object adds cubic terms in addition to the quadratic terms
added by the quadratic method. Linear terms in a model allow for slopes and quadratic terms allow for
curvature. If the curvature changes substantially or there is an inflection point (curvature changes sign), then
cubic terms may be useful. For example, before the inflection point the curve/surface may be concave upward,
while after the point it may be concave downward, e.g., a car stops accelerating and starts decelerating.
When there is only one predictor variable x, the response y is modeled as a cubic function of x.

y = b_0 + b_1 x + b_2 x² + b_3 x³ + ε    (6.85)

6.10.1 Model Equation


In two dimensions (2D) where x = [x_1, x_2], the cubic regression equation is the following:

y = b · x′ + ε = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_2² + b_5 x_1³ + b_6 x_2³ + ε    (6.86)

where the components of the model equation are defined as follows:

x′ = [1, x_1, x_2, x_1², x_2², x_1³, x_2³]    expanded input vector
b  = [b_0, b_1, b_2, b_3, b_4, b_5, b_6]      parameter/coefficient vector
ε  = y − b · x′                               error/residual

The number of terms (nt) in the model still increases quadratically with the dimensionality of the space (n) according to the formula for triangular numbers shifted by (n → n + 1), plus n for the cubic terms.

nt = (n+2 choose 2) + n = (n + 2)(n + 1) / 2 + n    e.g., nt = 8 for n = 2    (6.87)

When n = 10, the number of terms and corresponding parameters is nt = 76, whereas for Regression, quadratic, and quadratic with cross terms (order 2), it would be 11, 21 and 66, respectively. Issues related to negative Degrees of Freedom, over-fitting and multi-collinearity will need careful attention.

6.10.2 Comparison of cubic, quadratic and Regression


This subsection compares the cubic method to the quadratic method and the Regression class.
1 // 8 data points : x y
2 val xy = MatrixD ((8 , 2) , 1 , 2 , // 8 - by -2 combined matrix
3 2 , 11 ,
4 3 , 25 ,
5 4 , 28 ,
6 5 , 30 ,
7 6 , 26 ,
8 7 , 42 ,
9 8 , 60)
10 val (x , y ) = ( xy . not (? , 1) , xy (? , 1) ) // x is first column , y is last column
11 val ox = VectorD.one (x.dim) +^: x                      // prepend a column of all ones
12

13 val rg = new Regression (ox, y)                         // create a regression model
14 rg.trainNtest ()()                                      // train and test the model
15 println (rg.summary ())                                 // show summary
16
17 val qrg = SymbolicRegression.quadratic (x, y)           // create a quadratic regression model
18 qrg.trainNtest ()()                                     // train and test the model
19 println (qrg.summary ())                                // show summary
20
21 val crg = SymbolicRegression.cubic (x, y)               // create a cubic regression model
22 crg.trainNtest ()()                                     // train and test the model
23 println (crg.summary ())                                // show summary

Figure 6.9 shows the predictions (yp) of the Regression, quadratic and cubic models.

Figure 6.9: Actual y (red) vs. Regression (green) vs. quadratic (blue) vs. cubic (black)

Notice the quadratic curve follows the linear curve (line), while the cubic curve more closely follows the data.

6.10.3 SymbolicRegression.cubic Method

Class Methods:
1 @param x          the initial data/input m-by-n matrix (before quadratic term expansion)
2                   must not include an intercept column of all ones
3 @param y          the response/output m-vector
4 @param fname      the feature/variable names (defaults to null)
5 @param intercept  whether to include the intercept term (column of ones) _1
6                   (defaults to true)
7 @param cross      whether to include 2-way cross/interaction terms x_i x_j
8                   (defaults to false)
9 @param cross3     whether to include 3-way cross/interaction terms x_i x_j x_k
10                  (defaults to false)
11 @param hparam    the hyper-parameters (defaults to Regression.hp)
12
13 def cubic (x: MatrixD, y: VectorD, fname: Array [String] = null,
14            intercept: Boolean = true, cross: Boolean = false, cross3: Boolean = false,
15            hparam: HyperParameter = Regression.hp): Regression =
16     val mod = apply (x, y, fname, Set (1, 2, 3), intercept, cross, cross3, hparam)
17     mod.modelName = "SymbolicRegression.cubic" + (if cross then "X" else "") +
18                     (if cross3 then "X" else "")
19     mod
20 end cubic

The Set (1, 2, 3) specifies that first (Linear), second (Quadratic), and third (Cubic) order terms will
be included in the model. The intercept flag indicates whether a column of ones will be added to the
input/data matrix.

6.10.4 Cubic Regression with Cross Terms


The cubic method provides the option of adding 2-way cross/interaction terms (e.g., x_2 x_1) controlled by the cross flag and/or 3-way cross/interaction terms (e.g., x_1² x_2) controlled by the cross3 flag.

Model Equation

In two dimensions (2D) where x = [x_1, x_2], the cubic model/regression equation with cross terms is the following:

y = b · x′ + ε = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_2² + b_5 x_1³ + b_6 x_2³ + b_7 x_1 x_2 + ε    (6.88)

and with cross3 terms is

y = b · x′ + ε = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1² + b_4 x_2² + b_5 x_1³ + b_6 x_2³ + b_7 x_1 x_2 + b_8 x_1² x_2 + b_9 x_1 x_2² + ε    (6.89)

where the components of the model equation are defined as follows:

x′ = [1, x_1, x_2, x_1², x_2², x_1³, x_2³, x_1 x_2, x_1² x_2, x_1 x_2²]    expanded input vector
b  = [b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8, b_9]                    parameter/coefficient vector
ε  = y − b · x′                                                            error/residual

Naturally, the number of terms in the model increases cubically with the dimensionality of the space (n) according to the formula for tetrahedral numbers shifted by (n → n + 1).

nt = (n+3 choose 3) = (n + 3)(n + 2)(n + 1) / 6    e.g., nt = 10 for n = 2    (6.90)

When n = 10, the number of terms and corresponding parameters is nt = 286, whereas for Regression, quadratic, quadratic with cross terms, and cubic without cross terms, it would be 11, 21, 66 and 76, respectively. Issues related to negative Degrees of Freedom, over-fitting and multi-collinearity will need even more careful attention.

If polynomials of higher degree are needed, ScalaTion provides a couple of means to deal with it. First, when the data matrix consists of a single column and x is one dimensional, the PolyRegression class may be used. If one or two variables need higher degree terms, the caller may add these columns themselves as additional columns in the data matrix input into the Regression class. The SymbolicRegression object described in the next section allows the user to try many functional forms.

Categorical Variables and Collinearity

Quadratic and Cubic Regression may fail, producing Not-a-Number (NaN) results, when a dataset contains one or more categorical variables. For example, a variable like citizen with values "no", "yes" is likely to be encoded 0, 1. If such a column is squared or cubed, the new column will be identical to the original column, so the two will be perfectly collinear. One solution is not to expand such columns. If one must, then a different encoding may be used, e.g., 1, 2. See the section on RegressionCat for more details.
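A quick check, a hedged sketch using VectorD and the ~^ operator used elsewhere in this chapter, shows why such a column leads to perfect collinearity:

import scalation.mathstat.VectorD

val citizen = VectorD (0.0, 1.0, 1.0, 0.0, 1.0)                // 0/1 encoded categorical column
println (citizen~^2.0)                                         // identical to citizen => perfectly collinear
println (citizen~^3.0)                                         // cubing changes nothing either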

6.10.5 Exercises
1. Generate and compare the model summaries produced by the three models (Regression, quadratic
and cubic) applied to the dataset given in this section.

2. An inflection point occurs when the second derivative changes sign. Find the inflection point in the following cubic equation:

   y = f(x) = x³ − 6x² + 12x − 5

   Plot the cubic function to illustrate. Explain why there are no inflection points for quadratic models.

3. Many laws in science involve quadratic and cubic terms as well as the inverses of these terms (e.g.,
inverse square laws). Find such a law and an open dataset to test the law.

4. Perform Cubic and Cubic with cross terms Regression on the Example BPressure dataset using the
first two columns of its data matrix x.
1 import Example_BPressure.{x01 => x, y}

5. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?

6. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.
1 var k = 0
2 for i <- grid; j <- grid do
3     x(k) = VectorD (i, j)
4     y(k) = x(k, 0)~^2 + 2 * x(k, 1) + x(k, 0) * x(k, 1) + noise.gen
5     k += 1
6 end for

Compare the results of Regression, quadratic with cross = false, quadratic with cross = true, cubic with cross = false, cubic with cross = true, and cubic with cross = true, cross3 = true. Try modifying the equation for the response and see how the Quality of Fit changes.

6.11 Symbolic Regression
The last two sections covered Quadratic and Cubic Regression, but there are many possible functional forms. For example, in physics, force often decreases with distance following an inverse square law. Newton's Law of Universal Gravitation states that masses m_1 and m_2 with center of mass positions at p_1 and p_2 (with distance r = ‖p_2 − p_1‖) will attract each other with force f,

f = G m_1 m_2 / r²    (6.91)
where the gravitational constant G = 6.67408 · 10⁻¹¹ m³ kg⁻¹ s⁻².

6.11.1 Sample Calculation


Let m_1 be the mass of a man (100 kg), m_2 be the mass of the Earth (5.97219 · 10²⁴ kg), and r be the distance to the center of the Earth (sea-level) (6.371 · 10⁶ m), then

f = 6.67408 · 10⁻¹¹ · (100 · 5.97219 · 10²⁴) / (6.371 · 10⁶)² = 982 kg · m/s² (or newtons)
The Calc object in ScalaTion may be used to evaluate the following function, passing in 100 kilograms.
1 def f (x: Double): Double = 6.67408E-11 * 5.97219E24 * x / 6.371E6~^2

The calculation is performed by the following: runMain scalation.runCalc 100.

6.11.2 As a Data Science Problem


This can be recast as a symbolic regression problem using the following renaming (m_1 → x_0, m_2 → x_1, r → x_2, f → y).

y = b_0 x_0 x_1 x_2⁻² + ε    (6.92)

Given a four-column dataset [x_0, x_1, x_2, y], a Symbolic Regression could be run to estimate a more general model that includes all possible terms with powers x_j⁻², x_j⁻¹, x_j¹, x_j². It could also include cross (two-way interaction) terms between all these terms. In this case, it is necessary to add cross3 (three-way interaction) terms. An intercept would imply force with no masses involved, so it should be left out of the model.
It is easier to collect data where the Earth is used for mass 1 and mass 2 is for people at various distances from the center of the Earth (m_1 → x_0, r → x_1, f → y).

y = b_0 x_0 x_1⁻² + ε    (6.93)

In this case the parameter b0 will correspond to GM , where G is the Gravitational Constant and M is the
Mass of the Earth. The following code provides simulated data and uses symbolic regression to determine
the Gravitational Constant.
1 val noise = Normal (0, 10)                               // random noise
2 val rad   = Uniform (6370, 7000)                         // distance from center of Earth in km
3 val mas   = Uniform (50, 150)                            // mass of person
4
5 val M = 5.97219E24                                       // mass of Earth in kg
6 val G = 6.67408E-11                                      // gravitational const. m^3 kg^-1 s^-2
7
8 val xy = new MatrixD (100, 3)                            // simulated gravity data
9 for i <- xy.indices do
10     val m = mas.gen                                     // unit of kilogram (kg)
11     val r = 1000 * rad.gen                              // unit of meter (m)
12     xy(i, 0) = m                                        // mass of person
13     xy(i, 1) = r                                        // radius/distance
14     xy(i, 2) = G * M * m / r~^2 + noise.gen             // force of gravity GMm/r^2
15 end for
16
17 val fname = Array ("mass", "radius")
18
19 println (s"xy = $xy")
20 val (x, y) = (xy.not (?, 2), xy (?, 2))
21
22 banner ("Newton's Universal Gravity Symbolic Regression")
23 val mod = SymbolicRegression (x, y, fname, null, false, false,
24                               terms = Array ((0, 1.0), (1, -2.0)))   // add one custom term
25
26 mod.trainNtest ()()                                      // train and test the model
27 println (mod.summary ())                                 // parameter/coefficient statistics
28 println (s"b =~ GM = ${G * M}")                          // Gravitational Constant * Earth Mass

The statement val mod = SymbolicRegression (...) invokes the factory method called apply in the
SymbolicRegression object. The SymbolicRegression object provides methods for quadratic, cubic, and
more general symbolic regression.

6.11.3 SymbolicRegression Object

Object Methods:
1 object SymbolicRegression:
2
3     def apply (x: MatrixD, y: VectorD, fname: Array [String] = null, ...
4     def buildMatrix (x: MatrixD, fname: Array [String], ...
5     def rescale (x: MatrixD, y: VectorD, fname: Array [String] = null, ...
6     def crossNames (nm: Array [String]): Array [String] =
7     def crossNames3 (nm: Array [String]): Array [String] =
8     def quadratic (x: MatrixD, y: VectorD, fname: Array [String] = null, ...
9     def cubic (x: MatrixD, y: VectorD, fname: Array [String] = null, ...
10
11 end SymbolicRegression

The apply method is flexible enough to include many functional forms as terms in a model. Feature
selection can be used to eliminate many of the terms to produce a meaningful and interpretable model.
Note, unless measurements are precise and experiments are controlled, other terms besides the one given by Newton's Law of Universal Gravitation are likely to be selected.
1 @param x          the initial data/input m-by-n matrix (before expansion)
2                   must not include an intercept column of all ones
3 @param y          the response/output m-vector
4 @param fname      the feature/variable names (defaults to null)
5 @param powers     the set of powers to raise matrix x to (defaults to null)
6 @param intercept  whether to include the intercept term (column of ones) _1
7                   (defaults to true)
8 @param cross      whether to include 2-way cross/interaction terms x_i x_j
9                   (defaults to true)
10 @param cross3    whether to include 3-way cross/interaction terms x_i x_j x_k
11                  (defaults to false)
12 @param hparam    the hyper-parameters (defaults to Regression.hp)
13 @param terms     custom terms to add into the model, e.g.,
14                  Array ((0, 1.0), (1, -2.0)) adds x0 x1^(-2)
15
16 def apply (x: MatrixD, y: VectorD, fname: Array [String] = null,
17            powers: Set [Double] = null, intercept: Boolean = true,
18            cross: Boolean = true, cross3: Boolean = false,
19            hparam: HyperParameter = Regression.hp,
20            terms: Array [Xj2p]*): Regression =
21     val fname_ = if fname != null then fname
22                  else x.indices2.map ("x" + _).toArray
23
24     val (xx, f_name) = buildMatrix (x, fname_, powers, intercept, cross, cross3,
25                                     terms :_*)
26     val mod = new Regression (xx, y, f_name, hparam)
27     mod.modelName = "SymbolicRegression" + (if cross then "X" else "") +
28                     (if cross3 then "X" else "")
29     mod
30 end apply

where type Xj2p = (Int, Double) indicates raising column Xj to the p-th power.

6.11.4 Implementation of the apply Method


The apply method forms an expanded matrix and passes it to the Regression class. The following arguments
control what terms are added to a model:

1. The powers set takes each column in matrix X and raises it to the p-th power for every p ∈ powers. The expression X^p produces a matrix with all columns raised to the p-th power. For example, Set (1, 2, 0.5) will add the original columns, quadratic columns, and square root columns.

2. The intercept flag indicates whether an intercept (column of ones) is to be added to the model.
Again, such a column must not be included in the original matrix.

3. The cross flag indicates whether two-way cross/interaction terms of the form x_i x_j (for i ≠ j) are to be added to the model.

4. The cross3 flag indicates whether three-way cross/interaction terms of the form xi xj xk (for i, j, k not
all the same) are to be added to the model.

5. The terms (repeated) array allows custom terms to be added into the model. For example,

   Array ((0, 1.0), (1, -2.0))

   adds the term x_0 x_1⁻² to the model. As this argument is repeated (Array [Xj2p]*) due to the star (*), additional custom terms may be added. The * makes the last argument a vararg.

Much of the functionality to do this is supplied by the MatrixD class in the mathstat package. The operator ++^ concatenates two matrices column-wise, while the operator x~^p returns a new matrix where each of the columns in the original matrix is raised to the p-th power. The crossAll method returns a new matrix consisting of columns that multiply each column by every other column. The crossAll3 method returns a new matrix consisting of columns that multiply each column by all combinations of two other columns.
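A small sketch of these operators composed together, assuming they behave as just described:

import scalation.mathstat.{MatrixD, VectorD}

val x  = MatrixD ((2, 2), 1.0, 2.0,                            // tiny 2-by-2 example matrix
                          3.0, 4.0)
val xx = x ++^ x~^2.0 ++^ x.crossAll                           // columns: x1, x2, x1^2, x2^2, x1*x2
println (VectorD.one (x.dim) +^: xx)                           // prepend the intercept column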

buildMatrix Method

The bulk of the work is done by the buildMatrix method that creates the input data matrix, column by
column.
1 def buildMatrix (x: MatrixD, fname: Array [String],
2                  powers: Set [Double], intercept: Boolean,
3                  cross: Boolean, cross3: Boolean,
4                  terms: Array [Xj2p]*): (MatrixD, Array [String]) =
5     val _1     = VectorD.one (x.dim)                     // one vector
6     var xx     = new MatrixD (x.dim, 0)                  // start empty
7     var fname_ = Array [String] ()
8
9     if powers != null then
10        if powers contains 1 then
11            xx     = xx ++^ x                            // add linear terms x
12            fname_ = fname
13        end if
14        for p <- powers if p != 1 do
15            xx      = xx ++^ x~^p                        // add other power x^p terms
16            fname_ ++= fname.map ((n) => s"$n^${p.toInt}")
17        end for
18    end if
19
20    if terms != null then
21        debug ("buildMatrix", s"add custom terms = ${stringOf (terms)}")
22        var z = _1.copy
23        var s = ""
24        for t <- terms do
25            for (j, p) <- t do                           // x_j to the p-th power
26                z *= x(?, j)~^p
27                s  = s + s"x$j^${p.toInt}"
28            end for
29            xx     = xx :^+ z                            // add custom term/column t
30            fname_ = fname_ :+ s
31        end for
32    end if
33
34    if cross then
35        xx      = xx ++^ x.crossAll                      // add 2-way cross x_i x_j
36        fname_ ++= crossNames (fname)
37    end if
38
39    if cross3 then
40        xx      = xx ++^ x.crossAll3                     // add 3-way cross x_i x_j x_k
41        fname_ ++= crossNames3 (fname)
42    end if
43
44    if intercept then
45        xx     = _1 +^: xx                               // add intercept term (_1)
46        fname_ = Array ("one") ++ fname_
47    end if
48
49    (xx, fname_)                                         // return expanded matrix
50 end buildMatrix

6.11.5 Regularization

Because symbolic regression may introduce many terms into the model and exhibit high multi-collinearity, regularization becomes even more important.

Symbolic Ridge Regression

Symbolic Ridge Regression can be beneficial in dealing with multi-collinearity. The SymRidgeRegression object supports the same methods that SymbolicRegression does, except buildMatrix, which it reuses.

1 object SymRidgeRegression:
2
3 @param x       the initial data/input m-by-n matrix (before expansion)
4                must not include an intercept column of all ones
5 @param y       the response/output m-vector
6 @param fname   the feature/variable names (defaults to null)
7 @param powers  the set of powers to raise matrix x to (defaults to null)
8 @param cross   whether to include 2-way cross/interaction terms x_i x_j
9                (defaults to true)
10 @param cross3 whether to include 3-way cross/interaction terms x_i x_j x_k
11               (defaults to false)
12 @param hparam the hyper-parameters (defaults to RidgeRegression.hp)
13 @param terms  custom terms to add into the model, e.g.,
14               Array ((0, 1.0), (1, -2.0)) adds x0 x1^(-2)
15
16 def apply (x: MatrixD, y: VectorD, fname: Array [String] = null,
17            powers: Set [Double] = null, cross: Boolean = true, cross3: Boolean = false,
18            hparam: HyperParameter = RidgeRegression.hp,
19            terms: Array [Xj2p]*): RidgeRegression =
20     val fname_ = if fname != null then fname
21                  else x.indices2.map ("x" + _).toArray                // default names
22
23     val (xx, f_name) = SymbolicRegression.buildMatrix (x, fname_, powers,
24                                                        false, cross, cross3, terms :_*)
25     // val mod = new RidgeRegression (xx, y, f_name, hparam)          // user centers
26     val mod = RidgeRegression.center (xx, y, f_name, hparam)          // auto. centers
27     mod.modelName = "SymRidgeRegression" + (if cross then "X" else "") +
28                     (if cross3 then "XX" else "")
29     mod
30 end apply

It requires the data to be centered and has no intercept (see exercises).

Symbolic Lasso Regression

Other forms of regularization can be useful as well. Symbolic Lasso Regression can be beneficial in dealing with multi-collinearity and, more importantly, in setting some parameters/coefficients b_j to zero, thereby eliminating the j-th term. This is particularly important for symbolic regression as the number of possible terms can become very large.
1 object SymLassoRegression:
2
3 @param x          the initial data/input m-by-n matrix (before expansion)
4                   must not include an intercept column of all ones
5 @param y          the response/output m-vector
6 @param fname      the feature/variable names (defaults to null)
7 @param intercept  whether to include the intercept term (column of ones) _1 (defaults to true)
8 @param powers     the set of powers to raise matrix x to (defaults to null)
9 @param cross      whether to include 2-way cross/interaction terms x_i x_j (defaults to true)
10 @param cross3    whether to include 3-way cross/interaction terms x_i x_j x_k (defaults to false)
11 @param hparam    the hyper-parameters (defaults to LassoRegression.hp)
12 @param terms     custom terms to add into the model, e.g., Array ((0, 1.0), (1, -2.0))
13                  adds x0 x1^(-2)
14
15 def apply (x: MatrixD, y: VectorD, fname: Array [String] = null,
16            powers: Set [Double] = null, intercept: Boolean = true,
17            cross: Boolean = true, cross3: Boolean = false,
18            hparam: HyperParameter = LassoRegression.hp,
19            terms: Array [Xj2p]*): LassoRegression =
20     val fname_ = if fname != null then fname
21                  else x.indices2.map ("x" + _).toArray                // default names
22
23     val (xx, f_name) = SymbolicRegression.buildMatrix (x, fname_, powers, intercept,
24                                                        cross, cross3, terms :_*)
25     val mod = new LassoRegression (xx, y, f_name, hparam)
26     mod.modelName = "SymLassoRegression" + (if cross then "X" else "") +
27                     (if cross3 then "XX" else "")
28     mod
29 end apply

6.11.6 Exercises
1. Exploratory Data Analysis Revisited. For each predictor variable xj in the Example AutoMPG
dataset, determine the best power to raise that column to. Plot y and yp versus xj for SimpleRegression.
Compare this to the plot of y and yp versus xj for SymbolicRegression using the best power.

2. Combine all the best powers together to form a model matrix with the same number of columns as
the original AutoMPG matrix and compare SymbolicRegression with Regression on the original
matrix.

3. Use forward, backward and stepwise regression to look for a better (than the last exercise) combination
of features for the AutoMPG dataset.

4. Redo the last exercise using SymRidgeRegression. Note any differences.

5. Redo the last exercise using SymLassoRegression. Note any differences.

6. When, for example, quadratic terms are added to the expanded matrix, explain why it will not work to simply center (by subtracting the column means) the original data matrix X.

7. Compare the effectiveness of the following two search strategies that are used in Symbolic Regression:
(a) Genetic Algorithms and (b) FFX Algorithm.

8. Present a review of a paper that discusses how Symbolic Regression has been used to reproduce a
theory in a scientific discipline.

6.12 Transformed Regression
The TranRegression class supports transformed multiple linear regression and hence, the predictor vector x is multi-dimensional [1, x_1, ..., x_k]. In certain cases, the relationship between the response scalar y and the predictor vector x is not linear. There are many possible functional relationships that could apply [144], but five obvious choices are the following:

1. The response grows exponentially versus a linear combination of the predictor variables.

2. The response grows quadratically versus a linear combination of the predictor variables.

3. The response grows as the square root of a linear combination of the predictor variables.

4. The response grows logarithmically versus a linear combination of the predictor variables.

5. The response grows inversely (as the reciprocal) versus a linear combination of the predictor variables.

This capability can be easily implemented by introducing a transform (transformation function) into Regression. The transformation function and its inverse are passed into the TranRegression class, which extends the Regression class.
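The underlying idea can be sketched with plain Regression: transform the response, fit in the transformed space, and apply the inverse transform to predictions. TranRegression packages exactly these steps; the sketch below does not show TranRegression's own constructor, and the tiny dataset is made up for illustration.

import scala.math.{exp, log}
import scalation.mathstat.{MatrixD, VectorD}
import scalation.modeling.Regression

// Sketch of the idea behind TranRegression: fit tran(y) with ordinary Regression,
// then apply the inverse transform itran to predictions.
val x  = MatrixD ((4, 2), 1.0, 1.0,                            // intercept column and one predictor
                          1.0, 2.0,
                          1.0, 3.0,
                          1.0, 4.0)
val y  = VectorD (2.7, 7.4, 20.1, 54.6)                        // roughly exponential growth
val ty = y.map (log)                                           // transform the response (tran = log)
val rg = new Regression (x, ty)                                // fit in the transformed space
rg.trainNtest ()()
val yp = exp (rg.predict (VectorD (1.0, 5.0)))                 // inverse transform (itran = exp)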