CME 250:
Introduction to Machine Learning
Lecture 2:
Linear and Logistic Regression
Sherrie Wang
[email protected]
CME 250: Introduction to Machine Learning, Winter 2019 1
Slides are online at cme250.stanford.edu
Agenda
• Bias-variance trade-off
• Linear regression
- Simple linear regression
- What is a “good” fit?
- Multiple linear regression
- Variations on linear regression
• Logistic regression
- What is a “good” fit?
CME 250: Introduction to Machine Learning, Winter 2019 2
Recall: Types of Machine Learning
Do you have labeled data?
• Yes → Supervised learning
• No → Unsupervised learning
If supervised: what do you want to predict?
• A category → Classification
• A quantity → Regression
CME 250: Introduction to Machine Learning, Winter 2019 3
Bias and Variance
CME 250: Introduction to Machine Learning, Winter 2019 4
Assessing Model Performance
There are a number of metrics used to assess model performance on
supervised tasks (regression and classification).
Key point: We want to know how good predictions are when we
apply our method to previously unseen data.
Why? The ability to generalize to unseen data is what makes these
methods useful.
CME 250: Introduction to Machine Learning, Winter 2019 5
Datasets
Training data
• Observations used to learn the model
Validation data
• Observations used to estimate error for parameter-tuning or model
selection
Test data
• Observations used to measure performance on unseen data (how well the
model generalizes)
• Not available to the algorithm during any part of the learning process
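A minimal sketch of carving out these three sets with scikit-learn, assuming you already have arrays X and y; the 60/20/20 split sizes are arbitrary choices.
from sklearn.model_selection import train_test_split
# first split off the test set, then split the remainder into train and validation
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)
# Result: 60% training, 20% validation, 20% test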
CME 250: Introduction to Machine Learning, Winter 2019 6
Assessing Model Performance
We want a method that gives high test set performance, or low
test error.
What about high training set performance / low training set error?
There is no guarantee that the method with the highest training
performance will have the highest test performance.
CME 250: Introduction to Machine Learning, Winter 2019 7
Model Flexibility
Figure: training error and test error as a function of model flexibility (FIGURE 2.9, ISL, 8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 8
Model Flexibility
As model flexibility increases, training error will decrease.
The model can fit more and more of the variance in the training set.
Some of this variance, however, may be noise.
Therefore the test error may or may not decrease.
If training error is much smaller than test error, the model is overfitting.
CME 250: Introduction to Machine Learning, Winter 2019 9
Model Flexibility
Figure: training error and test error as a function of model flexibility, a second example (FIGURE 2.10, ISL, 8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 10
Bias and Variance
Expected test error = Variance + Bias² + Irreducible error (the variance of the noise)
Bias = error caused by simplifying assumptions built into the model
Variance = how much the learned function will change if trained on a
different training set
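A minimal sketch illustrating this decomposition by simulation; the "true" function, noise level, and test point below are assumed for illustration. Refitting the same model on many training sets lets us estimate its squared bias and variance at one point.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)            # assumed "true" function
x0, sigma = 0.5, 0.3                   # test point and noise standard deviation

preds = []
for _ in range(200):                   # 200 independent training sets
    x = rng.uniform(0, 3, size=30)
    y = f(x) + rng.normal(0, sigma, size=30)
    coef = np.polyfit(x, y, deg=1)     # fit an inflexible degree-1 polynomial
    preds.append(np.polyval(coef, x0)) # prediction at the test point

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2  # squared bias at x0
variance = preds.var()                 # variance of the fitted predictions at x0
expected_test_error = bias_sq + variance + sigma ** 2  # plus irreducible error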
CME 250: Introduction to Machine Learning, Winter 2019 11
Bias-Variance Trade-off
Generally, more flexible methods have more variance and less bias.
Less flexible methods have more bias and less variance.
The best method for a task will balance the two types of error to
achieve the lowest test error.
CME 250: Introduction to Machine Learning, Winter 2019 12
Bias-Variance Trade-off
Figure: bias-variance trade-off curves for Dataset #1, Dataset #2, and Dataset #3 (FIGURE 2.12, ISL, 8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 13
Supervised Algorithm #2:
Linear Regression
CME 250: Introduction to Machine Learning, Winter 2019 14
Linear Regression
Simple supervised learning method, used to predict quantitative
output values.
• Many machine learning methods are generalizations of linear
regression.
• Illustrates key concepts in supervised learning while maintaining
interpretability.
CME 250: Introduction to Machine Learning, Winter 2019 15
Simple Linear Regression
Predict a quantitative response Y on the basis of a single predictor
variable X.
Assumes there is an approximately linear relationship between X and Y:
Y ≈ β₀ + β₁X
β₀ and β₁ are the coefficients or parameters of the linear model. In this case they represent the intercept and slope terms of a line.
CME 250: Introduction to Machine Learning, Winter 2019 16
Simple Linear Regression
Estimate the βs using training data: (x(1), y(1)), (x(2), y(2)), …, (x(n), y(n))
Once the βs are estimated, we denote them using "hats": β̂₀, β̂₁.
For a particular realization of X, i.e. X = x, the predicted output is denoted "y-hat":
ŷ = β̂₀ + β̂₁x
Goal: Pick β̂₀, β̂₁ such that the model is a good fit to the training data.
CME 250: Introduction to Machine Learning, Winter 2019 17
Simple Linear Regression
Figure: simple linear regression fit to the Advertising dataset (FIGURE 3.1, ISL, 8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 18
Simple Linear Regression
Two related questions:
• How do we estimate the coefficients? (aka “fit the model”)
• What is a “good fit” to the data?
CME 250: Introduction to Machine Learning, Winter 2019 19
Least Squares
Typically, how well a linear model is fit to the data is measured using
least squares.
Residual for the i-th sample: e(i) = y(i) − ŷ(i)
Residual sum of squares (RSS): RSS = e(1)² + e(2)² + … + e(n)²
CME 250: Introduction to Machine Learning, Winter 2019 20
Least Squares
The model fit using least squares finds β̂₀, β̂₁ that minimize the RSS.
Recall that the extrema of a function can be found by setting its derivative to zero, and verified to be minima via the second derivative. This gives:
β̂₁ = Σᵢ (x(i) − x̄)(y(i) − ȳ) / Σᵢ (x(i) − x̄)²
β̂₀ = ȳ − β̂₁ x̄
where x̄ and ȳ are the sample means of x and y.
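A minimal sketch of these estimates in code, assuming x and y are 1-D numpy arrays of training data.
import numpy as np

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta0_hat = y_bar - beta1_hat * x_bar                                     # intercept
y_hat = beta0_hat + beta1_hat * x                                         # fitted values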
CME 250: Introduction to Machine Learning, Winter 2019 21
Least Squares in Pictures
RSS is the sum of the
squares of all vertical
gray lines.
As we vary the βs,
RSS changes. Least
squares finds βs that
minimize RSS.
FIGURE 3.1, ISL (8th printing 2017) FIGURE 3.2, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 22
How good is the model fit?
Linear regression is typically assessed using two related metrics:
• Residual standard error (RSE): RSE = sqrt(RSS / (n − 2))
  Higher RSE means a worse fit.
• R² statistic: R² = 1 − RSS/TSS, where TSS = Σᵢ (y(i) − ȳ)² is the "total sum of squares"
  Higher R² means a better fit.
R² measures the proportion of variability in Y explained by X.
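A minimal sketch of both metrics, assuming arrays y and y_hat (e.g. from the fit sketched earlier).
import numpy as np

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rse = np.sqrt(rss / (len(y) - 2))     # residual standard error (simple regression)
r2 = 1 - rss / tss                    # proportion of variance explained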
CME 250: Introduction to Machine Learning, Winter 2019 24
What range of values can R² take?
CME 250: Introduction to Machine Learning, Winter 2019 25
What range of values can R² take?
For a least squares fit (with an intercept) evaluated on the training data, R² always lies between 0 and 1: R² = 1 means X explains all of the variability in Y, and R² = 0 means it explains none.
CME 250: Introduction to Machine Learning, Winter 2019 26
How good is the model fit?
What values of RSE and R² are "good"?
Depends on the domain and application.
• In physics, we may know the data comes from a linear model. In such a case, we'd require R² close to 1 in order to call the fit good.
• In biology, social sciences, and other domains, a linear model may be a crude approximation. Existing models may also not be good at estimating Y, so a model with R² of 0.4 may be considered good.
CME 250: Introduction to Machine Learning, Winter 2019 27
Multiple Linear Regression
What if our dataset contains multiple input dimensions Xj?
We could run a simple linear regression for each input dimension.
Is this satisfactory?
CME 250: Introduction to Machine Learning, Winter 2019 28
Multiple Linear Regression
Predict the response variable using more than one predictor variable:
Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
Here, Xⱼ represents the j-th predictor.
We interpret βⱼ as the average effect on Y of a one-unit increase in Xⱼ, holding all other predictors fixed.
CME 250: Introduction to Machine Learning, Winter 2019 29
Multiple Linear Regression
Tables: coefficient estimates from 3 simple regressions vs. 1 multiple regression
CME 250: Introduction to Machine Learning, Winter 2019 30
Multiple Linear Regression
Tables: coefficient estimates from 3 simple regressions vs. 1 multiple regression, plus the correlations among the variables
CME 250: Introduction to Machine Learning, Winter 2019 31
Multiple Linear Regression
FIGURE 3.4, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 32
Multiple Linear Regression
Another way to write Y ≈ β₀ + β₁X₁ + … + βₚXₚ is to use matrix notation.
CME 250: Introduction to Machine Learning, Winter 2019 33
Multiple Linear Regression
Define: y as the n × 1 vector of responses, X as the n × (p + 1) matrix whose i-th row is (1, x₁(i), …, xₚ(i)), and β as the (p + 1) × 1 vector of coefficients (β₀, β₁, …, βₚ).
Then: y ≈ Xβ
CME 250: Introduction to Machine Learning, Winter 2019 34
Least Squares
The model fit using least squares finds β̂ that minimizes the RSS.
Notation: RSS(β) = ||y − Xβ||², the square of the L2-norm of the residual vector y − Xβ.
The normal equations give us the analytical solution for β̂:
β̂ = (XᵀX)⁻¹ Xᵀ y
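A minimal sketch of solving the normal equations with numpy, assuming X is an n × p array of predictors and y a length-n array; solving the linear system is preferred over forming the inverse explicitly.
import numpy as np

X1 = np.column_stack([np.ones(len(X)), X])        # prepend a column of 1s for the intercept
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)   # solves (X'X) beta = X'y
y_hat = X1 @ beta_hat                             # fitted values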
CME 250: Introduction to Machine Learning, Winter 2019 35
How good is the model fit?
Multiple linear regression can also be assessed using the 2 metrics
previously discussed.
• Residual standard error (RSE)
• R² statistic
CME 250: Introduction to Machine Learning, Winter 2019 36
Qualitative Inputs
How do we include qualitative inputs in regression?
• E.g. predictor is “gender”: “male” or “female”
If there are only two possible values, we can create a single dummy variable, e.g. X = 1 if "female" and X = 0 if "male".
CME 250: Introduction to Machine Learning, Winter 2019 37
Qualitative Inputs
If there are more than two possible values, introduce multiple dummy variables
• E.g. predictor is "eye color": "blue", "green", or "brown". Use X₁ = 1 if "blue" (else 0) and X₂ = 1 if "green" (else 0); "brown" is the baseline with X₁ = X₂ = 0. A pandas sketch follows below.
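A minimal sketch of dummy coding with pandas; the column name "eye_color" and the toy values are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({'eye_color': ['blue', 'green', 'brown', 'blue']})
dummies = pd.get_dummies(df['eye_color'], drop_first=True)  # drops one level to serve as the baseline
X = pd.concat([df, dummies], axis=1)                        # dummy columns alongside the original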
CME 250: Introduction to Machine Learning, Winter 2019 38
Potential Problems
• Non-linearity of response-predictor relationships
• Non-additivity of predictors
• Non-constant variance of error terms
• Outliers can distort the fit and the performance metrics
• High-leverage points overly influence βs
• Correlation of error terms
• Collinearity
CME 250: Introduction to Machine Learning, Winter 2019 39
Non-linearity of Data
Residual plots can help
diagnose if this is an issue.
Plot residuals vs. fitted
values.
If there is a pattern in the residual plot, then the linearity assumption is suspect.
FIGURE 3.11, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 40
Non-linearity of Data
A simple way to extend linear regression to model non-linear relationships is via polynomial regression:
Y ≈ β₀ + β₁X + β₂X² + … + β_d X^d
This is still a linear model (linear in the coefficients β); it can be fit via least squares.
FIGURE 3.11, ISL (8th printing 2017)
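A minimal sketch of polynomial regression in scikit-learn, assuming a 2-D array X with one column and a response array y: expand X into polynomial features, then fit an ordinary linear model.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=3, include_bias=False)  # builds X, X^2, X^3
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)                # still ordinary least squares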
CME 250: Introduction to Machine Learning, Winter 2019 41
Non-additivity of Predictors
Additivity means each predictor Xj affects Y independently of the value
of other predictors.
If this is not true, we can extend linear regression by including interaction effects, e.g. Y ≈ β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ (see the sketch below).
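A minimal sketch of adding an interaction term before fitting, assuming a DataFrame named data with hypothetical columns 'x1', 'x2', and 'y'.
from sklearn.linear_model import LinearRegression

data['x1_x2'] = data['x1'] * data['x2']   # interaction term X1 * X2
model = LinearRegression().fit(data[['x1', 'x2', 'x1_x2']], data['y'])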
CME 250: Introduction to Machine Learning, Winter 2019 42
Non-constant Variance of Error
Can be seen in a funnel
shape in the residual plot.
Standard errors, confidence
intervals, and hypothesis tests
rely on constant variance
assumption.
One possible solution is to transform the response Y using a concave function (log, square root).
FIGURE 3.1, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 43
Outliers and High-leverage Points
Outliers have unusual Y values, and can inflate RSE and deflate R².
High-leverage points have unusual values for the predictors X, and can substantially influence the least squares line.
CME 250: Introduction to Machine Learning, Winter 2019 44
Other Questions
Not addressed in this course, but important to consider:
• Does the data contain evidence of a relationship between X and Y?
• Are the estimated coefficients close to the true ones?
Future lecture:
• If we have many predictors Xj, which one(s) do we include in the
model?
CME 250: Introduction to Machine Learning, Winter 2019 45
Linear Regression Summary
Advantages:
• Simple model
• Interpretable coefficients
• Can obtain good results with small datasets
Disadvantages:
• Model may be too simple to make accurate predictions over a large range of values
• Sensitive to outliers in data due to minimization of squared error
CME 250: Introduction to Machine Learning, Winter 2019 46
Linear Regression in sklearn
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X, y)              # estimate coefficients by least squares
linreg.coef_                  # coefficients of the inputs
linreg.intercept_             # model intercept
y_hat = linreg.predict(X)     # predict from the inputs X, not from y
CME 250: Introduction to Machine Learning, Winter 2019 47
Assessment Metrics in sklearn
from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y_test, y_pred)               # R² on held-out data
mse = mean_squared_error(y_test, y_pred)    # mean squared error on held-out data
CME 250: Introduction to Machine Learning, Winter 2019 48
Supervised Algorithm #3:
Logistic Regression
CME 250: Introduction to Machine Learning, Winter 2019 49
Classification
Classification methods are used to
predict qualitative output values.
• Assign each observation to a class /
category
• e.g. K-nearest neighbor classifier from
Lecture 1
FIGURE 4.1, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 50
Logistic Regression
Binary classification: Y takes on two values, 0 or 1, corresponding to 2
classes.
Logistic regression models binary classification by estimating the probability p(X) = P(Y = 1 | X)
• Set a threshold on p(X) to obtain class decisions
• Extension of linear regression that produces valid probabilities in [0, 1]
CME 250: Introduction to Machine Learning, Winter 2019 51
Linear regression?
Why don't we use the linear expression p(X) = β₀ + β₁X to model the probability? (It can produce values below 0 or above 1.)
CME 250: Introduction to Machine Learning, Winter 2019 52
Logistic Regression
The logistic (sigmoid) function bounds the output: σ(z) = 1 / (1 + e^(−z))
Logistic function
• S-shaped curve
• Always takes values in (0, 1), which are valid probabilities
Logistic regression model: p(X) = e^(β₀ + β₁X) / (1 + e^(β₀ + β₁X))
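A minimal sketch of the logistic function and the resulting model probabilities; beta0 and beta1 here are placeholder coefficient values, not fitted ones.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real number into (0, 1)

beta0, beta1 = -1.0, 2.0              # placeholder coefficients
x = np.linspace(-5, 5, 11)
p = sigmoid(beta0 + beta1 * x)        # p(X) always lies in (0, 1)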
CME 250: Introduction to Machine Learning, Winter 2019 53
Logistic Regression
https://www.saedsayad.com/logistic_regression.htm
CME 250: Introduction to Machine Learning, Winter 2019 54
Logistic Regression
Rearranging the model gives the odds: p(X) / (1 − p(X)) = e^(β₀ + β₁X)
Taking the log gives the log-odds (logit), which is linear in X: log[ p(X) / (1 − p(X)) ] = β₀ + β₁X
CME 250: Introduction to Machine Learning, Winter 2019 55
Logistic Regression
Threshold: e.g. predict Y = 1 when p(X) ≥ 0.5 and Y = 0 otherwise; the threshold can be moved to trade off the two types of error.
CME 250: Introduction to Machine Learning, Winter 2019 58
Logistic Regression
FIGURE 4.2, ISL (8th printing 2017)
CME 250: Introduction to Machine Learning, Winter 2019 59
Estimating Coefficients (β)
Recall that for linear regression, βs are estimated using least squares
on the training data.
For logistic regression, there is no closed form solution for βs
obtainable by taking the derivative and setting to zero.
Instead, βs are estimated using maximum likelihood estimation.
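A minimal sketch of maximum likelihood estimation via gradient descent on the average negative log-likelihood, assuming X is an n × p array of predictors and y an array of 0/1 labels; this is illustrative only, not how sklearn's solver works internally.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for the intercept
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)                   # predicted probabilities
        grad = X1.T @ (p - y) / len(y)           # gradient of the average negative log-likelihood
        beta -= lr * grad                        # gradient descent step
    return beta                                  # beta[0] is the intercept, beta[1:] the slopes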
CME 250: Introduction to Machine Learning, Winter 2019 60
Multiple Logistic Regression
We can extend logistic regression to the case of multiple predictor variables:
p(X) = e^(β₀ + β₁X₁ + … + βₚXₚ) / (1 + e^(β₀ + β₁X₁ + … + βₚXₚ))
CME 250: Introduction to Machine Learning, Winter 2019 61
Multiple Logistic Regression
We can extend logistic regression to the case of multiple predictor
variables.
https://florianhartl.com/logistic-regression-geometric-intuition.html
CME 250: Introduction to Machine Learning, Winter 2019 62
How good is the model fit?
Classifier performance can be summarized in a table called the confusion matrix.

                         Predicted class
                         0                       1
True class     0         True Positive (TP)      False Negative (FN)
               1         False Positive (FP)     True Negative (TN)

• "Good performance" is when TP and TN are large and FP and FN are small
• Can be computed for training, validation, and test sets. The test set informs you about model generalizability.
CME 250: Introduction to Machine Learning, Winter 2019 63
How good is the model fit?
Some common metrics to assess performance (TP, FN, FP, TN as in the confusion matrix on the previous slide):
• Accuracy: (TP + TN) / n
• Recall: TP / (TP + FN)
• Precision: TP / (TP + FP)
• Specificity: TN / (FP + TN)
• False positive rate: FP / (FP + TN)
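A minimal sketch computing these metrics directly from confusion-matrix counts; the counts below are made-up example values.
tp, fn, fp, tn = 50, 10, 5, 35          # example counts; replace with your own
n = tp + fn + fp + tn
accuracy = (tp + tn) / n
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (fp + tn)
false_positive_rate = fp / (fp + tn)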
CME 250: Introduction to Machine Learning, Winter 2019 64
How good is the model fit?
The ROC (receiver operating characteristic) curve summarizes the trade-off between recall (true positive rate) and false positive rate. The area under the curve (AUC) summarizes this trade-off in a single number.
Many classifiers have a “knob” or
threshold that can be adjusted to make
the classifier more or less conservative
in predicting Y = 1.
Trade-off: more true positives (TP) → more false positives (FP)
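A minimal sketch of an ROC curve and AUC with scikit-learn, assuming y_test holds the true labels and y_score holds predicted probabilities for class 1 (e.g. logreg.predict_proba(X_test)[:, 1]).
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, y_score)               # area under the ROC curve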
CME 250: Introduction to Machine Learning, Winter 2019 65
Logistic Regression Summary
Advantages:
• Extension of linear regression, simple
• Interpretable: log-odds are linear in predictors
• No hyperparameters to tune
Disadvantages:
• Cannot model complex decision boundaries
CME 250: Introduction to Machine Learning, Winter 2019 66
Dataset Splitting in sklearn
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('dataset.csv')
X = data[['input']]    # double brackets keep X two-dimensional, as sklearn expects
y = data['output']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
CME 250: Introduction to Machine Learning, Winter 2019 67
Logistic Regression in sklearn
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
CME 250: Introduction to Machine Learning, Winter 2019 68
Assessment Metrics in sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
CME 250: Introduction to Machine Learning, Winter 2019 69