M02 Regression

The document outlines the INFO6105 Data Science Engineering course for Fall 2024, covering various modules including Linear Algebra, Neural Networks, and Ensemble Learning. It details the concepts of supervised learning, linear regression, gradient descent, and the differences between machine learning and statistics. Additionally, it discusses regularization techniques like Lasso and Ridge regression, emphasizing their applications in model accuracy and feature selection.


INFO6105

Data Science Engineering


Methods and Tools
Fall 2024
Email: [Link]@[Link]
Overview
• Module 1 — Linear Algebra, Probability and Statistics
• Module 2 — Linear Regression, Gradient Descent
• Module 3 — Neural Networks
• Module 4 — Decision Trees
• Module 5 — Ensemble Learning
• Module 6 — Instance Based Learning
• Module 7 — Kernel Methods and SVMs
Module 2: Linear Regression, Gradient Descent

1. Supervised Learning & AI
2. Linear Regression
3. Gradient Descent
Artificial Intelligence (AI)

[Link]
Types of machine learning tasks
• Supervised: correct output known for each training example
– Learn to predict output when given an input vector
• Classification: 1-of-N output (speech recognition, object recognition, medical diagnosis)
• Regression: real-valued output (predicting market prices, customer rating)
• Unsupervised learning
– Create an internal representation of the input, capturing regularities /
structure in data
– Examples: form clusters; extract features
• How do we know if a representation is good?
• Reinforcement learning
– Learn action to maximize payoff
– Not much information in a payoff signal
– Payoff is often delayed
[Link]
Many learning algorithms for different tasks
1. Classification: Determine which discrete category the example is
2. Recognizing patterns: speech recognition, facial identity, etc.
3. Recommender systems: noisy data, commercial pay-off (e.g., Amazon, Netflix).
4. Information retrieval: Find documents or images with similar content
5. Computer vision: detection, segmentation, depth estimation, optical flow, etc
6. Robotics: perception, planning, etc
7. Learning to play games
8. Recognizing anomalies: Unusual sequences of credit card transactions, panic
situation at an airport
9. Spam filtering, fraud detection: The enemy adapts so we must adapt too
10. Many more!

[Link]
Classification

[Link]
Recognizing patterns

[Link]
Recommender Systems
Information retrieval
Computer Vision

[Link]
Robotics

[Link]
Play video games

[Link]
Play games: Alpha Go [Link]

[Link]
Machine Learning vs Data Mining
• Data mining: typically applies very simple machine learning techniques
to very large databases, because computers are too slow to do
anything more interesting with ten billion examples
• Previously used in a negative sense
• a misguided statistical procedure of looking for all kinds of relationships in the
data until one is finally found
• Now the lines are blurred: many ML problems involve tons of data
• But problems with an AI flavor (e.g., recognition, robot navigation) are still
the domain of ML

[Link]
Machine Learning vs Statistics
• ML uses statistical theory to build models
• A lot of ML is rediscovery of things statisticians already knew; often
disguised by differences in terminology
• But the emphasis is very different:
• A good piece of statistics: a clever proof that a relatively simple estimation procedure is
asymptotically unbiased.
• A good piece of ML: a demo that a complicated algorithm produces impressive results
on a specific task.
• ML can be viewed as applying computational techniques to statistical problems,
but it goes beyond typical statistics problems, with different aims.

[Link]
Why Machine Learning
• The essence of machine learning:
– We are sure there is a pattern.
– We can’t pin down exactly the equations for the pattern.
– We have data.
Recognize my Mom among all humans:
- I am sure I can recognize my mom by how she looks.
- I can’t describe exactly the facial patterns to you.
- I have lots of photos of my mom’s face.
(Credit card approval follows the same structure.)
Recommend a movie I like:
- How others like this movie and how you like other movies have a pattern.
- I can’t describe exactly this pattern.
- I have lots of movie-watching data.
Supervised Learning
• The machine learning task of learning a function that maps an input to an
output based on example input-output pairs.
• Classification
• Regression

Loan/Credit card approval:

Data: input x1, x2, x3, … and output y.
Supervised learning is to learn a function f that maps x to y.
Supervised Learning
• Linear regression
• Logistic regression
• Decision trees
• Support-vector machines
• K-nearest neighbor algorithm
• Neural networks (Multilayer
perceptron)
•…

[Link]
Covariance, Correlation and Regression Coefficient
• Covariance measures how two random variables move together around their
own means.
• Variance is a special case of covariance (the covariance of a variable with itself).
Correlation
• Correlation is the standardized value of covariance.
• The standardized values are bounded between -1 and 1.
• If X and Y move in the same direction, the correlation is positive.
• If X and Y move in opposite directions, the correlation is negative.
• No correlation: Cor(X, Y) = 0.
Pearson correlation coefficient of x and y

[Link]
Regression Coefficient
Exercise:
• X = [1, 2, 3, 4, 5, 6]
• Y = [1.8, 3.2, 4.1, 6.5, 5.8, 7.2]
• Calculate the slope for the regression of Y on X, using the formula below.
• What about X on Y?
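The regression-coefficient formula this exercise relies on did not survive extraction; the standard least-squares slope of Y on X is

$$\hat{\beta}_{Y \mid X} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = r_{xy}\,\frac{s_Y}{s_X}$$

and swapping the roles of the variables gives the slope for X on Y, Cov(X,Y)/Var(Y). With the data above, Cov(X,Y)/Var(X) = 3.72/3.5 ≈ 1.063, which the hands-on code below reproduces.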
Hands-on

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1.8, 3.2, 4.1, 6.5, 5.8, 7.2])

cov_matrix = np.cov(x, y)
print(cov_matrix)
# array([[3.5       , 3.72      ],
#        [3.72      , 4.33866667]])

reg_coeff = cov_matrix[0, 1] / cov_matrix[0, 0]   # slope = Cov(x, y) / Var(x)
z = np.polyfit(x, y, 1)                           # least-squares line fit

# reg_coeff -> 1.062857142857143
# z         -> array([1.06285714, 1.04666667])    (slope, intercept)

import matplotlib.pyplot as plt
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=40)
plt.plot(X, y, 'o')
plt.ylim(-3, 3)
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()
Hands-on

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))
Linear Regression Hypothesis Testing
• X and Y are random variables with unknown joint distribution.
• The estimated coefficient β̂ is a random variable, normally distributed or t-distributed
with df = n − k − 1
• k is the number of predictor variables; for simple linear regression, k = 1.
Least Squares

from sklearn import datasets
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)      # add an intercept column
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
Linear Regression
Cost (Loss) function

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# add a column of ones so theta includes the intercept
X_train2 = np.c_[np.ones(len(X_train)), X_train]

# normal equation
theta = np.linalg.inv(X_train2.T @ X_train2) @ X_train2.T @ y_train

# theta -> array([-0.03180434, 0.39390555])
# which matches sklearn's fit:
# lr.intercept_: -0.031804343026759746
# lr.coef_:      [0.39390555]
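The cost function referenced on this slide is missing from the extracted text; for linear regression the usual least-squares cost and its closed-form minimizer (the normal equation computed above) are

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta^{\top}x^{(i)} - y^{(i)}\right)^2, \qquad \hat{\theta} = (X^{\top}X)^{-1}X^{\top}y$$

(some texts drop the 1/2m factor, which does not change the minimizer).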
Gradient Descent
Gradient Descent vs Stochastic Gradient Descent
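The update rules for these slides did not survive extraction. As a minimal sketch assuming only NumPy (the function names below are my own, not the course's reference code), batch gradient descent updates θ using the full training set at each step, while stochastic gradient descent updates after each example:

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize the least-squares cost with full-batch gradient steps.
    X is assumed to already contain a leading column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of (1/2m)||X theta - y||^2
        theta -= lr * grad
    return theta

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=50, seed=0):
    """Same objective, but update theta one randomly chosen example at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= lr * grad
    return theta

# Example: both should approach the normal-equation solution computed earlier,
# e.g. theta_gd = batch_gradient_descent(X_train2, y_train)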
Linear regression
Order of the Polynomial
• k=0 → constant
• k=1 → line
• k=2 → parabola
• …
• k=8?

[Figure: polynomial fits of increasing order k (k=0, 1, 2, 8) to the same data, e.g. cost vs. size]

Underfitting vs Overfitting
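A small sketch of the idea, assuming NumPy and matplotlib (my own illustration, not from the slides): fitting polynomials of increasing order k to the same noisy data shows low orders underfitting and high orders overfitting.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy data

xs = np.linspace(0, 1, 200)
for k in [0, 1, 2, 8]:
    coeffs = np.polyfit(x, y, k)              # least-squares polynomial of order k
    plt.plot(xs, np.polyval(coeffs, xs), label=f"k={k}")

plt.plot(x, y, 'o', color='black')
plt.legend()
plt.show()
# k=0 and k=1 underfit (too simple); k=8 overfits (chases the noise).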
Cross validation

[Figure: training error vs. testing error as a function of model complexity]
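A minimal cross-validation sketch using scikit-learn's cross_val_score (my addition; the slide itself only shows the error curves):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Cross-validation scores:", scores)
print("Mean R^2:", scores.mean())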
Bias vs Variance

[Link]
Regression: statistics or machine learning
• Machine learning: learn a function
• Statistics: learn a distribution
Likelihood
• Maximize the log likelihood:
What does it mean?
• Least-squares regression corresponds to finding the maximum likelihood
estimate of θ.
• The target y is assumed to follow a Gaussian distribution around the model's prediction.

[Link]
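The likelihood expression on the slide is missing from the extracted text; the standard argument it summarizes assumes y⁽ⁱ⁾ = θᵀx⁽ⁱ⁾ + ε⁽ⁱ⁾ with ε⁽ⁱ⁾ ~ N(0, σ²), so that

$$\ell(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^{\top}x^{(i)})^2}{2\sigma^2}\right) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^{\top}x^{(i)}\right)^2$$

Maximizing ℓ(θ) is therefore the same as minimizing the sum of squared errors, i.e., least squares.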
Logistic Regression
• This is Classification.
• Y is usually called “Label”.

Logistic or Sigmoid Function

[Link]
Logistic Function
• Maximize the log likelihood:
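The formulas for these two slides did not survive extraction; the standard forms they refer to are the sigmoid (logistic) function and the Bernoulli log likelihood:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = \sigma(\theta^{\top}x)$$

$$\ell(\theta) = \sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$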
Regularization
• Simple Linear Regression

• Minimize the residual sum of squares (RSS; see the definition below)

• Regularization is to avoid overfitting by penalizing high-valued regression
coefficients.
[Link]
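The RSS expression on the slide is missing from the extracted text; the standard definition is:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$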
Lasso: Least Absolute Shrinkage and Selection Operator
• Lasso performs L1 regularization

• λ is the tuning parameter for the L1 penalty (see the penalized objective below).

• λ = 0 – no coefficients are eliminated; the result is the same as least-squares
regression
• λ = ∞ – all coefficients are eliminated
• As λ increases, bias increases
• As λ decreases, variance increases
• L1 can shrink coefficients exactly to zero and so eliminate variables.
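The Lasso objective on this slide is missing from the extracted text; the standard L1-penalized least-squares form is:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$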
Ridge regression
• Implements L2 regularization

• λ is the tuning parameter for the L2 penalty (see the penalized objective below).

• λ = 0 – no shrinkage; the result is the same as least-squares regression
• λ = ∞ – all coefficients are shrunk and approach zero
• As λ increases, bias increases
• As λ decreases, variance increases
• All coefficients are shrunk toward zero, but none are eliminated.
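The Ridge objective on this slide is likewise missing from the extracted text; the standard L2-penalized form is:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$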
Lasso vs. Ridge Regression
• The Lasso method produces a simpler and more interpretable model.
• Lasso can shrink coefficients to zero, performing feature selection and
eliminating variables of no value.
• Lasso tends to be more accurate than Ridge when only a few features truly
matter; Ridge tends to do better when many features have small, comparable
effects.
• Cross-validation should be used to choose λ.
Lasso exercise

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))     # 0.95
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))           # 0.61

from sklearn.linear_model import Lasso
import numpy as np

lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))  # 0.29
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))        # 0.21
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))      # 4
Lasso exercise

lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso001.coef_ != 0)))

lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

import matplotlib.pyplot as plt

plt.plot(lasso.coef_, 's', label="Lasso alpha=1")
plt.plot(lasso001.coef_, '^', label="Lasso alpha=0.01")
plt.plot(lasso00001.coef_, 'v', label="Lasso alpha=0.0001")
plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.show()
Values of Lambda
• Through cross-validation, we can determine the value of λ that minimizes the
out-of-sample loss.

[Figure: number of non-zero coefficients and cross-validated loss plotted against log(λ), with λmin and λmax marked]

[Link]
[Link]/en/latest/glmnet_vignette.html
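In scikit-learn (rather than glmnet, which the figure comes from), the same λ selection can be done with LassoCV; a minimal sketch, assuming X_train, X_test, y_train, y_test from the Lasso exercise above and noting that scikit-learn calls the regularization strength alpha:

from sklearn.linear_model import LassoCV
import numpy as np

# Search a log-spaced grid of alphas with 5-fold cross-validation.
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=100000)
lasso_cv.fit(X_train, y_train)

print("Best alpha:", lasso_cv.alpha_)
print("Test set score: {:.2f}".format(lasso_cv.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso_cv.coef_ != 0))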
Generalized Linear Models
• Regression
• Classification
• The exponential family:
  η: the natural parameter
  T(y): the sufficient statistic
  a(η): the log partition function

McCullagh and Nelder, Generalized Linear Models (2nd ed.)
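The exponential-family density itself is not in the extracted text; the standard form that the three symbols above come from is:

$$p(y;\eta) = b(y)\,\exp\!\left(\eta^{\top}T(y) - a(\eta)\right)$$

Linear regression (Gaussian) and logistic regression (Bernoulli) are both special cases, which is what makes them generalized linear models.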


Summary
• Relationship between AI, ML, and Supervised Learning.
• Linear regression from both a probability and a machine learning perspective.
• Gradient descent: one method for learning the function's parameters.
• "Linear" means linear in the coefficients.
• Bias vs variance.
• Regularization.
• Generalized linear models.
