INFO6105
Data Science Engineering
Methods and Tools
Fall 2024
Email: [Link]@[Link]
Overview
• Module 1 — Linear Algebra, Probability and Statistics
• Module 2 — Linear Regression, Gradient Descent
• Module 3 — Neural Networks
• Module 4 — Decision Trees
• Module 5 — Ensemble Learning
• Module 6 — Instance Based Learning
• Module 7 — Kernel Methods and SVMs
Module 2: Linear Regression, Gradient Descent
1. Supervised Learning & AI
2. Linear Regression
3. Gradient Descent
Artificial Intelligence (AI)
[Link]
Types of machine learning tasks
• Supervised: correct output known for each training example
– Learn to predict output when given an input vector
• Classification: 1-of-N output (speech recognition, object recognition, medical diagnosis)
• Regression: real-valued output (predicting market prices, customer rating)
• Unsupervised learning
– Create an internal representation of the input, capturing regularities /
structure in data
– Examples: form clusters; extract features
• How do we know if a representation is good?
• Reinforcement learning
– Learn action to maximize payoff
– Not much information in a payoff signal
– Payoff is often delayed
[Link]
Many learning algorithms for different tasks
1. Classification: Determine which discrete category the example is
2. Recognizing patterns: speech recognition, facial identity, etc.
3. Recommender Systems: noisy data, commercial pay-off (e.g., Amazon, Netflix).
4. Information retrieval: Find documents or images with similar content
5. Computer vision: detection, segmentation, depth estimation, optical flow, etc
6. Robotics: perception, planning, etc
7. Learning to play games
8. Recognizing anomalies: Unusual sequences of credit card transactions, panic
situation at an airport
9. Spam filtering, fraud detection: The enemy adapts so we must adapt too
10. Many more!
[Link]
Classification
[Link]
Recognizing patterns
[Link]
Recommender Systems
Information retrieval
Computer Vision
[Link]
Robotics
[Link]
Play video games
[Link]
Play games: AlphaGo [Link]
[Link]
Machine Learning vs Data Mining
• Data-mining: Typically using very simple machine learning techniques
on very large databases because computers are too slow to do
anything more interesting with ten billion examples
• Previously used in a negative sense
• A misguided statistical procedure of looking for all kinds of relationships in the data until one is finally found
• Now lines are blurred: many ML problems involve tons of data
• But problems with AI flavor (e.g., recognition, robot navigation) still
domain of ML
[Link]
Machine Learning vs Statistics
• ML uses statistical theory to build models
• A lot of ML is rediscovery of things statisticians already knew; often
disguised by differences in terminology
• But the emphasis is very different:
• Good piece of statistics: Clever proof that relatively simple estimation procedure is
asymptotically unbiased.
• Good piece of ML: Demo that a complicated algorithm produces impressive results
on a specific task.
• ML can be viewed as applying computational techniques to statistical problems, but it goes beyond typical statistics problems, with different aims
[Link]
Why Machine Learning
• The essence of machine learning:
– We are sure there is a pattern.
– We can’t pin down exactly the equations for the pattern.
– We have data.
Recognize my Mom among all humans (similarly: credit card approval?):
- I am sure I can recognize my mom by how she looks.
- I can’t describe exactly the facial patterns to you.
- I have lots of photos of my mom’s face.
Recommend a movie I like:
- How others like this movie and how you like other movies
have a pattern.
- I can’t describe exactly this pattern.
- I have lots of movie watching data.
Supervised Learning
• The machine learning task of
learning a function that maps
an input to an output based
on example input-output
pairs.
• Classification
• Regression
Loan/credit card approval:
Data: inputs x1, x2, x3, …; output y
Supervised learning (SL) is to learn a function f that maps x to y.
Supervised Learning
• Linear regression
• Logistic regression
• Decision trees
• Support-vector machines
• K-nearest neighbor algorithm
• Neural networks (Multilayer
perceptron)
•…
[Link]
Covariance, Correlation and Regression
Coefficient
• Covariance measures how two random variables move together around their own means.
• Variance is a special case of covariance: the covariance of a variable with itself.
Correlation
• Correlation is the standardized
value of covariance.
• The standardized values are
bound between -1 and 1.
• X and Y move in the same
direction, correlation is positive.
• X and Y move in opposite
directions, correlation is negative.
• No correlation: Cor(X, Y) = 0.
Pearson correlation coefficient of x and y
[Link]
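For reference, the standard sample formulas behind these slides (textbook definitions, restated here):
$$\mathrm{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),\qquad \mathrm{Var}(X)=\mathrm{Cov}(X,X)$$
$$r_{xy}=\frac{\mathrm{Cov}(X,Y)}{s_X\,s_Y},\qquad \hat{\beta}_{Y\text{ on }X}=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}=r_{xy}\,\frac{s_Y}{s_X}$$
where $s_X$ and $s_Y$ are the sample standard deviations.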
Regression Coefficient
Exercise:
• X=[1,2,3,4,5,6]
• Y=[1.8,3.2,4.1,6.5,5.8,7.2]
• Calculate the slope for the regression model of Y on X: Use the
above formulae.
• X on Y?
Hands-on
import matplotlib.pyplot as plt
import numpy as np
import mglearn

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1.8, 3.2, 4.1, 6.5, 5.8, 7.2])

cov_matrix = np.cov(x, y)
print(cov_matrix)
# array([[3.5       , 3.72      ],
#        [3.72      , 4.33866667]])

reg_coeff = cov_matrix[0, 1] / cov_matrix[0, 0]   # slope of Y on X
z = np.polyfit(x, y, 1)                           # least-squares fit: (slope, intercept)
# reg_coeff -> 1.062857142857143
# z         -> array([1.06285714, 1.04666667])

X, y = mglearn.datasets.make_wave(n_samples=40)
plt.plot(X, y, 'o')
plt.ylim(-3, 3)
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()
Hands-on
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))
Linear Regression Hypothesis Testing
• X and Y are random variables with unknown joint distribution.
• β̂ (the estimated coefficient) is a random variable, normally distributed or t-distributed with df = n − k − 1
• k is the number of predictor variables (here k = 1 for simple linear regression)
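The usual test of H0: β1 = 0 uses the statistic (standard form, stated here for reference):
$$t=\frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}\ \sim\ t_{\,n-k-1}\ \text{under } H_0$$
The p-values printed by statsmodels in the next hands-on come from this kind of test.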
Least Squares
from sklearn import datasets
import statsmodels.api as sm
from scipy import stats

# Fit ordinary least squares with statsmodels to get coefficient tests
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)   # add the intercept column
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())     # coefficients, standard errors, t statistics, p-values
Linear Regression
Cost (Loss) function
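A common form of the squared-error cost (standard notation; m is the number of training examples, assumed here):
$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\big(\theta^{\top}x^{(i)}-y^{(i)}\big)^{2}$$
Setting the gradient to zero gives the normal equation $\theta=(X^{\top}X)^{-1}X^{\top}y$, which is what the code below computes.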
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Normal-equation solution: prepend a bias column, then solve for theta directly
X_train2 = np.c_[np.ones(len(X_train)), X_train]
theta = np.linalg.inv(X_train2.T @ X_train2) @ X_train2.T @ y_train
# theta -> array([-0.03180434, 0.39390555])
# matches the earlier scikit-learn fit:
# lr.intercept_: -0.031804343026759746
# lr.coef_:      [0.39390555]
Gradient descent
Gradient descent vs stochastic gradient descent
Linear regression
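A minimal sketch of batch gradient descent and a stochastic variant for this linear regression, using the same make_wave data as the earlier hands-on; the learning rates and iteration counts are illustrative choices, not values from the slides.

import numpy as np
import mglearn
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Xb = np.c_[np.ones(len(X_train)), X_train]        # prepend a bias column

# Batch gradient descent on the mean squared-error cost
theta = np.zeros(Xb.shape[1])
learning_rate = 0.1                               # illustrative value
for _ in range(2000):
    gradient = Xb.T @ (Xb @ theta - y_train) / len(y_train)
    theta -= learning_rate * gradient
print("batch GD:", theta)                         # converges toward the normal-equation solution [-0.0318, 0.3939]

# Stochastic gradient descent: update on one randomly chosen example at a time
rng = np.random.default_rng(0)
theta_sgd = np.zeros(Xb.shape[1])
for epoch in range(100):
    for i in rng.permutation(len(y_train)):
        grad_i = Xb[i] * (Xb[i] @ theta_sgd - y_train[i])
        theta_sgd -= 0.01 * grad_i
print("SGD:", theta_sgd)                          # noisier updates, but ends up near the same solution

Batch GD uses all training examples per update; SGD uses one example per update, which is noisier but cheaper per step and scales better to large datasets.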
Order of the Polynomial
• k = 0 → constant
• k = 1 → line
• k = 2 → parabola
• …
• k = 8?
[Figure: example polynomial fits for k = 0, 1, 2 and the resulting cost]
Underfitting vs Overfitting
[Figure: model fit vs. size, illustrating underfitting and overfitting]
Cross validation
[Figure: training error vs. testing error curves]
Bias vs Variance
[Link]
Regression: statistics or machine learning
• Machine learning: learn a function
• Statistics: learn a distribution
Likelihood
• Maximize log likelihood:
What does it mean?
• The least-squares regression corresponds to finding the maximum
likelihood estimate of θ.
• The target y follows a Gaussian distribution.
[Link]
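A sketch of the standard derivation: assume $y^{(i)}=\theta^{\top}x^{(i)}+\epsilon^{(i)}$ with i.i.d. $\epsilon^{(i)}\sim\mathcal{N}(0,\sigma^{2})$. The log likelihood is
$$\ell(\theta)=\sum_{i=1}^{m}\log\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(y^{(i)}-\theta^{\top}x^{(i)})^{2}}{2\sigma^{2}}\Big)= m\log\frac{1}{\sqrt{2\pi}\,\sigma}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{m}\big(y^{(i)}-\theta^{\top}x^{(i)}\big)^{2}$$
so maximizing $\ell(\theta)$ over $\theta$ is equivalent to minimizing the sum of squared errors.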
Logistic Regression
• Despite the name, this is classification.
• Y is usually called the “label”.
Logistic or Sigmoid Function
[Link]
Logistic Function
• Maximum log likelihood:
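For reference, the standard forms: the sigmoid function is
$$\sigma(z)=\frac{1}{1+e^{-z}},\qquad h_\theta(x)=\sigma(\theta^{\top}x)$$
and with $P(y=1\mid x;\theta)=h_\theta(x)$ for labels $y\in\{0,1\}$, the log likelihood to maximize is
$$\ell(\theta)=\sum_{i=1}^{m}\Big[\,y^{(i)}\log h_\theta(x^{(i)})+\big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]$$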
Regularization
• Simple Linear Regression
• Minimize the residual sum of squares
• RSS = $\sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\big)^{2}$
• Regularization avoids overfitting by penalizing large regression coefficients.
[Link]
Lasso: Least Absolute Shrinkage and Selection Operator
• Lasso performs L1 regularization, where λ is the tuning parameter for the L1 penalty (the penalized objective is sketched after this list).
• λ = 0 – no coefficients are eliminated; result is the same as least squares
regression
• λ = ∞ – all coefficients are eliminated
• As λ increases, bias increases
• As λ decreases, variance increases
• L1 can reduce coefficients to zero and eliminate the variable.
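The L1-penalized objective referred to above, in its standard form:
$$\hat{\beta}^{\text{lasso}}=\arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$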
Ridge regression
• Ridge implements L2 regularization, where λ is the tuning parameter for the L2 penalty (the penalized objective is sketched after this list).
• λ = 0 – no coefficients are eliminated; result is the same as least squares
regression
• λ = ∞ – all coefficients are reduced and approach zero
• As λ increases, bias increases
• As λ decreases, variance increases
• All coefficients are shrunk by the same factor (none are eliminated).
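The L2-penalized (ridge) objective, in its standard form:
$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}+\lambda\sum_{j=1}^{p}\beta_j^{2}$$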
Lasso vs. Ridge Regression
• Lasso method produces a
simpler and more
interpretable model.
• Lasso can reduce coefficients
to zero, performing feature
selection and eliminating
variables of no value.
• Lasso tends to be more accurate than Ridge when only a few predictors truly matter; Ridge tends to do better when many predictors have moderate effects.
• Cross-validation should be used to choose between them and to tune λ.
Lasso exercise
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))      # Training set score: 0.95
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))            # Test set score: 0.61

from sklearn.linear_model import Lasso
lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))   # Training set score: 0.29
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))         # Test set score: 0.21
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))       # Number of features used: 4
Lasso exercise
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso001.coef_ != 0)))

lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

plt.plot(lasso.coef_, 's', label="Lasso alpha=1")
plt.plot(lasso001.coef_, '^', label="Lasso alpha=0.01")
plt.plot(lasso00001.coef_, 'v', label="Lasso alpha=0.0001")
plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.show()
Values of Lambda
• Through cross-validation, we can determine the value of λ that minimizes the out-of-sample loss.
[Figure: cross-validation curve over log(λ), showing λmin, λmax, and the number of non-zero coefficients]
[Link]
[Link]/en/latest/glmnet_vignette.html
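A minimal sketch of this selection in scikit-learn using LassoCV (the alpha grid, cv=5, and max_iter values are illustrative choices; scikit-learn's alpha plays the role of λ):

from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validate over a grid of alpha (lambda) values and keep the best one
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=100000).fit(X_train, y_train)
print("Best alpha:", lasso_cv.alpha_)
print("Test set score: {:.2f}".format(lasso_cv.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso_cv.coef_ != 0))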
Generalized Linear Models
• Regression
• Classification
• The exponential family: $p(y;\eta)=b(y)\exp\big(\eta^{\top}T(y)-a(\eta)\big)$
  – η: the natural parameter
  – T(y): the sufficient statistic
  – a(η): the log partition function
McCullagh and Nelder, Generalized Linear Models (2nd ed.)
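As a standard illustration of the exponential family (not from the slides): the Bernoulli distribution can be written as
$$p(y;\phi)=\phi^{y}(1-\phi)^{1-y}=\exp\!\Big(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\Big)$$
so $\eta=\log\frac{\phi}{1-\phi}$, $T(y)=y$, $a(\eta)=-\log(1-\phi)=\log(1+e^{\eta})$, and $b(y)=1$. Inverting $\eta$ gives $\phi=1/(1+e^{-\eta})$, the sigmoid, which is how logistic regression arises as the GLM for binary labels.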
Summary
• Relationship between AI, ML, and Supervised Learning.
• Linear Regression from probability and machine learning perspective.
• Gradient descent: one method to learn the function.
• Linear in terms of coefficients.
• Bias vs Variance.
• Regularization.
• Generalized linear models.