MACHINE LEARNING(BCS602)
MODULE 3
CHAPTER 5
REGRESSION ANALYSIS
5.1 Introduction to Regression
Regression analysis is a set of machine learning methods that predict a continuous outcome variable (y) based on the value of one or more predictor variables (x).
OR
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the cause-and-effect relationship between variables.
Regression fits a line or curve through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized. This distance indicates whether the model has captured a strong relationship or not.
• The function of regression analysis is given by:
y = f(x)
Here, y is called the dependent variable and x is called the independent variable.
Applications of Regression Analysis
➢ Sales of goods or services
➢ Value of bonds in portfolio management
➢ Premiums set by insurance companies
➢ Yield of crop in agriculture
➢ Prices of real estate
5.2 INTRODUCTION TO LINEARITY, CORRELATION AND CAUSATION
A correlation is a statistical summary of the relationship between two variables. It is a core part of exploratory data analysis and a critical aspect of numerous advanced machine learning techniques.
The correlation between two variables can be visualized using a scatter plot.
There are different types of correlation:
Positive Correlation: Two variables are said to be positively correlated when their values move in the same direction: as the value of X increases, so does the value of Y, at a constant rate.
Negative Correlation: Variables X and Y are negatively correlated when their values change in opposite directions: as the value of X increases, the value of Y decreases at a constant rate.
Neutral Correlation: There is no relationship between the changes in variables X and Y. In this case, the values are completely random and do not show any sign of correlation.
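The strength and direction of such relationships are commonly quantified with Pearson's correlation coefficient, which ranges from -1 (perfect negative) through 0 (no correlation) to +1 (perfect positive). A minimal sketch with made-up data, assuming NumPy is available:

import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1                       # moves with x: positive correlation
y_neg = -3 * x + 10                     # moves against x: negative correlation

r_pos = np.corrcoef(x, y_pos)[0, 1]     # +1.0, perfectly positive
r_neg = np.corrcoef(x, y_neg)[0, 1]     # -1.0, perfectly negative
print(r_pos, r_neg)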
Causation
Causation is about a relationship between two variables in which x causes y; this is written as x implies y.
Regression is different from causation. Causation indicates that one event is the result of the
occurrence of the other event; i.e. there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between the input features (variables) and the output (target) variable is fundamental. Whether this relationship is linear or non-linear has significant implications for the choice of algorithm, model complexity, and predictive performance.
A linear relationship creates a straight line when plotted on a graph, whereas a non-linear relationship does not create a straight line but instead creates a curve.
Example:
Linear: the relationship between the hours spent studying and the grades obtained in a class.
Non-linear: the relationship between a vehicle's speed and its braking distance, which grows roughly quadratically rather than in a straight line.
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one
variable is associated with a proportional change in another variable. Mathematically, it can be
represented as y = a * x + b, where y is the output, x is the input, and a and b are constants.
Linear Models: The goal is to find the best-fitting line (a plane or hyperplane in higher dimensions) to the data points. Linear models are interpretable and work well when the relationship between variables is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is
non-linear. In such cases, they may underfit the data, meaning they are too simple to capture
the underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is
not proportional to the change in another variable. Non-linear relationships can take various
forms, such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support
vector machines with non-linear kernels, and neural networks can capture non-linear
relationships. These models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data
are complex or when interactions between variables are non-linear. They have the capacity to
capture intricate patterns.
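As a small sketch of this flexibility, a decision tree can fit a curved pattern that a straight line cannot. The data below is made up, and scikit-learn is assumed to be available:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy non-linear data: y is quadratic in x
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2

tree = DecisionTreeRegressor(max_depth=4)   # piecewise, non-linear fit
tree.fit(X, y)
print(tree.predict([[2.0]]))                # close to the true value 4.0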
Types of Regression
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε,
Where
Y is the dependent variable,
X is the independent variable,
β0 is the intercept,
β1 is the slope (coefficient), and
ε is the error term.
Linear regression is used to establish a linear relationship between two variables and make
predictions based on this relationship. It's suitable for simple scenarios where there's only one
predictor.
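A minimal sketch of fitting this equation with NumPy on made-up data; np.polyfit with degree 1 returns the slope (β1) and intercept (β0) of the least-squares line:

import numpy as np

# Toy data: Y is roughly linear in X with some noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

b1, b0 = np.polyfit(X, Y, deg=1)   # slope and intercept
print(b0, b1)                      # intercept near 0, slope near 2
print(b0 + b1 * 6.0)               # prediction for a new X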
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε,
where
Y is the dependent variable,
X1, X2, ..., Xn are the independent variables,
β0 is the intercept, β1, β2, ..., βn are the coefficients, and
ε is the error term.
Multiple regression allows you to model the relationship between the dependent variable and
multiple predictors simultaneously. It's used when there are multiple factors that may influence
the target variable, and you want to understand their combined effect and make predictions
based on all these factors.
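A minimal sketch using NumPy's least-squares solver on made-up data with two predictors; a leading column of ones is added so the solver also estimates the intercept β0:

import numpy as np

# Toy data: two predictors X1, X2 and one target Y
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([7.0, 6.0, 14.0, 13.0, 18.0])

A = np.column_stack([np.ones(len(X)), X])      # rows of [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # [b0, b1, b2]
print(coef)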
Polynomial Regression:
Polynomial regression is an extension of linear regression (fitted as a multiple regression on powers of X) used when the relationship between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic
or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a curve
rather than a straight line.
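A minimal sketch with NumPy on made-up quadratic data; np.polyfit returns the coefficients from the highest power down:

import numpy as np

# Toy data following a quadratic trend
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = np.array([4.1, 0.9, 0.1, 1.2, 3.8])

b2, b1, b0 = np.polyfit(X, Y, deg=2)   # fits Y = b2*X^2 + b1*X + b0
print(b0, b1, b2)
print(np.polyval([b2, b1, b0], 1.5))   # predict at X = 1.5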
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)),
where z is a linear combination of the independent variables: z = β0 + β1X1 + β2X2 + ... + βnXn. A threshold (commonly 0.5) then converts this probability into a binary outcome.
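A minimal sketch of the sigmoid transformation; the coefficient values below are hypothetical and chosen only to illustrate the formula:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients for z = b0 + b1*X1 + b2*X2
b0, b1, b2 = -1.5, 0.8, 0.4
x1, x2 = 2.0, 1.0

p = sigmoid(b0 + b1 * x1 + b2 * x2)   # P(Y = 1)
print(p, int(p >= 0.5))               # probability and thresholded class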
Limitations of Regression
1. Outliers - Outliers are abnormal data points. They can bias the outcome of the regression model, as outliers pull the regression line towards them.
2. Number of cases - The ratio of cases (samples) to independent variables should be at least 20:1; that is, for every explanatory variable there should be at least 20 samples. In extreme cases, at least five samples per variable are required.
3. Missing data - Missing data in training data can make the model unfit for the sampled data.
4. Multicollinearity - If explanatory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity means a perfect correlation of 1. The remedy is to remove one of the explanatory variables exhibiting such high correlation; if there is a tie, the tolerance (1 - R squared) of each variable is used to decide which one to eliminate.
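A quick way to screen for multicollinearity before fitting is to inspect the correlation matrix of the predictors. A minimal sketch with made-up data, applying the 0.9 rule of thumb from above:

import numpy as np

# Toy predictors: x3 is almost an exact multiple of x1, so they are collinear
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 3.0, 6.0, 2.0, 7.0])
x3 = 2.0 * x1 + 0.01

corr = np.corrcoef([x1, x2, x3])   # pairwise correlation matrix
print(np.round(corr, 2))           # |corr(x1, x3)| of about 1 flags a problem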
5.3 INTRODUCTION TO LINEAR REGRESSION
A linear regression model can be created by fitting a line among the scattered data points. The line is of the form:
y = a0 + a1x + e
where a0 is the intercept, a1 is the slope of the line, and e is the error term.
The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called the error. The errors are also mutually independent and have the same distribution, such as a normal distribution with zero mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory
variables.
4. The unknown parameters of the regression models are constants.
Ordinary Least Square Approach
The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a
linear regression model. Aim: To find the values of the linear regression model's parameters
(i.e., the coefficients) that minimize the sum of the squared residuals.
In mathematical terms, this can be written as: Minimize ∑(yi – ŷi)^2
where yi is the actual value, ŷi is the predicted value.
A linear regression model used for determining the value of the response variable, ŷ, can be
represented as the following equation.
y = b0 + b1x1 + b2x2 + … + bnxn + e
• where: y is the dependent variable, b0 is the intercept, and e is the error term
• b1, b2, …, bn are the coefficients of the independent variables x1, x2, …, xn
The coefficients b1, b2, …, bn are the regression coefficients. The OLS method estimates the unknown parameters (b1, b2, …, bn) by minimizing the sum of squared residuals (RSS), also termed the sum of squared errors (SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically, the line equations for the points are:
y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2
…
yn = (a0 + a1xn) + en
In general, ei = yi - (a0 + a1xi).
Linear Regression Example
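As an illustrative sketch with toy data, the least-squares line y = a0 + a1x can be computed directly from the closed-form estimates a1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2 and a0 = ȳ - a1x̄:

import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form least-squares estimates
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a0 = y_mean - a1 * x_mean

y_hat = a0 + a1 * x                    # fitted values
sse = np.sum((y - y_hat) ** 2)         # sum of squared errors (SSE)
print(a0, a1, sse)

The same coefficients are obtained from the matrix form shown in the next section.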
Linear Regression in Matrix Form
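In matrix form, the model is written as y = Xβ + e, where X is the design matrix (one row per observation, with a leading column of ones for the intercept), and the OLS estimate is β = (X^T X)^(-1) X^T y. A minimal NumPy sketch with the same toy data as above:

import numpy as np

# Toy data: one predictor plus an intercept column of ones
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)                                 # [b0, b1]

In practice, np.linalg.lstsq (or np.linalg.solve on the normal equations) is preferred over an explicit matrix inverse for numerical stability.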