
Regression

Regression analysis is a statistical method used to model the relationship between dependent and independent variables, enabling predictions of continuous values. It includes techniques like simple linear regression, multiple linear regression, and polynomial regression, each suited for different types of data relationships. Applications range from forecasting sales and market trends to predicting outcomes in various fields such as economics and environmental science.

Uploaded by Karuna Salgotra

REGRESSION

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining causal-effect relationships between variables.
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Some examples of regression are:
o Predicting rainfall using temperature and other factors
o Determining market trends
o Predicting road accidents due to rash driving
Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
o Outliers: An outlier is an observation with either a very low or very high value in comparison to the other observed values. An outlier may distort the result, so it should be handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even on the training dataset, the problem is called underfitting.

NEED FOR REGRESSION
Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales, or marketing trends. For such cases we need a technique which can make predictions accurately, and regression analysis is such a statistical method, used in machine learning and data science.
o Regression estimates the relationship between the target
and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.
APPLICATIONS OF REGRESSION:
 Forecasting continuous outcomes like house prices, stock
prices, or sales.
 Predicting the success of future retail sales or marketing
campaigns to ensure resources are used effectively.
 Predicting customer or user trends, such as on streaming
services or ecommerce websites.
 Analyzing datasets to establish the relationships between
variables and an output.
 Predicting interest rates or stock prices from a variety of
factors.
 Creating time series visualizations.
SIMPLE LINEAR REGRESSION
o Linear regression is a statistical regression method which is
used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving the regression problem in machine
learning.
o Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable
(Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear
regression is called simple linear regression. And if there
is more than one input variable, then such linear regression
is called multiple linear regression.
The simple linear regression algorithm has mainly two objectives:
o Model the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
o Forecast new observations, such as forecasting the weather according to temperature, or a company's revenue according to its investments in a year.
How to perform a simple linear regression
Simple linear regression formula
The formula for a simple linear regression is:

y = β0 + β1x + ε
 y is the predicted value of the dependent variable for any given value of the independent variable (x).
 β0 is the intercept, the predicted value of y when x is 0.
 β1 is the regression coefficient – how much we expect y to change as x increases by one unit.
 x is the independent variable (the variable we expect is influencing y).
 ε is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics grades. The last two columns show deviation scores – the difference between the student's score and the average score on each test. The last two rows show the sums and mean scores that we will use to conduct the regression analysis.
For each student, we also need to compute the squares of the deviation scores.
And finally, for each student, we need to compute the product of the deviation scores.

The regression equation is a linear equation of the form ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1.
First, we solve for the regression coefficient (b1):

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

Once we know the value of the regression coefficient (b1), we can solve for the intercept (b0):

b0 = ȳ - b1x̄

Therefore, the regression equation is ŷ = 26.768 + 0.644x. For example, for an aptitude score of x = 80:

ŷ = b0 + b1x
ŷ = 26.768 + 0.644 * 80
ŷ = 26.768 + 51.52 = 78.288
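The deviation-score formulas above can be sketched in Python. The scores below are hypothetical stand-ins, since the original aptitude-test table is not reproduced here; the computation itself is the one described in the text.

```python
# Simple linear regression via deviation scores:
#   b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
def simple_linear_regression(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # regression coefficient (slope)
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical data (the original score table is not shown here)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = simple_linear_regression(xs, ys)
print(b0, b1)  # b0 ≈ 2.2, b1 ≈ 0.6
```

A prediction is then just `b0 + b1 * x`, mirroring ŷ = b0 + b1x above.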


MULTIPLE LINEAR REGRESSION
Multiple linear regression is one of the important regression algorithms; it models the linear relationship between a single continuous dependent variable and more than one independent variable.

Example:
Prediction of CO2 emission based on engine size and number of
cylinders in a car.
Some key points about MLR:
o For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
o Each feature variable must have a linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data points.
ŷ = a + b1x1 + b2x2 + … + bnxn
ŷ represents the dependent variable
a represents the intercept on the dependent-variable axis
n signifies the number of independent variables
x1–xn are the independent variables
b1–bn are the coefficient parameters
Regression sum calculations:
 Σx1² = ΣX1² – (ΣX1)² / n
 Σx2² = ΣX2² – (ΣX2)² / n
 Σx1y = ΣX1y – (ΣX1)(Σy) / n
 Σx2y = ΣX2y – (ΣX2)(Σy) / n
 Σx1x2 = ΣX1X2 – (ΣX1)(ΣX2) / n
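These sums can be computed directly from the raw columns. A minimal Python sketch (the helper name `regression_sums` and the tiny dataset are ours, not from the text):

```python
# Deviation sums used in the two-predictor coefficient formulas.
def regression_sums(X1, X2, y):
    n = len(y)
    Sx1x1 = sum(a * a for a in X1) - sum(X1) ** 2 / n
    Sx2x2 = sum(a * a for a in X2) - sum(X2) ** 2 / n
    Sx1y = sum(a * b for a, b in zip(X1, y)) - sum(X1) * sum(y) / n
    Sx2y = sum(a * b for a, b in zip(X2, y)) - sum(X2) * sum(y) / n
    Sx1x2 = sum(a * b for a, b in zip(X1, X2)) - sum(X1) * sum(X2) / n
    return Sx1x1, Sx2x2, Sx1y, Sx2y, Sx1x2

# Tiny hypothetical check
print(regression_sums([1, 2, 3], [2, 4, 6], [1, 3, 5]))  # → (2.0, 8.0, 4.0, 8.0, 4.0)
```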

Example: Multiple Linear Regression

Suppose we have the following dataset with one response variable y and two predictor variables X1 and X2. Use the following steps to fit a multiple linear regression model to this dataset.
Step 1: Calculate X1², X2², X1y, X2y and X1X2.
Step 2: Calculate the regression sums.
Next, make the following regression sum calculations:
 Σx1² = ΣX1² – (ΣX1)² / n = 38,767 – (555)² / 8 = 263.875
 Σx2² = ΣX2² – (ΣX2)² / n = 2,823 – (145)² / 8 = 194.875
 Σx1y = ΣX1y – (ΣX1)(Σy) / n = 101,895 – (555 * 1,452) / 8 = 1,162.5
 Σx2y = ΣX2y – (ΣX2)(Σy) / n = 25,364 – (145 * 1,452) / 8 = -953.5
 Σx1x2 = ΣX1X2 – (ΣX1)(ΣX2) / n = 9,859 – (555 * 145) / 8 = -200.375
Step 3: Calculate b1 and b2.
The formula to calculate b1 is: [(Σx2²)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]
Thus, b1 = [(194.875)(1,162.5) – (-200.375)(-953.5)] / [(263.875)(194.875) – (-200.375)²] = 3.148
The formula to calculate b2 is: [(Σx1²)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]
Thus, b2 = [(263.875)(-953.5) – (-200.375)(1,162.5)] / [(263.875)(194.875) – (-200.375)²] = -1.656
Step 4: Calculate b0.
The formula to calculate b0 is: ȳ – b1x̄1 – b2x̄2.
Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867


Step 5: Place b0, b1, and b2 in the estimated linear regression
equation.
The estimated linear regression equation is: ŷ = b0 + b1*x1 +
b2*x2
In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
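As a check, the coefficient formulas can be evaluated in Python from the regression sums given in the example (a verification sketch, reproducing the text's numbers):

```python
# Two-predictor MLR coefficients from the sums computed in Step 2.
Sx1x1, Sx2x2 = 263.875, 194.875
Sx1y, Sx2y = 1162.5, -953.5
Sx1x2 = -200.375
n, sum_y, sum_x1, sum_x2 = 8, 1452, 555, 145

denom = Sx1x1 * Sx2x2 - Sx1x2 ** 2
b1 = (Sx2x2 * Sx1y - Sx1x2 * Sx2y) / denom
b2 = (Sx1x1 * Sx2y - Sx1x2 * Sx1y) / denom
# b0 = y_bar - b1 * x1_bar - b2 * x2_bar
b0 = sum_y / n - b1 * (sum_x1 / n) - b2 * (sum_x2 / n)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # -6.867 3.148 -1.656
```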
POLYNOMIAL REGRESSION
In polynomial regression, the relationship between the independent variable x and the dependent variable y is described as an nth-degree polynomial in x.
Polynomial regression is needed when no linear relationship fits all the variables, so the fitted model looks like a curve rather than a straight line.
TYPES OF POLYNOMIAL REGRESSION

1. Linear – if the degree is 1

2. Quadratic – if the degree is 2

3. Cubic – if the degree is 3, and so on, based on the degree.

MATHEMATICAL EQUATION:
y = a + a1x + a2x² + … + anxⁿ

Let the quadratic polynomial regression model be

y = a + a1x + a2x²

The values of a, a1, and a2 are calculated using the following system of normal equations:

Σy = na + a1Σx + a2Σx²
Σxy = aΣx + a1Σx² + a2Σx³
Σx²y = aΣx² + a1Σx³ + a2Σx⁴

First, we calculate the required sums and note them in the following table.
Using the given data and solving this system of equations, we get

a = 12.4285714
a1 = -5.5128571
a2 = 0.7642857

The required quadratic polynomial model is

y = 12.4285714 - 5.5128571x + 0.7642857x²
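The normal-equation approach above can be sketched in Python. Since the original data table is not reproduced, the data below is hypothetical, generated from a known quadratic; the helper name `quadratic_fit` is ours.

```python
# Fit y = a + a1*x + a2*x^2 by solving the three normal equations:
#   Σy   = n·a   + a1·Σx  + a2·Σx²
#   Σxy  = a·Σx  + a1·Σx² + a2·Σx³
#   Σx²y = a·Σx² + a1·Σx³ + a2·Σx⁴
def quadratic_fit(xs, ys):
    n = len(xs)
    S = lambda k: sum(x ** k for x in xs)                     # Σx^k
    T = lambda k: sum((x ** k) * y for x, y in zip(xs, ys))   # Σx^k·y
    # Augmented matrix [coefficients | right-hand side]
    M = [
        [n,    S(1), S(2), T(0)],
        [S(1), S(2), S(3), T(1)],
        [S(2), S(3), S(4), T(2)],
    ]
    # Gauss-Jordan elimination
    for i in range(3):
        pivot = M[i][i]
        M[i] = [v / pivot for v in M[i]]
        for j in range(3):
            if j != i:
                factor = M[j][i]
                M[j] = [vj - factor * vi for vj, vi in zip(M[j], M[i])]
    return M[0][3], M[1][3], M[2][3]  # a, a1, a2

# Hypothetical data generated from y = 1 + 2x + 3x²
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
a, a1, a2 = quadratic_fit(xs, ys)
print(round(a, 6), round(a1, 6), round(a2, 6))  # recovers 1.0 2.0 3.0
```

Because the data lie exactly on a quadratic, the fit recovers the generating coefficients.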

NEED FOR POLYNOMIAL REGRESSION

o If we apply a linear model to a linear dataset, it gives a good result, as we have seen in simple linear regression. But if we apply the same model, without any modification, to a non-linear dataset, it produces poor results: the loss function increases, the error rate is high, and accuracy decreases.
o So for such cases, where the data points are arranged in a non-linear fashion, we need the polynomial regression model. We can understand this better by comparing a linear dataset and a non-linear dataset.
o For a dataset which is arranged non-linearly, a line fitted by a linear model hardly covers any data points. On the other hand, a curve – the polynomial model – is able to cover most of the data points.
o Hence, if a dataset is arranged in a non-linear fashion, we should use the polynomial regression model instead of simple linear regression.
ADVANTAGES OF POLYNOMIAL REGRESSION
 You can model non-linear relationships between variables.
 There is a large range of different functions that you can use for fitting.
 Good for exploration purposes: you can test for the presence of curvature and its inflections.
 It is a flexible tool that can be used to fit a large variety of data point distributions.
DISADVANTAGES OF POLYNOMIAL REGRESSION
 Even a single outlier in the data plot can seriously mess up the
results.
 PR models are prone to overfitting. If enough parameters are
used, you can fit anything. As John von Neumann reportedly
said: “with four parameters I can fit an elephant, with five I can
make him wiggle his trunk.”
 As a consequence of the previous point, PR models might not generalize well outside of the data used to fit them.
POLYNOMIAL REGRESSION IS USED IN:
 Death rate prediction
 Tissue growth rate prediction
 Speed regulation software
