UNIT 3: Regression
BTCS 618‐18
Dr. Vandana Mohindru
Topics to be discussed
• Introduction to Regression
• Need and Applications of Regression
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Evaluating Regression Models Performance
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• R-squared
• Scatter plot
Introduction to Regression
• Regression is a statistical method for modeling the relationship between
a dependent (target) variable and one or more independent (predictor)
variables.
• More specifically, regression analysis helps us understand how the value
of the dependent variable changes with respect to one independent
variable while the other independent variables are held fixed. It
predicts continuous/real values such as temperature, age, salary,
price, etc.
• We can understand the concept of regression analysis using the below
example:
• Example: Suppose a marketing company A runs various advertisements
every year and earns sales from them. The list below shows the amounts
the company spent on advertising in the last 5 years and the
corresponding sales:
[Table: advertising spend and corresponding sales for the last 5 years]
Introduction to Regression
• Now, the company wants to spend $200 on advertising in the year 2019
and wants to predict the sales for that year. To solve such prediction
problems in machine learning, we need regression analysis.
Introduction to Regression
• Regression is a supervised learning technique that helps in finding the
correlation between variables and enables us to predict the continuous
output variable based on one or more predictor variables.
• It is mainly used for prediction, forecasting, time-series modeling, and
determining the cause-and-effect relationship between variables.
• In regression, we find the line or curve that best fits the given data
points; using this fitted line, the machine learning model can make
predictions about the data.
• In simple words, "Regression draws a line or curve through the data
points on the target-predictor graph in such a way that the vertical
distance between the data points and the regression line is
minimum." The distance between the data points and the line tells
whether the model has captured a strong relationship or not.
Terminologies Related to the Regression Analysis
• Dependent Variable: The main factor in regression analysis that we want to
predict or understand is called the dependent variable. It is also called the
target variable.
• Independent Variable: The factors that affect the dependent variable, or that
are used to predict its values, are called independent variables, also called
predictors.
• Outliers: An outlier is an observation with either a very low or a very
high value compared to the other observed values. An outlier may distort
the result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with
each other, the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the
most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the training
dataset but not with the test dataset, the problem is called overfitting.
And if our algorithm does not perform well even on the training dataset,
the problem is called underfitting.
Need and Applications of Regression
• Regression analysis helps in the prediction of a continuous variable.
There are various real-world scenarios where we need future predictions,
such as weather conditions, sales figures, and marketing trends; for such
cases we need a technique that can make predictions accurately.
• Regression analysis is such a technique: a statistical method used in
machine learning and data science. Below are some other reasons for
using regression analysis:
• Regression estimates the relationship between the target and the
independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing regression, we can confidently determine the most important
factor, the least important factor, and how each factor affects the
others.
Need and Applications of Regression
Some applications of regression are:
• Prediction of rain using temperature and other factors
• Determining Market trends
• Prediction of road accidents due to rash driving.
Simple Linear Regression
• Simple Linear Regression is a type of regression algorithm that models
the relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is
linear (a sloped straight line), hence it is called Simple Linear
Regression.
• The key point in Simple Linear Regression is that the dependent variable
must be a continuous/real value. The independent variable, however,
can be continuous or categorical.
• Simple Linear regression algorithm has mainly two objectives:
• Model the relationship between the two variables. Such as the relationship
between Income and expenditure, experience and Salary, etc.
• Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
Simple Linear Regression
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using
the below equation:
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (its value is obtained by putting
x = 0)
a1 = the slope of the regression line, which tells whether the line is
increasing or decreasing
ε = the error term (for a good model it will be negligible)
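Below is a minimal sketch of fitting this model in Python, assuming
scikit-learn is available; the advertising-spend and sales figures are
hypothetical illustration values, not the table from the earlier example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) and sales (y)
x = np.array([90, 120, 150, 100, 130]).reshape(-1, 1)  # predictor, one column
y = np.array([1000, 1300, 1800, 1200, 1380])           # continuous target

model = LinearRegression().fit(x, y)

print("intercept a0:", model.intercept_)  # value of y when x = 0
print("slope a1:", model.coef_[0])        # change in y per unit of x

# Predict sales for an advertising spend of 200
print("prediction for x = 200:", model.predict([[200]]))
```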
Simple Linear Regression
[Plot: training-set observations (green dots) with the fitted regression
line (red)]
• In the given plot, the actual observations are shown as green dots, and
the predicted values lie on the red regression line. The regression line
shows a correlation between the dependent and independent variable.
• The goodness of fit of the line can be judged by computing the
difference between the actual and predicted values. As we can see in
the plot, most of the observations are close to the regression line,
hence our model fits the training set well.
Simple Linear Regression
[Plot: test-set observations (blue) with the red regression line]
• In the given plot, the observations are shown in blue, and the
predictions are given by the red regression line. As we can see, most of
the observations are close to the regression line, hence we can say that
our Simple Linear Regression model is good and able to make good
predictions.
Multiple Linear Regression
• In the previous topic, we learned about Simple Linear Regression, where
a single independent/predictor variable (X) is used to model the
response variable (Y). But there may be various cases in which the
response variable is affected by more than one predictor variable; for
such cases, the Multiple Linear Regression algorithm is used.
• Multiple Linear Regression is an extension of Simple Linear Regression,
as it takes more than one predictor variable to predict the response
variable. We can define it as:
• Multiple Linear Regression is one of the important regression
algorithms which models the linear relationship between a single
dependent continuous variable and more than one independent
variable.
Multiple Linear Regression
Example:
• Prediction of CO2 emission based on engine size and number of cylinders
in a car.
Some key points about MLR:
• For MLR, the dependent or target variable (Y) must be continuous/real,
but the predictor or independent variables may be continuous or
categorical.
• Each feature variable should have a linear relationship with the
dependent variable.
• MLR tries to fit a regression line (a hyperplane) through a
multidimensional space of data points.
Multiple Linear Regression
MLR Equation
In Multiple Linear Regression, the target variable (Y) is a linear
combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is
an extension of Simple Linear Regression, the same form of equation is
applied, now with multiple predictors:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = output/response variable
b0, b1, b2, ..., bn = coefficients of the model
x1, x2, x3, ..., xn = independent/feature variables
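Below is a minimal sketch of fitting an MLR model with two predictors,
echoing the earlier CO2-emission example; the engine sizes, cylinder
counts, and emission values are hypothetical illustration numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data; columns: engine size (litres), number of cylinders
X = np.array([[1.5, 4],
              [2.0, 4],
              [3.0, 6],
              [3.5, 6],
              [5.0, 8]])
y = np.array([140, 160, 210, 230, 300])  # CO2 emission (g/km), hypothetical

model = LinearRegression().fit(X, y)
print("b0 (intercept):", model.intercept_)
print("b1, b2 (coefficients):", model.coef_)

# Predict the emission for a 2.4 L, 4-cylinder engine
print(model.predict([[2.4, 4]]))
```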
Multiple Linear Regression
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the Target and predictor
variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
Applications of Multiple Linear Regression:
• Measuring the effectiveness of the independent variables on the prediction
• Predicting the impact of changes
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the
relationship between a dependent variable (y) and an independent
variable (x) as an nth-degree polynomial. The Polynomial Regression
equation is given below:
y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
• It is also called a special case of Multiple Linear Regression in ML,
because we add polynomial terms to the Multiple Linear Regression
equation to convert it into Polynomial Regression.
• It is a linear model with some modifications made to increase the
accuracy.
• The dataset used for training in Polynomial Regression is non-linear in
nature.
Polynomial Regression
• It makes use of a linear regression model to fit complicated, non-linear
functions and datasets.
• Hence, "In Polynomial Regression, the original features are converted
into polynomial features of the required degree (2, 3, ..., n) and then
modeled using a linear model."
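Below is a minimal sketch of this idea, assuming scikit-learn: the
features are expanded with PolynomialFeatures and then fitted with an
ordinary linear model. The data is synthetic, generated from a quadratic
plus noise:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y follows a quadratic in x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20).reshape(-1, 1)
y = 3 + 2 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 1, 20)

# Expand x into degree-2 polynomial features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print(model.predict([[4.0]]))  # prediction at x = 4
```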
Polynomial Regression
Need for Polynomial Regression:
• If we apply a linear model to a linear dataset, it gives a good result, as
we saw in Simple Linear Regression. But if we apply the same model,
without any modification, to a non-linear dataset, the output will be
poor: the loss function will increase, the error rate will be high, and the
accuracy will decrease.
• So for such cases, where the data points are arranged in a non-linear
fashion, we need the Polynomial Regression model. We can understand
this better using the comparison diagram of a linear dataset and a
non-linear dataset on the next slide.
Polynomial Regression
Need for Polynomial Regression:
[Comparison diagram: a linear dataset fitted well by a straight line vs. a
non-linear dataset fitted well only by a curve]
Polynomial Regression
Need for Polynomial Regression:
• In the above image, we have a dataset that is arranged non-linearly. If
we try to cover it with a linear model, we can clearly see that it hardly
covers any of the data points. A curve, on the other hand, covers most
of the data points; that curve belongs to the Polynomial model.
• Hence, if a dataset is arranged in a non-linear fashion, we should use
the Polynomial Regression model instead of Simple Linear Regression.
Polynomial Regression
Equation of the Polynomial Regression Model:
• Simple Linear Regression equation:   y = b0 + b1x  .........(a)
• Multiple Linear Regression equation:   y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn  .........(b)
• Polynomial Regression equation:   y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ  .........(c)
When we compare the above three equations, we can clearly see that all
three are polynomial equations, differing only in the degree of the
variables. The Simple and Multiple Linear equations are polynomial
equations of degree one, while the Polynomial Regression equation is
linear in its coefficients but of degree n in x. So if we raise the degree of
our linear equation, it becomes a Polynomial Regression equation.
Evaluating Regression Models Performance
• Regression analysis is a subfield of supervised machine learning. It aims
to model the relationship between a certain number of features and a
continuous target variable. Following are the performance metrics used
for evaluating a regression model:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)
4. R-squared
5. Scatter plot
Evaluating Regression Models Performance
Let’s keep one thing in mind: what is an error?
Any deviation from the actual value is an error:
Error = Y(actual) − Y(predicted)
With this in mind, and having understood the need for metrics, let’s dive
into the methods we can use to evaluate our model’s performance.
Evaluating Regression Models Performance
1. Mean Absolute Error (MAE):
MAE = (1/n) Σ |yᵢ − ŷᵢ|
where yᵢ is the actual expected output, ŷᵢ is the model’s prediction, and n
is the number of observations.
It is the simplest evaluation metric for a regression scenario.
Say, yᵢ = [5, 10, 15, 20] and ŷᵢ = [4.8, 10.6, 14.3, 20.1].
Thus, MAE = 1/4 × (|5−4.8| + |10−10.6| + |15−14.3| + |20−20.1|) = 0.4
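The same calculation can be checked in Python, a minimal sketch using the
example values above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([5, 10, 15, 20])
y_pred = np.array([4.8, 10.6, 14.3, 20.1])

# Mean of the absolute errors, computed two equivalent ways
print(np.mean(np.abs(y_true - y_pred)))     # 0.4
print(mean_absolute_error(y_true, y_pred))  # 0.4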
Evaluating Regression Models Performance
2. Mean Squared Error (MSE):
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Here, the error term is squared, which makes MSE more sensitive to
outliers than Mean Absolute Error (MAE).
For the same data,
MSE = 1/4 × ((5−4.8)² + (10−10.6)² + (15−14.3)² + (20−20.1)²) = 0.225
Evaluating Regression Models Performance
3. Root Mean Squared Error (RMSE):
Since MSE is expressed in squared error units, we take the square root of
the MSE, which gives the Root Mean Squared Error (RMSE):
RMSE = √MSE = √((1/n) Σ (yᵢ − ŷᵢ)²)
Thus, RMSE = (0.225)^0.5 ≈ 0.474
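Both values can be checked in Python on the same example data, a minimal
sketch using NumPy and scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([5, 10, 15, 20])
y_pred = np.array([4.8, 10.6, 14.3, 20.1])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
print("MSE:", mse)                        # 0.225
print("RMSE:", np.sqrt(mse))              # ~0.474
```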
Evaluating Regression Models Performance
4. R-Squared:
• R-squared is calculated by dividing the sum of squared residuals (SSres) of the
regression model by the total sum of squares (SStot) of the average model, and
then subtracting this ratio from 1:
R² = 1 − (SSres / SStot)
• R-squared is also known as the Coefficient of Determination. It expresses the
degree to which the input variables explain the variation of the output/
predicted variable.
• An R-squared value of 0.81 tells us that the input variables explain 81% of the
variation in the output variable. The higher the R-squared, the more variation
is explained by the input variables and the better the model.
However, this metric has a limitation, which is addressed by the
Adjusted R-squared.
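A minimal sketch of the R-squared computation, reusing the earlier example
values and checked against scikit-learn’s r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([5, 10, 15, 20])
y_pred = np.array([4.8, 10.6, 14.3, 20.1])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

print(1 - ss_res / ss_tot)       # manual R-squared
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```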
Evaluating Regression Models Performance
5. Scatter Plot: Scatter plots are often used to identify relationships
between two variables, such as experience and salary.
Evaluating Regression Models Performance
5. Scatter Plot:
• The relationship between the two variables is called correlation; the
closer the data comes to forming a straight line, the stronger the
correlation.
• When analyzing scatter plots, the viewer also looks for the slope and
strength of the data pattern.
• Slope refers to the direction of change in one variable when the other
gets bigger.
• Strength refers to the scatter of the plot: if the points are tightly
concentrated around a line, the relationship is strong.
• Scatter plots can also show unusual features of the data set, such as
clusters, patterns, or outliers, that would be hidden if the data were
merely in a table.
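Below is a minimal sketch of drawing such a scatter plot with matplotlib;
the experience and salary values are hypothetical illustration data:

```python
import matplotlib.pyplot as plt

# Hypothetical data: years of experience vs. salary (in thousands)
experience = [1, 2, 3, 4, 5, 6, 7, 8]
salary = [30, 35, 42, 48, 52, 60, 66, 75]

plt.scatter(experience, salary)
plt.xlabel("Experience (years)")
plt.ylabel("Salary (thousands)")
plt.title("Scatter plot: experience vs. salary")
plt.show()
```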