Unit 3

Regression Analysis in Machine Learning

Regression analysis is a statistical method for modelling the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us understand how the value of the dependent variable changes with
one independent variable while the other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose a marketing company A runs various advertisements every year and earns
sales from them. The below list shows the advertisements made by the company in the last 5
years and the corresponding sales:

Now the company wants to spend $200 on advertisement in the year 2019 and wants to
predict the sales for that year. To solve such prediction problems in machine learning, we
need regression analysis.

Regression is a supervised learning technique that helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and
determining cause-and-effect relationships between variables.

In regression, we fit a line or curve to the given datapoints; using this fit, the machine
learning model can make predictions about the data. In simple words, "Regression shows a
line or curve that passes through the datapoints on the target-predictor graph in such a way
that the vertical distance between the datapoints and the regression line is minimum." The
distance between the datapoints and the line tells whether the model has captured a strong
relationship or not.
Some examples of regression are:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:


o Dependent Variable: The main factor that we want to predict or understand in
regression analysis is called the dependent variable. It is also called the target
variable.
o Independent Variable: The factors that affect the dependent variable, or that are
used to predict its values, are called independent variables, also known as
predictors.
o Outliers: An outlier is an observation with a very low or very high value compared
to the other observed values. An outlier may distort the result, so it should be handled
carefully.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in the dataset,
because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well on the training dataset
but not on the test dataset, the problem is called overfitting. And if our algorithm
does not perform well even on the training dataset, the problem is called
underfitting.

Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable.
There are various real-world scenarios where we need future predictions, such as weather
conditions, sales, marketing trends, etc. For such cases we need a technique that can make
predictions accurately, and regression analysis is such a statistical method, used in both
machine learning and data science. Below are some other reasons for using regression
analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least
important factor, and how each factor affects the others.

Types of Regression

There are various types of regression used in data science and machine learning. Each type
has its own importance in different scenarios, but at the core, all regression methods analyze
the effect of the independent variables on the dependent variable. Here we discuss some
important types of regression, given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method used for predictive analysis.
o It is one of the simplest algorithms; it works on regression problems and shows the
relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence the name linear regression.
o If there is only one input variable (x), it is called simple linear regression; if there is
more than one input variable, it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be explained
using the below image, where we predict the salary of an employee on the basis of
years of experience.

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a (the slope) and b (the intercept) are the linear coefficients
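To make the equation concrete, here is a minimal sketch (not from the original text) that fits Y = aX + b with NumPy; the experience/salary numbers are invented for illustration.

import numpy as np

X = np.array([1, 2, 3, 4, 5])        # e.g., years of experience
Y = np.array([30, 35, 42, 48, 55])   # e.g., salary in thousands

a, b = np.polyfit(X, Y, deg=1)       # a degree-1 fit returns slope a and intercept b
print(f"Y = {a:.2f}X + {b:.2f}")
print(np.polyval([a, b], 6))         # predict Y for a new input X = 6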

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:
o Logistic regression is another supervised learning algorithm, used to solve
classification problems. In classification problems, the dependent variable is in a
binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes
or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm that works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to
model the data. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:

o It uses the concept of threshold levels: values above the threshold level are rounded
up to 1, and values below the threshold level are rounded down to 0.
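As a quick illustration of the S-curve and thresholding described above, here is a small sketch assuming only NumPy; the score values are made up.

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)) squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-2.0, -0.5, 0.0, 0.7, 3.0])
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)   # values at or above the 0.5 threshold become 1
print(probs)    # approximately [0.119 0.378 0.5 0.668 0.953]
print(labels)   # [0 0 1 1 1]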

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression that models a non-linear dataset
using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear fashion; in
such a case, linear regression will not fit those datapoints well. To cover such
datapoints, we need polynomial regression.
o In polynomial regression, the original features are transformed into polynomial
features of a given degree and then modelled using a linear model, which means the
datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ..... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients,
and x is our independent/input variable.
o The model is still linear because the coefficients enter linearly, even though the
features themselves are quadratic, cubic, and so on.

Note: This differs from Multiple Linear Regression in that polynomial regression raises
a single variable to different degrees instead of using multiple variables with the same
degree.
Support Vector Regression:

Support Vector Machine is a supervised learning algorithm that can be used for regression
as well as classification problems. When we use it for regression problems, it is termed
Support Vector Regression.

Support Vector Regression is a regression algorithm that works for continuous variables.
Below are some keywords used in Support Vector Regression:

o Kernel: A function used to map lower-dimensional data into higher-dimensional
data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR
it is the line that helps predict the continuous variable and covers most of the
datapoints.
o Boundary lines: The two lines drawn apart from the hyperplane that create a margin
for the datapoints.
o Support vectors: The datapoints that are nearest to the hyperplane and the opposite
class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that the
maximum number of datapoints is covered within that margin. The main goal of SVR is to
include as many datapoints as possible within the boundary lines, with the hyperplane
(best-fit line) covering the maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
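A minimal SVR sketch with scikit-learn is shown below; the toy data, kernel, and parameter values are assumptions for illustration, not part of the original text.

import numpy as np
from sklearn.svm import SVR

X = np.arange(1, 11).reshape(-1, 1)                          # a single continuous feature
y = 2.5 * X.ravel() + np.random.RandomState(0).normal(0, 1, 10)

# epsilon sets the width of the tube between the boundary lines; points
# inside the tube contribute no loss, so SVR tries to cover them all
svr = SVR(kernel="rbf", C=100, epsilon=0.5)
svr.fit(X, y)
print(svr.predict([[11]]))                                   # predict for an unseen input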

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm that can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute, each branch represents the result of the test, and
each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node (the whole dataset), which
splits into left and right child nodes (subsets of the dataset). These child nodes are
further divided into their own children, and so become the parents of those nodes.
Consider the below image:

The above image shows an example of Decision Tree regression; here, the model is trying to
predict a person's choice between a sports car and a luxury car.
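A hedged sketch of decision tree regression with scikit-learn follows; the data and max_depth value are invented for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([10, 12, 20, 22, 35, 38])

# max_depth limits how many times the dataset is split into child nodes
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)
print(tree.predict([[3.5]]))   # the input falls into one leaf; the output is that leaf's mean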

Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable
of performing both regression and classification tasks.
o Random forest regression is an ensemble learning method that combines multiple
decision trees and predicts the final output as the average of the individual tree
outputs. The combined decision trees are called base models, and the ensemble can
be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ....


o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble
learning, in which the aggregated decision trees run in parallel and do not interact
with each other.
o With the help of random forest regression, we can reduce overfitting in the model
by creating random subsets of the dataset.
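The following minimal sketch shows random forest regression with scikit-learn; n_estimators is the number of bagged trees whose outputs are averaged, and the data is illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([5, 7, 9, 15, 18, 24, 27, 33])

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[4.5]]))   # the average of the individual tree predictions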
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a
small amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the ridge regression penalty. We
compute this penalty term by multiplying lambda by the squared weight of each
individual feature.
o The equation for ridge regression will be:

Cost = Σ(y_i − ŷ_i)² + λ Σ w_j²

o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables; to solve such problems, ridge regression can be
used.
o Ridge regression is a regularization technique used to reduce the complexity of the
model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
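Here is a minimal Ridge sketch with scikit-learn, where the alpha parameter plays the role of lambda in the penalty above; the data is invented for illustration.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))                       # three features, one of them irrelevant
y = 3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 50)

ridge = Ridge(alpha=1.0)                           # alpha is the L2 penalty strength (lambda)
ridge.fit(X, y)
print(ridge.coef_)                                 # coefficients shrunk toward zero, never exactly zero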

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model.
o It is similar to ridge regression, except that the penalty term contains the absolute
values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge
regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for lasso regression will be:

Cost = Σ(y_i − ŷ_i)² + λ Σ |w_j|
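For comparison with Ridge, a matching Lasso sketch on the same kind of data; note how the absolute-value penalty can drive the irrelevant coefficient exactly to zero. The data and alpha value are assumptions.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 50)   # feature 1 does not affect y

lasso = Lasso(alpha=0.1)                           # alpha is the L1 penalty strength (lambda)
lasso.fit(X, y)
print(lasso.coef_)                                 # the irrelevant coefficient is driven to exactly 0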

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The x and y values are the training data used to build the Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases
on X-axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases
on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best-fit line, which means
minimizing the error between the predicted values and the actual values. The best-fit line
will have the least error.

Different values of the weights or line coefficients (a0, a1) give different regression lines, so
we need to calculate the best values for a0 and a1 to find the best-fit line; to do this, we use a
cost function.

Cost function-
o Different values of the weights or line coefficients (a0, a1) give different regression
lines, and the cost function is used to estimate the coefficient values for the best-fit
line.
o The cost function optimizes the regression coefficients or weights and measures how
well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as the hypothesis function.

For linear regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
linear equation above, MSE can be calculated as:

MSE = (1/N) Σ (y_i − (a1x_i + a0))²

Where,

N = total number of observations
y_i = actual value
(a1x_i + a0) = predicted value

Residuals: The distance between an actual value and the corresponding predicted value is
called the residual. If the observed points are far from the regression line, the residuals will
be high, and so will the cost function. If the scatter points are close to the regression line, the
residuals will be small, and hence so will the cost function.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts from randomly selected coefficient values and then iteratively updates them
to reach the minimum of the cost function.
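The bullets above can be turned into a bare-bones sketch; the learning rate, iteration count, and data below are assumptions for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9])

a0, a1 = 0.0, 0.0          # start from arbitrary coefficient values
lr = 0.01                  # learning rate
N = len(x)

for _ in range(2000):
    error = (a0 + a1 * x) - y
    # gradients of MSE = (1/N) * sum(error^2) with respect to a0 and a1
    a0 -= lr * (2.0 / N) * error.sum()
    a1 -= lr * (2.0 / N) * (error * x).sum()

print(a0, a1)              # approaches intercept ~0 and slope ~2 for this data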

Model Performance:

The goodness of fit determines how well the regression line fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by the below method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.

o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted values and
the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:

R² = 1 − (Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²)
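In practice, R-squared can be computed directly; a short sketch with scikit-learn is shown below, using invented value pairs.

from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.3, 6.9, 9.1]
print(r2_score(y_actual, y_predicted))   # a value close to 1 indicates a good fit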

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are formal checks to
perform while building a Linear Regression model, ensuring the best possible result from the
given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the
predictors and the target variable; in other words, it is difficult to determine which
predictor variable is affecting the target variable and which is not. So the model
assumes either little or no multicollinearity between the features or independent
variables.
o Homoscedasticity Assumption:
Homoscedasticity is the situation in which the error term has the same variance across
all values of the independent variables. With homoscedasticity, there should be no
clear pattern in the distribution of data points in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the
error terms are not normally distributed, confidence intervals will become either too
wide or too narrow, which may cause difficulties in estimating the coefficients.
Normality can be checked using a Q-Q plot (see the sketch after this list); if the plot
shows a straight line without large deviations, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs when there is a dependency between residual errors.
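A quick way to eyeball the normality check above is to plot the residuals; the sketch below draws a Q-Q plot with statsmodels, where residuals is a hypothetical array of (actual - predicted) errors.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

residuals = np.random.RandomState(0).normal(0, 1, 100)  # stand-in residuals for illustration
sm.qqplot(residuals, line="45")    # points hugging the 45-degree line suggest normal errors
plt.show()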

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of regression algorithm that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear, a sloped straight line, hence the name Simple
Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0 = the intercept of the regression line (obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or
decreasing
ε = the error term (negligible for a good model)

Implementation of Simple Linear Regression Algorithm using Python

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:

o to find out whether there is any correlation between these two variables,
o to find the best-fit line for the dataset, and
o to see how the dependent variable changes as the independent variable changes.

In this section, we will create a Simple Linear Regression model to find out the best fitting
line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we
need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing. We
have already done it earlier in this tutorial. But there will be some changes, which are given
in the below steps:

o First, we will import the three important libraries, which will help us load the
dataset, plot graphs, and create the Simple Linear Regression model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Next, we will load the dataset into our code:

data_set= pd.read_csv('Salary_Data.csv')

By executing the above line of code (Ctrl+Enter), we can view the dataset on the Spyder IDE
screen by clicking on the Variable Explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.

Note: In Spyder IDE, the folder containing the code file must be set as the working
directory, and the dataset or CSV file should be in the same folder.
o After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable is
salary. Below is code for it:

x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values

In the above lines of code, for the x variable we have used -1, since we want to drop the last
column from the dataset. For the y variable we have used 1 as the parameter, since we want
to extract the second column (indexing starts from zero).

By executing the above lines of code, we will get the output for the X and Y variables as:

In the above output image, we can see that the X (independent) variable and Y (dependent)
variable have been extracted from the given dataset.

o Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We are splitting our dataset so that we can train our model using a
training dataset and then test the model using a test dataset. The code for this is given
below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider
the below images:

Test-dataset:
Training Dataset:
o For Simple Linear Regression, we will not use feature scaling, because the Python
library takes care of it in this case, so we don't need to perform it here. Our dataset is
now well prepared, and we are going to start building the Simple Linear Regression
model for the given problem.

Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model module from scikit-learn. After importing
the class, we will create an object of the class named regressor. The code for this is given
below:

#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
In the above code, we have used the fit() method to fit our Simple Linear Regression object
to the training set. To the fit() function we have passed x_train and y_train, our training data
for the independent and dependent variables. We have fitted our regressor object to the
training set so that the model can easily learn the correlations between the predictor and
target variables. After executing the above lines of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Step: 3. Prediction of test set result:

In the previous step, we fitted our model to a training set consisting of a dependent variable
(salary) and an independent variable (experience). So now our model is ready to predict the
output for new observations. In this step, we will provide the test dataset (new observations)
to the model to check whether it can predict the correct output or not.

We will create prediction vectors y_pred and x_pred, which will contain the predictions for
the test dataset and the training set, respectively.

#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will appear in
the Variable Explorer, containing the salary predictions for the test set and the training set,
respectively.

Output:

You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these values,
we can check how good our model is performing.

Step: 4. Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we already imported in the pre-processing step. The
scatter() function will create a scatter plot of the observations.

On the x-axis we will plot the years of experience of the employees, and on the y-axis their
salary. To the function we will pass the real values of the training set, i.e., the years of
experience x_train, the training-set salaries y_train, and the color of the observations. Here
we use green for the observations, but it can be any color of your choice.

Now we need to plot the regression line, so we will use the plot() function of the pyplot
library. To this function we will pass the years of experience for the training set, the
predicted salaries for the training set (x_pred), and the color of the line.

Next, we will give the plot a title using the title() function of the pyplot library, passing the
name "Salary vs Experience (Training Dataset)".

After that, we will assign labels to the x-axis and y-axis using the xlabel() and ylabel()
functions.

Finally, we will display all of the above in a graph using show(). The code is given below:

1. mtp.scatter(x_train, y_train, color="green")


2. mtp.plot(x_train, x_pred, color="red")
3. mtp.title("Salary vs Experience (Training Dataset)")
4. mtp.xlabel("Years of Experience")
5. mtp.ylabel("Salary(In Rupees)")
6. mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, we can see the real observation values as green dots, while the predicted
values are covered by the red regression line. The regression line shows a correlation
between the dependent and independent variable.

The goodness of fit of the line can be judged by calculating the difference between the actual
and predicted values. As we can see in the above plot, most of the observations are close
to the regression line; hence our model is good for the training set.
Step: 5. Visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training set.
Now, we will do the same for the Test set. The complete code will remain the same as the
above code, except in this, we will use x_test, and y_test instead of x_train and y_train.

Here we are also changing the color of observations and regression line to differentiate
between the two plots, but it is optional.

#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above line of code, we will get the output as:

In the above plot, the observations are shown in blue and the predictions are given by the red
regression line. As we can see, most of the observations are close to the regression line;
hence we can say our Simple Linear Regression model is good and able to make good
predictions.
Multiple Linear Regression

In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may
be various cases in which the response variable is affected by more than one predictor
variable; for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear Regression, as it
takes more than one predictor variable to predict the response variable. We can define it as:

Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
o Each feature variable must model a linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of datapoints.

MLR equation:

In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear
Regression, the same form applies, and the equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn ............... (a)

Where,

Y= Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:


o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five columns: R&D
Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year.
Our goal is to create a model that can easily determine which company has the maximum
profit, and which factor affects a company's profit the most.

Since we need to find the Profit, it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial.
This process contains the below steps:

o Importing libraries: First we will import the libraries that will help in building the
model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains
all the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:

In the above output, we can clearly see that there are five variables, four of which are
continuous and one of which is categorical.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],


[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains a categorical variable, which is
not suitable to apply directly when fitting the model. So we need to encode this variable.

Encoding Dummy Variables:

We have one categorical variable (State), which cannot be applied to the model directly, so
we will encode it. Encoding the categories as plain numbers is not sufficient, because it
imposes a relational order between the states, which may produce a wrong model. To remove
this problem, we use OneHotEncoder, which creates the dummy variables. (In current
versions of scikit-learn, the encoder is applied to a single column through a
ColumnTransformer; the older LabelEncoder plus categorical_features approach has been
removed.) Below is the code for it:

#Categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# one-hot encode column 3 (State); the dummy columns are placed first,
# followed by the untouched numeric columns
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
x = ct.fit_transform(x)

Here we are only encoding one independent variable (State), since the other variables are
continuous.

Output:
As we can see in the above output, the State column has been converted into dummy
variables (0 and 1), where each dummy-variable column corresponds to one state. We can
check this by comparing with the original dataset: the first column corresponds to
the California state, the second column to the Florida state, and the third column to
the New York state.

Note: We should not use all the dummy variables at the same time; we must use one less
than the total number of dummy variables, otherwise we will run into the dummy variable trap.
o Now, we are writing a single line of code just to avoid the dummy variable trap:

#avoiding the dummy variable trap:
x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce multicollinearity in the
model.

As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given
below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

The above code will split our dataset into a training set and a test set.

Output: You can check the output by clicking on the Variable Explorer option in the Spyder
IDE. The test set and training set will look like the images below:

Test set:

Training set:
Note: In MLR, we will not do feature scaling as it is taken care by the library, so we don't
need to do it manually.
Step: 2- Fitting our MLR model to the Training set:

Now we have prepared our dataset well, so we will fit our regression model to the training
set, just as we did in the Simple Linear Regression model. The code for this will be:

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, we have successfully trained our model using the training dataset. In the next step, we
will test the performance of the model using the test dataset.
Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the
code for it:

#Predicting the Test set result
y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test set values.

Output:

In the above output, we have the predicted result set and the test set. We can check the model
performance by comparing these two values index by index. For example, the first index has
a predicted profit of $103,015 and a test/real profit of $103,282. The difference is only $267,
which is a good prediction. So, finally, our model is complete here.
o We can also check the score for training dataset and test dataset. Below is the code for
it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446
Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

o Effectiveness of independent variables on prediction

o Predicting the impact of changes

ML Polynomial Regression

o Polynomial Regression is a regression algorithm that models the relationship between
a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.
The Polynomial Regression equation is given below:

y = b0 + b1x1 + b2x1^2 + b3x1^3 + ...... + bnx1^n


o It is also called a special case of Multiple Linear Regression in ML, because we add
some polynomial terms to the multiple linear regression equation to convert it into
polynomial regression.
o It is a linear model with some modifications made to increase the accuracy.
o The dataset used for training in polynomial regression is non-linear in nature.
o It makes use of a linear regression model to fit complicated, non-linear functions
and datasets.
o Hence, "In polynomial regression, the original features are converted into
polynomial features of the required degree (2, 3, ..., n) and then modelled using a
linear model."

Need for Polynomial Regression:

The need for polynomial regression in ML can be understood from the points below:

o If we apply a linear model to a linear dataset, it gives us a good result, as we saw in
Simple Linear Regression. But if we apply the same model, without any modification,
to a non-linear dataset, it will produce drastically worse results: the loss function will
increase, the error rate will be high, and the accuracy will decrease.
o So for such cases, where datapoints are arranged in a non-linear fashion, we
need the polynomial regression model. We can understand this better using the
below comparison diagram of a linear dataset and a non-linear dataset.

o In the above image, we have a dataset that is arranged non-linearly. If we try to cover
it with a linear model, we can clearly see that it hardly covers any datapoints. A
curve, on the other hand, is suitable to cover most of the datapoints; that is the
polynomial model.
o Hence, if datasets are arranged in a non-linear fashion, we should use the polynomial
regression model instead of simple linear regression.

Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression,
because linearity refers not to the variables but to the coefficients, which enter the
equation in a linear fashion.
Equation of the Polynomial Regression Model:

Simple Linear Regression equation: y = b0 + b1x .........(a)

Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n ..........(c)

When we compare the above three equations, we can clearly see that all three are polynomial
equations that differ only in the degree of the variables. The simple and multiple linear
equations are polynomial equations of degree one, and the polynomial regression equation is
a linear equation of degree n. So if we add higher degrees to our linear equations, they are
converted into polynomial linear equations.

Note: To better understand Polynomial Regression, you must have knowledge of Simple
Linear Regression.
Implementation of Polynomial Regression using Python:

Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first,
let's understand the problem for which we are going to build the model.

Problem Description: A Human Resources company is going to hire a new candidate. The
candidate has stated that his previous salary was 160K per annum, and HR has to check
whether he is telling the truth or bluffing. To do so, they only have a dataset from his
previous company, in which the salaries of the top 10 positions are listed with their levels.
On checking the available dataset, we find that there is a non-linear relationship between the
position levels and the salaries. Our goal is to build a Bluffing detector regression model,
so HR can hire an honest candidate. Below are the steps to build such a model.

Steps for Polynomial Regression:

The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.

Note: Here, we will build the Linear Regression model as well as the Polynomial Regression
model to compare the two sets of predictions; the Linear Regression model serves as a reference.

Data Pre-processing Step:

The data pre-processing step will remain the same as in the previous regression models, with
a few changes. In the polynomial regression model, we will not use feature scaling, and we
will also not split our dataset into training and test sets, for two reasons:

o The dataset contains very little information, so it is not suitable to divide into test and
training sets; otherwise our model will not be able to find the correlations between
the salaries and the levels.
o In this model, we want very accurate salary predictions, so the model should have all
the available information.

The code for pre-processing step is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values

Explanation:

o In the above lines of code, we have imported the Python libraries needed to import
the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three
columns (Position, Level, and Salary), of which we will consider only two (Level
and Salary).
o After that, we have extracted the dependent variable (y) and the independent variable
(x) from the dataset. For the x variable, we have used the parameter [:, 1:2], because
we want index 1 (the levels) and include :2 so that x is a matrix rather than a vector.

Output:

By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Position, Level, and
Salary), but we consider only two, because the positions are equivalent to the levels, i.e., the
levels can be seen as the encoded form of the positions.

Here we will predict the output for level 6.5, because the candidate has 4+ years' experience
as a regional manager, so he must be somewhere between levels 6 and 7.

Building the Linear regression model:

Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the
results. The code is given below:

#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)

In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).

Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Building the Polynomial regression model:

Now we will build the Polynomial Regression model; it will be a little different from the
Simple Linear model, because here we will use the PolynomialFeatures class of
the preprocessing module. We use this class to add extra (polynomial) features to our dataset.

#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)

In the above lines of code, we used poly_regs.fit_transform(x) because we first convert our
feature matrix into a polynomial feature matrix and then fit it to the polynomial regression
model. The parameter value (degree= 2) is our choice; we can pick it according to the
polynomial features we want.

After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:
Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly vector to the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple Linear
Regression. Below is the code for it:

#Visualizing the result for Linear Regression model
mtp.scatter(x,y,color="blue")
mtp.plot(x,lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

Output:

In the above output image, we can clearly see that the regression line is far from the
datapoints. Predictions are shown as a red straight line, and the blue points are the actual
values. If we used this output to predict the CEO's salary, it would give approximately
$600,000, which is far from the real value.

So we need a curved model, rather than a straight line, to fit the dataset.

Visualizing the result for Polynomial Regression

Here we will visualize the result of the Polynomial Regression model, whose code differs a
little from the model above.

Code for this is given below:

#Visualizing the result for Polynomial Regression
mtp.scatter(x,y,color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead of
x_poly, because we want the linear regressor object to predict on the polynomial features
matrix.

Output:
As we can see in the above output image, the predictions are close to the real values. The
plot will vary as we change the degree.

For degree = 3:

If we change the degree to 3, we get a more accurate plot, as shown in the below image.

As we can see in the above output image, the predicted salary for level 6.5 is near
170K$-190K$, which suggests that the future employee is telling the truth about his salary.

Degree = 4: Let's change the degree to 4 to get the most accurate plot.
Hence, we can get more accurate results by increasing the degree of the polynomial.
Predicting the final result with the Linear Regression model:

Now we will predict the final output using the Linear Regression model to see whether the
employee is telling the truth or bluffing. For this, we will use the predict() method and pass
the value 6.5. Below is the code for it:

lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)

Output:

[330378.78787879]

Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model to compare
with Linear model. Below is the code for it:

poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)

Output:

[158862.45265153]

As we can see, the predicted output of the Polynomial Regression is [158862.45265153],
which is much closer to the real value; hence we can say that the future employee is telling
the truth.
How to Evaluate a Regression Model?

Model evaluation is very important in data science. It helps you understand the performance
of your model and makes it easy to present your model to other people. There are many
different evaluation metrics out there, but only some of them are suitable for regression. This
article covers the different metrics for regression models and the differences between them.
Hopefully, after you read this post, you will be clear on which metrics to apply to your future
regression models.

Every time I tell my friends, "Hey, I have built a machine learning model to predict XXX,"
their first reaction is, "Cool, so what is the accuracy of your model's predictions?" Well,
unlike classification, accuracy for a regression model is slightly harder to illustrate: it is
impossible to predict the exact value, so what matters is how close your prediction is to the
real value.

There are 3 main metrics for model evaluation in regression:

1. R Square/Adjusted R Square

2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)

3. Mean Absolute Error(MAE)

R Square/Adjusted R Square

R Square measures how much of the variability in the dependent variable can be explained
by the model. It is the square of the correlation coefficient (R), which is why it is called R
Square. R Square is calculated as 1 minus the sum of squared prediction errors divided by
the total sum of squares, where the total sum of squares replaces each prediction with the
mean. R Square lies between 0 and 1, and a bigger value indicates a better fit between the
predictions and the actual values.

R Square is a good measure of how well the model fits the dependent variable. However, it
does not take the overfitting problem into consideration. If your regression model has many
independent variables, the model may be too complicated: it can fit the training data very
well but perform badly on testing data. That is why Adjusted R Square was introduced: it
penalises additional independent variables added to the model and adjusts the metric to
prevent the overfitting issue.
#Example on R_Square and Adjusted R Square
import statsmodels.api as sm
X_addC = sm.add_constant(X)
result = sm.OLS(Y, X_addC).fit()
print(result.rsquared, result.rsquared_adj)
# 0.79180307318 0.790545085707

In Python, you can calculate R Square using the statsmodels or sklearn package.

From the sample model, we can interpret that around 79% of the variability in the dependent
variable can be explained by the model, and the Adjusted R Square is roughly the same as
the R Square, meaning the model is quite robust.

Mean Square Error(MSE)/Root Mean Square Error(RMSE)


While R Square is a relative measure of how well the model fits the dependent variable,
Mean Square Error is an absolute measure of the goodness of fit.

MSE is calculated as the sum of the squares of the prediction errors (real output minus
predicted output) divided by the number of data points. It gives you an absolute number
indicating how much your predicted results deviate from the actual numbers. You cannot
interpret many insights from one single result, but it gives you a real number to compare
against other model results and helps you select the best regression model.

Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than
MSE because, firstly, the MSE value can sometimes be too big to compare easily; secondly,
MSE is calculated from the square of the error, so taking the square root brings it back to the
same scale as the prediction error, making it easier to interpret.
from sklearn.metrics import mean_squared_error
import math
print(mean_squared_error(Y_test, Y_predicted))
print(math.sqrt(mean_squared_error(Y_test, Y_predicted)))
# MSE: 2017904593.23
# RMSE: 44921.092965684235

MSE can be calculated in Python using Sklearn Package

Mean Absolute Error(MAE)


Mean Absolute Error (MAE) is similar to Mean Square Error (MSE); however, instead of
the sum of squares of the errors as in MSE, MAE takes the sum of the absolute values of the
errors.

Compared to MSE or RMSE, MAE is a more direct representation of the error terms. MSE
penalises big prediction errors more heavily by squaring them, while MAE treats all errors
the same.
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(Y_test, Y_predicted))
#MAE: 26745.1109986

MAE can be calculated in Python using Sklearn Package

Overall Recommendation/Conclusion

R Square/Adjusted R Square are better for explaining the model to other people, because you
can express the number as a percentage of the output variability. MSE, RMSE, or MAE are
better for comparing performance between different regression models. Personally, I prefer
RMSE, and I believe Kaggle also uses it to assess submissions. However, it makes total
sense to use MSE if the values are not too big, and MAE if you do not want to penalise large
prediction errors.

Adjusted R Square is the only metric here that considers the overfitting problem. R Square
has a direct library in Python to calculate it, but I did not find a direct library for Adjusted R
Square other than the statsmodels results. If you really want to calculate Adjusted R Square,
you can use statsmodels or apply its mathematical formula directly.
