Experiment 1: Linear regression
Objectives:
1- Comprehend the concept of linear regression using the least square
method.
2- Understand the code in Python used for linear regression.
3- Run real-world applications in Python.
Introduction:
Least Square Method
The least squares method is the process of finding the regression line, or best-fitted
line, for a data set, described by an equation. The method works by minimizing the
sum of the squares of the residuals, the deviations of the data points from the curve
or line, so that the trend of the outcomes is determined quantitatively. Curve fitting
of this kind arises in regression analysis, and fitting equations to derive the curve
by minimizing squared residuals is the least squares method.
Least Square Method Definition
The least-squares method is a statistical method used to find the line of best fit, of
the form y = mx + b, for the given data. The curve of the equation is called the
regression line. Our main objective in this method is to make the sum of the squares
of the errors as small as possible; this is the reason the method is called the
least-squares method. It is often used in data fitting, where the best-fit result
minimizes the sum of squared errors, the differences between the observed values and
the corresponding fitted values. The sum of squared errors quantifies the variation
in the observed data. For example, given 4 data points, this method produces the
following graph.
Figure 1: A least-squares line fitted to four data points.
The two basic categories of least-squares problems are ordinary (linear) least
squares and nonlinear least squares.
Limitations for Least Square Method
Even though the least-squares method is considered the best method to find the
line of best fit, it has a few limitations. They are:
• This method exhibits only the relationship between the two variables.
All other causes and effects are not taken into consideration.
• This method is unreliable when data is not evenly distributed.
• This method is very sensitive to outliers, which can skew the results of
the least-squares analysis, as the short demonstration below shows.
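To see this outlier sensitivity concretely, below is a minimal sketch using NumPy's np.polyfit (its degree-1 fit is an ordinary least-squares line; this demonstration is an illustrative addition, not part of the experiment itself):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])      # perfectly linear data: y = 2x
print(np.polyfit(x, y, 1))          # [slope, intercept] ≈ [2.0, 0.0]
y_outlier = y.copy()
y_outlier[4] = 30                   # corrupt a single point
print(np.polyfit(x, y_outlier, 1))  # ≈ [6.0, -8.0]: one outlier shifts the whole line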
Least Square Method Graph
Look at the graph below: the straight line shows the potential relationship
between the independent variable and the dependent variable. The ultimate goal
of this method is to minimize the difference between the observed responses and
the responses predicted by the regression line; smaller residuals mean a
better-fitting model. The fit is obtained by minimizing the residual of each data
point from the line. Residuals can be measured vertically or perpendicular to the
line: vertical residuals are mostly used for polynomial and hyperplane problems,
while perpendicular residuals are used in general curve fitting, as seen in the
image below.
Figure 2: Vertical and perpendicular residuals from a fitted line.
Least Square Method Formula
The least-squares method finds the curve that best fits a set of observations with a
minimum sum of squared residuals or errors. Let us assume that the given data points
are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent
variables and all y's are dependent ones. This method is used to find
a line of the form y = mx + b, where y and x are variables, m is the slope,
and b is the y-intercept. The formulas to calculate the slope m and the intercept b
are:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
b = (∑y − m∑x) / n
Here, n is the number of data points.
Following are the steps to calculate the least-squares fit using the above formulas.
• Step 1: Draw a table with 4 columns where the first two columns are for the x
and y points.
• Step 2: In the next two columns, find xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of the slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b into the equation y = mx + b.
Let us look at an example to understand this better.
Example: Let's say we have data as shown below.
x    1    2    3    4    5
y    2    5    3    8    7
Solution: We will follow the steps above to find the fitted line.
x         y         xy        x²
1         2         2         1
2         5         10        4
3         3         9         9
4         8         32        16
5         7         35        25
∑x = 15   ∑y = 25   ∑xy = 88  ∑x² = 55
Find the value of m by using the formula:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = [(5×88) − (15×25)] / [(5×55) − (15)²]
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3
Find the value of b by using the formula:
b = (∑y − m∑x) / n
b = (25 − 1.3×15) / 5
b = (25 − 19.5) / 5
b = 5.5/5 = 1.1
So, the required least-squares equation is y = mx + b = 1.3x + 1.1.
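These hand calculations are easy to verify with a few lines of Python (a quick check; np.polyfit is NumPy's built-in least-squares polynomial fit):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])
n = len(x)
# Closed-form formulas from above
m = (n*np.sum(x*y) - np.sum(x)*np.sum(y)) / (n*np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - m*np.sum(x)) / n
print(m, b)                 # 1.3 1.1
# Cross-check with NumPy's built-in degree-1 fit
print(np.polyfit(x, y, 1))  # [1.3 1.1]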
Important Notes
• The least-squares method is used to predict the behavior of the
dependent variable with respect to the independent variable.
• The sum of the squares of the errors measures the unexplained variation of
the observed data around the fitted line.
• The main aim of the least-squares method is to minimize the sum of the
squared errors.
Implementing the least-squares method using Python:
# Linear Regression implementation using NumPy
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 2, 4, 4, 6, 6])
plt.scatter(x, y)
n = len(x)
xy = np.sum(x * y)       # ∑xy
sumx = np.sum(x)         # ∑x
sumy = np.sum(y)         # ∑y
sq_sumx = np.sum(x * x)  # ∑x²
# Note: in this snippet b is the slope and a is the intercept (y = a + b*x)
b = (n * xy - sumx * sumy) / (n * sq_sumx - sumx**2)
print('b = ', b)
a = (sumy - b * sumx) / n
print('a = ', a)
print('The linear regression equation is \ny =', a, '+', b, 'x')
slope = b
intercept = a
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.plot(x, mymodel)
plt.show()
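The same fit can also be cross-checked with scipy.stats.linregress (a sketch, assuming SciPy is installed; linregress additionally reports the correlation coefficient and p-value):
from scipy import stats
result = stats.linregress(x, y)        # x and y from the snippet above
print(result.slope, result.intercept)  # ≈ 0.914 and 0.8, matching b and a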
Implementing regression on a real-world data set:
This example models the relation between Salary and Years of Experience.
Equation: y = mx + c
This is the simple linear regression equation, where c is the constant (y-intercept)
and m is the slope; it describes the relationship between x (the independent
variable) and y (the dependent variable). The coefficient m can be positive or
negative, and it is the change in the dependent variable for every 1-unit change in
the independent variable; for example, if m = 5000, each additional year of
experience raises the predicted salary by 5000. In statistical notation the same
equation is written y = β0 + β1x, where β0 (the y-intercept) and β1 (the slope) are
the coefficients, estimated so that the predicted values match the actual values as
closely as possible.
Implement Simple Linear Regression in Python
In this example, we will use salary data on the experience of
employees. The dataset has two columns: YearsExperience and Salary.
Step 1: Import the required Python packages
We need Pandas for data manipulation, NumPy for numerical calculations,
and Matplotlib and Seaborn for visualizations. The scikit-learn (sklearn) library
is used for the machine learning operations.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Load the dataset
Download the Salary_Data.csv dataset, upload it to your notebook, and read it into
a pandas dataframe.
# Get dataset
df_sal = pd.read_csv('/content/Salary_Data.csv')
df_sal.head()
Step 3: Data analysis
Now that we have our data ready, let's analyze it and understand its trend in detail.
To do that, we first describe the data:
# Describe data
df_sal.describe()
Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237.
We can also inspect how the data is distributed visually using the Seaborn distplot:
# Data distribution
plt.title('Salary Distribution Plot')
sns.distplot(df_sal['Salary'])
plt.show()
A distplot, or distribution plot, shows the variation in the data distribution.
It represents the data by combining a kernel density line with a histogram.
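In recent Seaborn releases, distplot is deprecated; if your installed version no longer provides it (an assumption about your environment), an equivalent plot is:
# Histogram with a kernel density estimate (modern replacement for distplot)
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)
plt.show()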
Then we check the relationship between Salary and Experience:
# Relationship between Salary and Experience
plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color='lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
It is now clearly visible that our data varies linearly: an individual
receives more Salary as they gain Experience.
Step 4: Split the dataset into dependent/independent variables
Experience (X) is the independent variable
Salary (y) is dependent on experience
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
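Note that iloc[:, :1] keeps X two-dimensional (a DataFrame rather than a Series); scikit-learn estimators expect the feature matrix to have shape (n_samples, n_features).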
Step 5: Split data into Train/Test sets
Further, split your data into training (80%) and test (20%) sets
using train_test_split
# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 0)
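Setting random_state = 0 fixes the shuffle seed, so the same 80/20 split is reproduced on every run.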
Step 6: Train the regression model
Pass the X_train and y_train data to regressor.fit to
train the model on our training data.
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 7: Predict the result
Here comes the interesting part: we are all set to predict any value
of y (Salary) from X (Experience) with the trained model,
using regressor.predict.
# Prediction result
y_pred_test = regressor.predict(X_test)    # predicted values for X_test
y_pred_train = regressor.predict(X_train)  # predicted values for X_train
Step 8: Plot the training and test results
It's time to check our predicted results by plotting graphs.
• Plot training set data vs predictions
First, we plot the training set data (X_train, y_train) together with the line
through X_train and the predicted values of y_train (regressor.predict(X_train)).
# Prediction on training set
plt.scatter(X_train, y_train, color='lightcoral', label='X_train/y_train')
plt.plot(X_train, y_pred_train, color='firebrick', label='X_train/y_pred_train')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title='Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
• Plot test set data vs predictions
Secondly, we plot the test set data (X_test, y_test) together with the same
regression line obtained from the training data (X_train against
regressor.predict(X_train)).
# Prediction on test set
plt.scatter(X_test, y_test, color='lightcoral', label='X_test/y_test')
plt.plot(X_train, y_pred_train, color='firebrick', label='X_train/y_pred_train')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title='Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
We can see that, in both plots, the regression line fits the train and test data.
You can also plot the results with the predicted values of y_test
(regressor.predict(X_test)), but the regression line would remain the same, as
it is generated from the same linear equation fitted to the same
training data.
If you remember from the beginning of this experiment, we discussed the linear
equation y = mx + c; we can also get c (the y-intercept)
and m (the slope/coefficient) from the regressor model.
# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
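Beyond the coefficients, a quick way to quantify the fit is the R² score via regressor.score (a small addition to the walkthrough above; values closer to 1 indicate a better fit):
# R-squared of the trained model on the held-out test set
print(f'R^2 on test data: {regressor.score(X_test, y_test)}')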
The fully implemented code:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Get dataset (adjust the path to your local copy)
df_sal = pd.read_csv('D:/datasets/Salary_Data.csv')
df_sal.head()
# Describe data
df_sal.describe()
# Data distribution
plt.title('Salary Distribution Plot')
sns.distplot(df_sal['Salary'])
plt.show()
# Relationship between Salary and Experience
plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Prediction result
y_pred_test = regressor.predict(X_test) # predicted value of y_test
y_pred_train = regressor.predict(X_train) # predicted value of y_train
# Prediction on training set
plt.scatter(X_train, y_train, color='lightcoral', label='X_train/y_train')
plt.plot(X_train, y_pred_train, color='firebrick', label='X_train/y_pred_train')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title='Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
# Prediction on test set
plt.scatter(X_test, y_test, color='lightcoral', label='X_test/y_test')
plt.plot(X_train, y_pred_train, color='firebrick', label='X_train/y_pred_train')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(title='Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
TASK:
Implement linear regression on one of the datasets below:
1. Cancer linear regression
2. CDC data: nutrition, physical activity, obesity
3. Fish market dataset for regression
4. Medical insurance costs
5. New York Stock Exchange dataset
6. OLS regression challenge
7. Real estate price prediction
8. Red wine quality
9. Vehicle dataset from CarDekho
10. WHO statistics on life expectancy