Lab 3 Machine Learning
Nikhilesh Prabhakar
16BCE1158
Datasets used
Apart from the usual House Prices dataset that were used in previous
lab submissions, the dataset that I had worked on for this one was
the one presented in Sebastian Raschka’s “Python Machine
Learning”.
Link: https://www.kaggle.com/c/house-prices-advanced-regression-
techniques/data. There are 79 variables describing almost every
aspect of residential homes for sale in Iowa (at the time of collecting
data).
Link for the second dataset:
https://archive.ics.uci.edu/ml/machine-
learningdatabases/housing/housing.data
Methodology
Step 1: Importing the required Packages
import pandas as pd
import matplotlib as plot
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn as sb
Step 2: Cleaning the data
This was covered in Lab 1’s submission. Additionally all string
columns were converted to integer by using Panda’s function
get_dummies() which separates all unique string values by making
them separate columns
Step 3: Using a heatmap
The Pearson’s correlation coefficient is figured out between Sale
Price and the other columns in the dataset. The columns that had
absolute correlation above 0.5 were plotted on the map as shown.
corrmat = fdata.corr()
top_corr_features =
corrmat.index[abs(corrmat["SalePrice"])>0.50]
plt.figure(figsize=(10,10))
g =
sb.heatmap(fdata[top_corr_features].corr(),annot=True)
#Sale Price is most correlated with OverallQual,
GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF
Step 4: The Linear Regression Model
There are 3 methods covered in this lecture to for finding out the
Linear Regression
Method 1: OLS (Ordinary Least Squares)
from statsmodels.formula.api import ols
from IPython.display import HTML, display
housing_model = ols("SalePrice ~ OverallQual",
data=fdata).fit()
Adj. R-squared indicates that 62% of housing prices can be
explained by our predictor variable
The standard error measures the accuracy of OverallQual’s
coefficient by estimating the variation of the coefficient if the same
test were run on a different sample of our population. Our standard
error, 5756.407, is extremely and therefore a linear model doesn’t
suit our data.
A lot more was tried using OLS as can be seen in the python
notebook, even multiple regression’s statistical data can be viewed
through this.
Method 2: sklearn’s Linear Regression Function
from sklearn.linear_model import LinearRegression
X = fdata[["OverallQual"]]
Y = fdata[["SalePrice"]]
clf = LinearRegression()
clf.fit(X,Y)
Out[174]: LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=1, normalize=False)
data_test = pd.read_csv("train.csv")
X_test = data_test[["OverallQual"]]
Y_test = data_test[["SalePrice"]]
clf.score(X_test,Y_test)
Out[179]: 0.62544678976769652
Similar to the Adj R-squared value seen in OLS
Y_pred = clf.predict(X_test)
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, Y_pred, color='blue', linewidth=3)
#OverallQual is a discrete data
Same code was tried out with another attribute “GrLivArea” and this
was formed
Method 2: User-defined function given in Sebastian Raschka’s
book
class LinearRegressionGD(object):
def __init__(self, eta=0.001, n_iter=20):
self.eta = eta
self.n_iter = n_iter
def fit(self, X, y):
self.w_ = np.zeros(1 + X.shape[1])
self.cost_ = []
for i in range(self.n_iter):
output = self.net_input(X)
errors = (y - output)
b = self.eta * X.T.dot(errors)
self.w_[1:] += b
self.w_[0] += self.eta * errors.sum()
cost = (errors**2).sum() / 2.0
self.cost_.append(cost)
return self
def net_input(self, X):
return np.dot(X, self.w_[1:]) + self.w_[0]
def predict(self, X):
return self.net_input(X)
def lin_regplot(X, y, model):
plt.scatter(X, y, c='blue')
plt.plot(X, model.predict(X), color='red')
return None
lin_regplot(X_std, y_std, lr)
plt.xlabel('Average number of rooms [RM]
(standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.show()
slr = LinearRegression()
slr.fit(X, y)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
Slope: 107.130
Intercept: 18569.026
Similar was done with “GrLivArea"
The same code was tried out with the example given in the textbook.
A python notebook along with snippets down below is provided for
proof of execution.