Exercise 6 : Clusters and Anomalies
Essential Libraries
Let us begin by importing the essential Python Libraries.
NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization
# %matplotlib inline is a magic command that makes the plots generated
# by matplotlib appear inline, immediately below the code cell, instead
# of in a separate output window.
# This can be omitted in recent versions of Jupyter Notebook, where
# "inline" is the default backend.
%matplotlib inline
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
Setup : Import the Dataset
Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition SalePrice
0    2008       WD        Normal    208500
1    2007       WD        Normal    181500
2    2008       WD        Normal    223500
3    2006       WD       Abnorml    140000
4    2008       WD        Normal    250000

[5 rows x 81 columns]
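Beyond head, it may help to confirm the dataset's size and check the two features we will use below for missing values. A minimal sketch:
# Quick sanity checks on the dataset (a sketch)
print(houseData.shape)                                     # (rows, columns); train.csv has 81 columns
print(houseData[['GrLivArea','GarageArea']].describe())    # ranges of the two features used below
print(houseData[['GrLivArea','GarageArea']].isna().sum())  # confirm there are no missing values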
Problem 1 : Clustering by GrLivArea and GarageArea
Extract the required variables from the dataset, and then perform Bi-Variate Clustering.
# Extract the Features from the Data
X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])
# Plot the Raw Data on a 2D grid
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)
[Scatter plot of GrLivArea against GarageArea]
Basic KMeans Clustering
Guess the number of clusters from the 2D plot, and perform KMeans Clustering.
We will use the KMeans clustering model from the sklearn.cluster module.
# Import KMeans from sklearn.cluster
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
from sklearn.cluster import KMeans
# Guess the Number of Clusters
num_clust = 3
# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust,n_init=10)
# Fit the Clustering Model on the Data
kmeans.fit(X)
KMeans(n_clusters=3)
Print the Cluster Centers as Co-ordinates of Features
# Print the Cluster Centers
print("Features", "\tLiving", "\tGarage")
print()
for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
Features    Living     Garage

Cluster 0:  2570.17    678.3
Cluster 1:  1086.74    375.3
Cluster 2:  1696.92    522.68
Labeling the Clusters in the Data
We may use the model on the data to predict the clusters.
# Predict the Cluster Labels
labels = kmeans.predict(X)
# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)
# Summary of the Cluster Labels
sb.countplot(x=X_labeled["Cluster"])
[Count plot showing the number of points in each cluster]
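Beyond the counts, the per-cluster feature means can be checked against the cluster centers printed earlier; a minimal sketch:
# Mean of each Feature within each Cluster (a sketch; should echo the cluster centers)
print(X_labeled.groupby("Cluster", observed=True)[["GrLivArea", "GarageArea"]].mean())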
# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
sc = plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Cluster", cmap = 'viridis', data = X_labeled)
legend = axes.legend(*sc.legend_elements(), loc="lower right", title="Cluster")
Within Cluster Sum of Squares
WithinSS = 0 : Every data point is a cluster on its own
WithinSS = Total Sum of Squares : Whole dataset is a single cluster
# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
Within Cluster Sum of Squares : 140805418.54656714
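As a sanity check, the reported inertia can be recomputed directly from the fitted labels and centers; a minimal sketch, assuming kmeans has been fitted as above:
# Recompute the Within Cluster Sum of Squares manually (a sketch; assumes kmeans is fitted)
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]    # center of each point's cluster
manual_ss = ((X.values - assigned_centers) ** 2).sum()        # squared distances to own center
print("Manual WithinSS :", manual_ss)                         # should match kmeans.inertia_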
Discuss : Is this the optimal clustering that you will be happy with? If not, try changing num_clust; one heuristic for choosing it is sketched below.
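One common heuristic for choosing num_clust is the elbow method: fit KMeans over a range of cluster counts and look for the point where the Within Cluster Sum of Squares stops dropping sharply. A minimal sketch (the range 1 to 10 is an arbitrary choice):
# Elbow method : Within Cluster Sum of Squares against number of clusters (a sketch)
inertias = []
cluster_range = range(1, 11)                 # arbitrary range of candidate cluster counts
for k in cluster_range:
    km = KMeans(n_clusters = k, n_init = 10)
    km.fit(X)
    inertias.append(km.inertia_)

f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.plot(cluster_range, inertias, marker = 'o')
plt.xlabel("Number of Clusters")
plt.ylabel("Within Cluster Sum of Squares")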
Anomaly Detection for the Dataset
Extract the required variables from the dataset, and then perform Bi-Variate Anomaly Detection.
# Extract the Features from the Data
X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])
# Plot the Raw Data on a 2D grid
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)
[Scatter plot of GrLivArea against GarageArea]
Basic Anomaly Detection
Use the Nearest Neighbors (k-NN) approach for detecting outliers and anomalies.
We will use the LocalOutlierFactor neighborhood model from the sklearn.neighbors module.
# Import LocalOutlierFactor from sklearn.neighbors
from sklearn.neighbors import LocalOutlierFactor
# Set the Parameters for Neighborhood
num_neighbors = 20 # Number of Neighbors
cont_fraction = 0.05 # Fraction of Anomalies
# Create Anomaly Detection Model using LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)
# Fit the Model on the Data
lof.fit(X)
LocalOutlierFactor(contamination=0.05)
Labeling the Anomalies in the Data
In its default mode, LocalOutlierFactor does not offer a separate predict method, so we use fit_predict to label the data: -1 marks an anomaly and 1 marks an inlier.
# Predict the Anomaly Labels (-1 : anomaly, 1 : inlier)
labels = lof.fit_predict(X)
# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)
# Summary of the Anomaly Labels
sb.countplot(x=X_labeled["Anomaly"])
[Count plot showing the number of inliers (1) and anomalies (-1)]
# Visualize the Anomalies in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Anomaly", cmap = 'viridis', data = X_labeled)
[Scatter plot of GrLivArea against GarageArea with anomalies highlighted]
Discuss : Is this the optimal anomaly detection that you will be happy with? If not, try changing the parameters num_neighbors and cont_fraction.
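To judge how sensitive the results are to these parameters, it may help to inspect the raw outlier scores rather than only the binary labels. A minimal sketch, assuming lof has been fitted as above (negative_outlier_factor_ is close to -1 for inliers and much lower for strong outliers):
# Inspect the Local Outlier Factor scores (a sketch; assumes lof is fitted)
scores = lof.negative_outlier_factor_        # close to -1 for inliers, lower for outliers
f, axes = plt.subplots(1, 1, figsize=(16,8))
sb.histplot(x = scores, bins = 50)
plt.xlabel("Negative Outlier Factor")
# Indices of the 5 most anomalous points (most negative scores)
print("Most anomalous points :", scores.argsort()[:5])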