Exercise 6 : Clusters and Anomalies

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# %matplotlib inline is a "magic" command that makes the figures
# generated by matplotlib appear immediately below the cell, inside
# the notebook, rather than in a separate output window.
# It can be omitted in recent versions of Jupyter Notebook, since
# "inline" is the default backend there.
%matplotlib inline

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]
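
Beyond head, a couple of quick checks can confirm what was loaded. This is a small optional sketch; it only assumes the houseData frame created above.

# Optional sanity checks (sketch): dataset dimensions and the dtypes of
# the two columns used in this exercise.
print(houseData.shape)
print(houseData[['GrLivArea', 'GarageArea']].dtypes)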

Problem 1 : Clustering by Gr Living Area and Garage Area


Extract the required variables from the dataset, and then perform Bi-Variate Clustering.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfad8709a0>

Basic KMeans Clustering
Guess the number of clusters from the 2D plot, and perform KMeans Clustering.
We will use the KMeans clustering model from sklearn.cluster module.

# Import KMeans from sklearn.cluster


import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "sklearn")
from sklearn.cluster import KMeans

# Guess the Number of Clusters


num_clust = 3

# Create Clustering Model using KMeans


kmeans = KMeans(n_clusters = num_clust, n_init = 10) # n_init = 10 : run 10 random initializations and keep the best

# Fit the Clustering Model on the Data


kmeans.fit(X)

KMeans(n_clusters=3)
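
Note that KMeans starts from random initial centers, so cluster numbering (and occasionally the clusters themselves) can differ between runs. If reproducibility matters, a fixed random_state can be passed; a minimal optional sketch (42 is an arbitrary seed):

# Sketch: fix the random seed so repeated runs give identical clusters.
kmeans = KMeans(n_clusters = num_clust, n_init = 10, random_state = 42)
kmeans.fit(X)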

Print the Cluster Centers as Co-ordinates of Features

# Print the Cluster Centers


print("Features", "\tLiving", "\tGarage")
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()

Features     Living     Garage

Cluster 0:   2570.17    678.3
Cluster 1:   1086.74    375.3
Cluster 2:   1696.92    522.68

Labeling the Clusters in the Data


We may use the model on the data to predict the clusters.

# Predict the Cluster Labels


labels = kmeans.predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels


sb.countplot(x=X_labeled["Cluster"])

<AxesSubplot: xlabel='Cluster', ylabel='count'>


# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
sc = plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Cluster",
                 cmap = 'viridis', data = X_labeled)
legend = axes.legend(*sc.legend_elements(), loc = "lower right",
                     title = "Classes")

Within Cluster Sum of Squares

WithinSS = 0 : Every data point is a cluster on its own
WithinSS = Total Sum of Squares : Whole dataset is a single cluster

# Print the Within Cluster Sum of Squares


print("Within Cluster Sum of Squares :", kmeans.inertia_)

Within Cluster Sum of Squares : 140805418.54656714
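
As a sanity check, the same quantity can be recomputed by hand from the fitted labels and centers; a minimal sketch, assuming the kmeans model fitted above:

# Recompute WithinSS manually: sum of squared distances from each point
# to the center of its assigned cluster.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
within_ss = ((X.to_numpy() - assigned_centers) ** 2).sum()
print("Manual Within Cluster Sum of Squares :", within_ss)
# Should match kmeans.inertia_ up to floating-point error.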

Discuss : Is this the optimal clustering that you will be happy with? If not, try changing num_clust; the elbow sketch below is one way to guide the choice.
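
One common way to pick num_clust is the "elbow" heuristic: fit KMeans for a range of cluster counts and look for the point where WithinSS stops dropping sharply. A minimal sketch (the range 1 to 10 is an arbitrary choice):

# Elbow heuristic: WithinSS always decreases as clusters are added;
# look for the "elbow" where the improvement levels off.
inertias = []
cluster_range = range(1, 11)
for k in cluster_range:
    km = KMeans(n_clusters = k, n_init = 10)
    km.fit(X)
    inertias.append(km.inertia_)

f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.plot(list(cluster_range), inertias, marker = 'o')
plt.xlabel("Number of Clusters")
plt.ylabel("Within Cluster Sum of Squares")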
Anomaly Detection for the Dataset
Extract the required variables from the dataset, and then perform Bi-Variate Anomaly Detection.

# Extract the Features from the Data


X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])

# Plot the Raw Data on a 2D grid


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)

<matplotlib.collections.PathCollection at 0x1dfaf326910>

Basic Anomaly Detection

Use the Local Outlier Factor (LOF), a Nearest Neighbors (k-NN) based method, for detecting Outliers and Anomalies.
We will use the LocalOutlierFactor neighborhood model from the sklearn.neighbors module.

# Import LocalOutlierFactor from sklearn.neighbors


from sklearn.neighbors import LocalOutlierFactor

# Set the Parameters for Neighborhood


num_neighbors = 20 # Number of Neighbors
cont_fraction = 0.05 # Fraction of Anomalies
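
Since contamination directly sets the fraction of points that will be flagged, the expected number of anomalies can be estimated up front; a small sketch:

# Rough check (sketch): LOF will label about cont_fraction of all rows
# as anomalies, so the expected count is approximately:
print("Expected number of anomalies :", int(cont_fraction * len(X)))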

# Create Anomaly Detection Model using LocalOutlierFactor


lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data
lof.fit(X)

LocalOutlierFactor(contamination=0.05)

Labeling the Anomalies in the Data


We may use the model on the data to predict the anomalies.

# Predict the Anomalies
# fit_predict returns -1 for anomalies and 1 for normal points


labels = lof.fit_predict(X)

# Append Labels to the Data


X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels


sb.countplot(x=X_labeled["Anomaly"])

<AxesSubplot: xlabel='Anomaly', ylabel='count'>
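
The same summary in numbers; a small sketch using value_counts (remember that -1 marks anomalies, 1 marks inliers):

# Numeric counts of the anomaly labels.
print(X_labeled["Anomaly"].value_counts())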

# Visualize the Anomalies in the Data


f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Anomaly",
            cmap = 'viridis', data = X_labeled)

<matplotlib.collections.PathCollection at 0x1dfb0c12e50>

Discuss : Is this the optimal anomaly detection that you will be happy with? If not, try changing the parameters; the score histogram sketch below is one way to guide the choice.
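
One way to judge the parameters is to look at the raw LOF scores directly. After fitting, the model exposes negative_outlier_factor_, where more strongly negative values indicate stronger outliers; a minimal sketch:

# Histogram of LOF scores (sketch): a long left tail suggests clear
# outliers; the contamination threshold cuts this distribution.
scores = lof.negative_outlier_factor_
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.hist(scores, bins = 50)
plt.xlabel("Negative Outlier Factor")
plt.ylabel("Count")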
