Exercise 6 : Clusters and Anomalies
Essential Libraries
Let us begin by importing the essential Python Libraries.
NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization
# %matplotlib inline is a magic command that makes the plots generated
# by matplotlib appear inline, immediately below the code cell, instead
# of in a separate output window.
# This can be omitted in recent versions of Jupyter Notebook, where
# "inline" is the default backend.
%matplotlib inline
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
Setup : Import the Dataset
Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
houseData = pd.read_csv('train.csv')
houseData.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12

   YrSold SaleType SaleCondition SalePrice
0    2008       WD        Normal    208500
1    2007       WD        Normal    181500
2    2008       WD        Normal    223500
3    2006       WD       Abnorml    140000
4    2008       WD        Normal    250000

[5 rows x 81 columns]
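Beyond head, it may help to confirm the dataset's size and check the two features we will use below for missing values. A minimal sketch:
# Quick sanity checks on the dataset (a sketch)
print(houseData.shape)                                     # (rows, columns); train.csv has 81 columns
print(houseData[['GrLivArea','GarageArea']].describe())    # ranges of the two features used below
print(houseData[['GrLivArea','GarageArea']].isna().sum())  # confirm there are no missing values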
Problem 1 : Clustering by GrLivArea and GarageArea
Extract the required variables from the dataset, and then perform Bi-Variate Clustering.
# Extract the Features from the Data
X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])
# Plot the Raw Data on a 2D grid
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)
[Scatter plot of GrLivArea against GarageArea]
Basic KMeans Clustering
Guess the number of clusters from the 2D plot, and perform KMeans Clustering.
We will use the KMeans clustering model from the sklearn.cluster module.
# Import KMeans from sklearn.cluster
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
from sklearn.cluster import KMeans
# Guess the Number of Clusters
num_clust = 3
# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust,n_init=10)
# Fit the Clustering Model on the Data
kmeans.fit(X)
KMeans(n_clusters=3)
Print the Cluster Centers as Co-ordinates of Features
# Print the Cluster Centers
print("Features", "\tLiving", "\tGarage")
print()
for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
Features    Living     Garage

Cluster 0:  2570.17    678.3
Cluster 1:  1086.74    375.3
Cluster 2:  1696.92    522.68
Labeling the Clusters in the Data
We may use the model on the data to predict the clusters.
# Predict the Cluster Labels
labels = kmeans.predict(X)
# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)
# Summary of the Cluster Labels
sb.countplot(x=X_labeled["Cluster"])
[Count plot showing the number of points in each cluster]
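Beyond the counts, the per-cluster feature means can be checked against the cluster centers printed earlier; a minimal sketch:
# Mean of each Feature within each Cluster (a sketch; should echo the cluster centers)
print(X_labeled.groupby("Cluster", observed=True)[["GrLivArea", "GarageArea"]].mean())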
# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
sc = plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Cluster", cmap = 'viridis', data = X_labeled)
legend = axes.legend(*sc.legend_elements(), loc="lower right", title="Cluster")
Within Cluster Sum of Squares
WithinSS = 0 : Every data point is a cluster on its own
WithinSS = Total Sum of Squares : Whole dataset is a single cluster
# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
Within Cluster Sum of Squares : 140805418.54656714
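As a sanity check, the reported inertia can be recomputed directly from the fitted labels and centers; a minimal sketch, assuming kmeans has been fitted as above:
# Recompute the Within Cluster Sum of Squares manually (a sketch; assumes kmeans is fitted)
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]    # center of each point's cluster
manual_ss = ((X.values - assigned_centers) ** 2).sum()        # squared distances to own center
print("Manual WithinSS :", manual_ss)                         # should match kmeans.inertia_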
Discuss : Is this the optimal clustering that you will be happy with? If not, try changing num_clust; one heuristic for choosing it is sketched below.
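One common heuristic for choosing num_clust is the elbow method: fit KMeans over a range of cluster counts and look for the point where the Within Cluster Sum of Squares stops dropping sharply. A minimal sketch (the range 1 to 10 is an arbitrary choice):
# Elbow method : Within Cluster Sum of Squares against number of clusters (a sketch)
inertias = []
cluster_range = range(1, 11)                 # arbitrary range of candidate cluster counts
for k in cluster_range:
    km = KMeans(n_clusters = k, n_init = 10)
    km.fit(X)
    inertias.append(km.inertia_)

f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.plot(cluster_range, inertias, marker = 'o')
plt.xlabel("Number of Clusters")
plt.ylabel("Within Cluster Sum of Squares")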
Anomaly Detection for the Dataset
Extract the required variables from the dataset, and then perform Bi-Variate Anomaly Detection.
# Extract the Features from the Data
X = pd.DataFrame(houseData[['GrLivArea','GarageArea']])
# Plot the Raw Data on a 2D grid
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", data = X)
[Scatter plot of GrLivArea against GarageArea]
Basic Anomaly Detection
Use the Nearest Neighbors (k-NN) approach for detecting outliers and anomalies.
We will use the LocalOutlierFactor neighborhood model from the sklearn.neighbors module.
# Import LocalOutlierFactor from sklearn.neighbors
from sklearn.neighbors import LocalOutlierFactor
# Set the Parameters for Neighborhood
num_neighbors = 20 # Number of Neighbors
cont_fraction = 0.05 # Fraction of Anomalies
# Create Anomaly Detection Model using LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)
# Fit the Model on the Data
lof.fit(X)
LocalOutlierFactor(contamination=0.05)
Labeling the Anomalies in the Data
In its default mode, LocalOutlierFactor does not offer a separate predict method, so we use fit_predict to label the data: -1 marks an anomaly and 1 marks an inlier.
# Predict the Anomaly Labels (-1 : anomaly, 1 : inlier)
labels = lof.fit_predict(X)
# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)
# Summary of the Anomaly Labels
sb.countplot(x=X_labeled["Anomaly"])
[Count plot showing the number of inliers (1) and anomalies (-1)]
# Visualize the Anomalies in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "GrLivArea", y = "GarageArea", c = "Anomaly", cmap = 'viridis', data = X_labeled)
[Scatter plot of GrLivArea against GarageArea with anomalies highlighted]
Discuss : Is this the optimal anomaly detection that you will be happy with? If not, try changing the parameters num_neighbors and cont_fraction.
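To judge how sensitive the results are to these parameters, it may help to inspect the raw outlier scores rather than only the binary labels. A minimal sketch, assuming lof has been fitted as above (negative_outlier_factor_ is close to -1 for inliers and much lower for strong outliers):
# Inspect the Local Outlier Factor scores (a sketch; assumes lof is fitted)
scores = lof.negative_outlier_factor_        # close to -1 for inliers, lower for outliers
f, axes = plt.subplots(1, 1, figsize=(16,8))
sb.histplot(x = scores, bins = 50)
plt.xlabel("Negative Outlier Factor")
# Indices of the 5 most anomalous points (most negative scores)
print("Most anomalous points :", scores.argsort()[:5])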