Farah Jahangir
Introduction to Data Mining
Project Assignment
Task 1
Part A
Code:
1. Combining the Dataset
import os
import pandas as pd
import glob

# Folder path containing your CSV files
folder_path = '/content/drive/MyDrive/dataset'

# Use glob to find all CSV files in the folder
csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

# List to hold all dataframes
df_list = []

# Loop through each CSV file and load it into a DataFrame
for file in csv_files:
    df = pd.read_csv(file)
    df_list.append(df)

# Combine all DataFrames into one
combined_data = pd.concat(df_list, ignore_index=True)

# Check the first few rows of the combined data
print(combined_data.head())
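As an optional sanity check (a sketch, not part of the pipeline above), the number of rows in the combined DataFrame can be compared against the total row count of the individual files; it assumes the csv_files, df_list, and combined_data variables defined above.

# Sketch: verify that no rows were lost or duplicated while combining
expected_rows = sum(len(d) for d in df_list)
print(f"Files read: {len(csv_files)}, expected rows: {expected_rows}, "
      f"combined rows: {len(combined_data)}")
assert len(combined_data) == expected_rows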
2. Dropping Irrelevant Columns
from sklearn.preprocessing import StandardScaler
# Drop the date column and the 9am / target columns not used for clustering
data = combined_data.drop(columns=['date', 'humidity9am', 'pressure9am', 'temp9am',
                                   'rain_today', 'rain_tomorrow', 'wind_speed9am', 'cloud9am'])
print(data.head())
3. Mapping Values
import pandas as pd
from sklearn.preprocessing import StandardScaler
# The loaded DataFrame is expected to contain the cloud cover column (named 'cloud3pm')
# Mapping of cloud cover categories to numerical values (0 to 16)
cloud_cover_mapping = {
    'Fair / Windy': 0, 'Partly Cloudy': 1, 'Partly Cloudy / Windy': 2, 'Cloudy': 3,
    'Cloudy / Windy': 4, 'Mostly Cloudy': 5, 'Mostly Cloudy / Windy': 6, 'Fog': 7,
    'Haze': 8, 'Light Rain': 9, 'Light Rain with Thunder': 10, 'Thunder': 11,
    'Rain': 12, 'Thunder / Windy': 13, 'Heavy T-Storm': 14, 'Thunder in the Vicinity': 15, 'TStorm': 16
}
# Load your dataset (replace with your actual file path)
df = pd.read_csv('/content/scaled_weather_data.csv')
# Map the 'cloud3pm' column to numerical values using the mapping
df['cloud_cover'] = df['cloud3pm'].map(cloud_cover_mapping)
# Drop the original 'cloud3pm' column with string values
df = df.drop(columns=['cloud3pm'])
# Save the modified DataFrame into a new CSV file
df.to_csv('new_weather_data.csv', index=False)
# Confirm that the data has been saved
print("Cloud cover has been mapped and saved to 'new_weather_data.csv'.")
Forming clusters:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Load the dataset
df = pd.read_csv('/content/modified_weather_data.csv')  # Replace with the actual path to your CSV file
# Step 2: Replace 'Blank' values with NaN for numerical columns
df.replace('Blank', np.nan, inplace=True)
# Step 3: Convert all columns to numeric, coercing any non-numeric data to NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Step 4: Impute missing values with the column medians (as before)
df = df.fillna(df.median())
# Step 5: Select only numeric columns for clustering
numeric_data = df.select_dtypes(include=[np.number])
# Step 6: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)
# Step 7: Apply K-means clustering (k=3)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(scaled_data)
# Step 8: Get the cluster labels
df['cluster'] = kmeans.labels_
# Step 9: Report the centroids of the clusters
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=numeric_data.columns)
print("Centroids of the clusters:")
print(centroids)
# Step 10: Visualize the clusters using boxplots for selected attributes
selected_columns = ['min_temp', 'max_temp', 'rainfall', 'humidity3pm', 'wind_speed3pm', 'pressure3pm']
plt.figure(figsize=(15, 10))
for i, column in enumerate(selected_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x='cluster', y=column, data=df)
    plt.title(f'Boxplot of {column} by Cluster')
plt.tight_layout()
plt.show()
# Step 11: Visualize the clusters using a scatter plot (2D projection)
# First, reduce the data to 2D for visualization using PCA (Principal Component Analysis)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Scatter plot of clusters in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=df['cluster'], cmap='viridis', s=50)
plt.title('K-means Clustering (2D PCA projection)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
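The value k=3 is fixed in the code above; as an optional check (a sketch, not part of the submitted pipeline), candidate values of k can be compared using the inertia (elbow method) and the silhouette score. It assumes the scaled_data array produced above.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sketch: compare candidate k values by inertia and silhouette score
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(scaled_data)
    print(f"k={k}: inertia={km.inertia_:.1f}, "
          f"silhouette={silhouette_score(scaled_data, labels):.3f}")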
Results:
K-Means Clustering with K=3
Box plots:
Part B
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics.cluster import pair_confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
# Step 1: Load and preprocess the dataset
df = pd.read_csv('/content/modified_weather_data.csv')  # Replace with the actual path to your CSV file
# Step 2: Replace 'Blank' values with NaN for numerical columns
df.replace('Blank', np.nan, inplace=True)
# Step 3: Convert all columns to numeric, coercing any non-numeric data to NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Step 4: Impute missing values with the column medians
df = df.fillna(df.median())
# Step 5: Select only numeric columns for clustering
numeric_data = df.select_dtypes(include=[np.number])
# Step 6: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)
# Step 7: Apply DBSCAN clustering
# We try different values of eps and min_samples to get between 2 and 15 clusters
# with less than 20% outliers.
# Best configuration for DBSCAN (after tuning) -- adjust eps and min_samples
dbscan = DBSCAN(eps=0.5, min_samples=5)  # You can adjust these parameters as needed
dbscan.fit(scaled_data)
# Add cluster labels to the dataframe
df['dbscan_cluster'] = dbscan.labels_
# Identify the number of outliers (labeled as -1 in DBSCAN)
outliers = np.sum(df['dbscan_cluster'] == -1)
total_points = len(df)
outlier_percentage = outliers / total_points * 100
print(f"Outlier percentage in DBSCAN: {outlier_percentage:.2f}%")
# Check if outliers are below 20% (target condition)
if outlier_percentage > 20:
    print("Outliers exceed 20%, adjusting DBSCAN parameters.")
else:
    print("Outliers are below 20%, proceed to the next steps.")
# Step 8: Visualize the DBSCAN clusters using a scatter plot (2D PCA projection)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Scatter plot of DBSCAN clusters
plt.figure(figsize=(8, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=df['dbscan_cluster'], cmap='viridis', s=50)
plt.title('DBSCAN Clustering (2D PCA projection)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
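The eps and min_samples values above were chosen by manual tuning. The small sweep below is an optional sketch (not the tuning actually used) for finding configurations that meet the target of 2 to 15 clusters with under 20% outliers; it assumes the scaled_data array from above, and the candidate ranges are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

# Sketch: grid-search eps and min_samples for the cluster/outlier targets
for eps in [0.3, 0.5, 0.7, 1.0, 1.5]:
    for min_samples in [3, 5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled_data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_pct = np.mean(labels == -1) * 100
        if 2 <= n_clusters <= 15 and noise_pct < 20:
            print(f"eps={eps}, min_samples={min_samples}: "
                  f"{n_clusters} clusters, {noise_pct:.1f}% outliers")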
Results:
Visualization of DBSCAN Clustering Algorithm
Task 2
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
# Step 1: Load the dataset
df = pd.read_csv('/content/modified_weather_data.csv')  # Replace with the actual file path
# Step 2: Preprocess the data
df.replace('Blank', np.nan, inplace=True)  # Handle missing values
df = df.apply(pd.to_numeric, errors='coerce')  # Convert all columns to numeric
df.fillna(df.median(), inplace=True)  # Fill missing values with the median
# Select relevant columns
features = ['min_temp', 'max_temp', 'rainfall', 'wind_speed3pm', 'humidity3pm', 'pressure3pm', 'cloud_cover']
data = df[features]
# Step 3: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# --- Distance-based Outlier Detection ---
def calculate_distance_outlier_scores(data, threshold=2):
    # Calculate pairwise distances using Euclidean distance
    distances = cdist(data, data, metric='euclidean')
    # Calculate the mean distance for each point
    mean_distances = distances.mean(axis=1)
    # Outlier scores based on the mean distances, normalized to the range [0, 1]
    outlier_scores = mean_distances / mean_distances.max()
    return outlier_scores

# --- Density-based Outlier Detection ---
def calculate_density_outlier_scores(data, bandwidth=0.5):
    # Use KernelDensity to estimate density
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(data)
    # Get the log of the density for each point
    log_density = kde.score_samples(data)
    # Convert to outlier scores (higher log_density = less likely to be an outlier);
    # scaled so the densest (most normal) point gets a score of 1
    outlier_scores = -log_density / -log_density.max()
    return outlier_scores

# Step 4: Calculate OLS for both methods
distance_outlier_scores = calculate_distance_outlier_scores(scaled_data)
density_outlier_scores = calculate_density_outlier_scores(scaled_data)
# Add OLS to the dataframe
df['distance_OLS'] = distance_outlier_scores
df['density_OLS'] = density_outlier_scores
# Step 5: Sort the dataset by OLS scores and analyze the top/bottom examples
df_sorted_distance = df.sort_values(by='distance_OLS', ascending=False)
df_sorted_density = df.sort_values(by='density_OLS', ascending=False)
# Top 3 likely outliers
print("Top 3 outliers based on distance-based OLS:")
print(df_sorted_distance.head(3))
print("Top 3 outliers based on density-based OLS:")
print(df_sorted_density.head(3))
# Bottom example (most normal)
print("Most normal (bottom) based on distance-based OLS:")
print(df_sorted_distance.tail(1))
print("Most normal (bottom) based on density-based OLS:")
print(df_sorted_density.tail(1))
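A common refinement of the mean-distance score is to use the distance to the k-th nearest neighbour, which is less dominated by the overall spread of the data. The sketch below is an optional alternative (not part of the graded code); it assumes scaled_data and df from above, and k=5 is an arbitrary illustrative choice.

from sklearn.neighbors import NearestNeighbors

# Sketch: distance to the k-th nearest neighbour as an alternative distance-based OLS
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(scaled_data)  # +1 because each point is its own nearest neighbour
knn_distances, _ = nn.kneighbors(scaled_data)
knn_ols = knn_distances[:, k] / knn_distances[:, k].max()  # normalize to [0, 1]
df['knn_distance_OLS'] = knn_ols
print(df.sort_values('knn_distance_OLS', ascending=False).head(3))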
Results of the Outlier Detection Techniques
Top 3 outliers based on distance-based OLS:
index  min_temp  max_temp  rainfall  wind_speed3pm  humidity3pm  pressure3pm  temp3pm  cloud_cover  distance_OLS  density_OLS
 2114      73.0      83.0      20.6           20.0         94.0        29.39     75.0          4.0      1.000000     2.000083
 1212      64.0      74.0      18.8            6.0         82.0        29.97     74.0          4.0      0.889223     1.988138
 3470      73.0      78.0      18.2            7.0         96.0        29.99     75.0          4.0      0.866567     1.988138

Top 3 outliers based on density-based OLS:
index  min_temp  max_temp  rainfall  wind_speed3pm  humidity3pm  pressure3pm  temp3pm  cloud_cover  distance_OLS  density_OLS
 2114      73.0      83.0      20.6           20.0         94.0        29.39     75.0          4.0      1.000000     2.000083
 3024       0.0      79.0       1.2            7.0         88.0        29.83     77.0         15.0      0.383273     2.000083
 2815      75.0      81.0      11.1           22.0         90.0        29.74     77.0          4.0      0.560338     2.000083

Most normal (bottom) based on distance-based OLS:
index  min_temp  max_temp  rainfall  wind_speed3pm  humidity3pm  pressure3pm  temp3pm  cloud_cover  distance_OLS  density_OLS
 2128      63.0      81.0       0.0           10.0         54.0        29.93     80.0          4.0      0.132443      1.27591

Most normal (bottom) based on density-based OLS:
index  min_temp  max_temp  rainfall  wind_speed3pm  humidity3pm  pressure3pm  temp3pm  cloud_cover  distance_OLS  density_OLS
 1353      76.0      91.0       0.0           10.0         55.0        29.89     90.0          6.0      0.149656          1.0
Code for Visualizing the Results:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
# Step 1: Load the dataset
df = pd.read_csv('/content/modified_weather_data.csv')  # Replace with the actual file path
# Step 2: Preprocess the data
df.replace('Blank', np.nan, inplace=True)  # Handle missing values
df = df.apply(pd.to_numeric, errors='coerce')  # Convert all columns to numeric
df.fillna(df.median(), inplace=True)  # Fill missing values with the median
# Select relevant columns
features = ['min_temp', 'max_temp', 'rainfall', 'wind_speed3pm', 'humidity3pm', 'pressure3pm', 'cloud_cover']
data = df[features]
# Step 3: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# --- Distance-based Outlier Detection ---
def calculate_distance_outlier_scores(data, threshold=2):
    # Calculate pairwise distances using Euclidean distance
    distances = cdist(data, data, metric='euclidean')
    # Calculate the mean distance for each point
    mean_distances = distances.mean(axis=1)
    # Outlier scores based on the mean distances, normalized to the range [0, 1]
    outlier_scores = mean_distances / mean_distances.max()
    return outlier_scores

# --- Density-based Outlier Detection ---
def calculate_density_outlier_scores(data, bandwidth=0.5):
    # Use KernelDensity to estimate density
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(data)
    # Get the log of the density for each point
    log_density = kde.score_samples(data)
    # Convert to outlier scores (higher log_density = less likely to be an outlier);
    # scaled so the densest (most normal) point gets a score of 1
    outlier_scores = -log_density / -log_density.max()
    return outlier_scores

# Step 4: Calculate OLS for both methods
distance_outlier_scores = calculate_distance_outlier_scores(scaled_data)
density_outlier_scores = calculate_density_outlier_scores(scaled_data)
# Add OLS to the dataframe
df['distance_OLS'] = distance_outlier_scores
df['density_OLS'] = density_outlier_scores
# Step 5: Sort the dataset by OLS scores and analyze the top/bottom examples
df_sorted_distance = df.sort_values(by='distance_OLS', ascending=False)
df_sorted_density = df.sort_values(by='density_OLS', ascending=False)
# Top 3 likely outliers
print("Top 3 outliers based on distance-based OLS:")
print(df_sorted_distance.head(3))
print("Top 3 outliers based on density-based OLS:")
print(df_sorted_density.head(3))
# Bottom example (most normal)
print("Most normal (bottom) based on distance-based OLS:")
print(df_sorted_distance.tail(1))
print("Most normal (bottom) based on density-based OLS:")
print(df_sorted_density.tail(1))
Visualization Results
Comparison Between Techniques
Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Step 6: Visualize the OLS scores
# Plot the distributions of the distance-based and density-based OLS
plt.figure(figsize=(14, 6))
# Distance-based OLS distribution
plt.subplot(1, 2, 1)
sns.histplot(df['distance_OLS'], kde=True, color='blue', bins=30)
plt.title('Distribution of Distance-based OLS Scores')
plt.xlabel('Distance-based OLS Score')
plt.ylabel('Frequency')
# Density-based OLS distribution
plt.subplot(1, 2, 2)
sns.histplot(df['density_OLS'], kde=True, color='green', bins=30)
plt.title('Distribution of Density-based OLS Scores')
plt.xlabel('Density-based OLS Score')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Step 7: Visualize the top 3 outliers and the bottom example
# Top 3 outliers based on distance-based OLS
top_3_distance_outliers = df_sorted_distance.head(3)
top_3_distance_outliers = top_3_distance_outliers[features + ['distance_OLS']]
# Top 3 outliers based on density-based OLS
top_3_density_outliers = df_sorted_density.head(3)
top_3_density_outliers = top_3_density_outliers[features + ['density_OLS']]
# Plot top 3 distance-based outliers
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='min_temp', y='max_temp', data=top_3_distance_outliers,
                color='red', s=100, label='Top 3 Distance-based Outliers')
plt.title('Top 3 Distance-based Outliers')
plt.xlabel('Min Temp (°F)')
plt.ylabel('Max Temp (°F)')
# Plot top 3 density-based outliers
plt.subplot(1, 2, 2)
sns.scatterplot(x='min_temp', y='max_temp', data=top_3_density_outliers,
                color='orange', s=100, label='Top 3 Density-based Outliers')
plt.title('Top 3 Density-based Outliers')
plt.xlabel('Min Temp (°F)')
plt.ylabel('Max Temp (°F)')
plt.tight_layout()
plt.show()
# Step 8: Scatter plot comparing distance-based and density-based OLS scores
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['distance_OLS'], y=df['density_OLS'], color='purple')
plt.title('Comparison of Distance-based vs Density-based OLS Scores')
plt.xlabel('Distance-based OLS Score')
plt.ylabel('Density-based OLS Score')
plt.tight_layout()
plt.show()
# Step 9: Visualize the most normal day (bottom example) for both OLS methods
# Most normal day based on distance-based OLS
most_normal_distance = df_sorted_distance.tail(1)
# Most normal day based on density-based OLS
most_normal_density = df_sorted_density.tail(1)
# Plot the most normal day for both distance-based and density-based OLS
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='min_temp', y='max_temp', data=most_normal_distance,
                color='blue', s=100, label='Most Normal (Distance-based)')
plt.title('Most Normal Day (Distance-based OLS)')
plt.xlabel('Min Temp (°F)')
plt.ylabel('Max Temp (°F)')
plt.subplot(1, 2, 2)
sns.scatterplot(x='min_temp', y='max_temp', data=most_normal_density,
                color='green', s=100, label='Most Normal (Density-based)')
plt.title('Most Normal Day (Density-based OLS)')
plt.xlabel('Min Temp (°F)')
plt.ylabel('Max Temp (°F)')
plt.tight_layout()
plt.show()
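To put a number on the agreement visible in the comparison scatter plot, the two OLS rankings can be compared with a rank correlation. This is an optional sketch (not part of the code above); it assumes the df with the distance_OLS and density_OLS columns, and the top-1% cutoff is an illustrative choice.

from scipy.stats import spearmanr

# Sketch: Spearman rank correlation between the two outlier scores
# (a value near 1 means both methods rank the days similarly)
rho, p_value = spearmanr(df['distance_OLS'], df['density_OLS'])
print(f"Spearman correlation between distance-based and density-based OLS: {rho:.3f} (p={p_value:.3g})")

# Overlap between the days flagged in the top 1% by each method
n_top = max(1, len(df) // 100)
top_distance = set(df['distance_OLS'].nlargest(n_top).index)
top_density = set(df['density_OLS'].nlargest(n_top).index)
print(f"Days flagged by both methods in the top {n_top}: {len(top_distance & top_density)}")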
Results