Data Mining - Data Preparation Report

Alper Başdoğan 19155035


Uğurcan Doğan 20155071
Ömer Faruk Sanlı 20155806

Before the trial-and-error stage, some data preparation steps were completed once, because we do not plan to change them between runs. The steps we do plan to vary are the following:
●​ How the missing values are filled.
●​ How the outliers are handled.
●​ Which normalization technique is used.
Each of these steps can be carried out in several ways, and every choice can alter the performance of the model. A full iteration over these alternatives has not been completed yet; in this report, the first version of each step is used to obtain the initial results.

Steps

➔​Data Selection
Our first step was to merge the test set with ground truth survival labels for evaluation.
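A minimal sketch of this merge step is shown below. The file names and the ground-truth file are assumptions based on the standard Titanic layout (the test set has no Survived column, and a separate file maps PassengerId to Survived); they are placeholders, not the exact files used.

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
labels = pd.read_csv("ground_truth.csv")  # hypothetical file with PassengerId and Survived

# Attach the survival labels to the test rows so they can be evaluated later.
test = test.merge(labels[["PassengerId", "Survived"]], on="PassengerId", how="left")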

➔​Binning
Ages are binned into categories using the edges [0, 18, 25, 40, 60, 75, 90]. The bin ranges might be changed in the future to see whether they have any effect on the end results.
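A short sketch of the binning, assuming the train DataFrame from the previous step; "Age_cut" matches the column name mentioned later in the report.

import pandas as pd

age_bins = [0, 18, 25, 40, 60, 75, 90]
# Each Age value is mapped to the interval it falls into.
train["Age_cut"] = pd.cut(train["Age"], bins=age_bins)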

➔​Data Reduction
◆​Remove redundant attributes:

At this step we tried to determine which attributes are redundant.

1.​ Ticket: the training data has 891 rows and Ticket has 681 unique values, so it is very difficult to extract meaning from it. When it is converted to categorical data, the number of features increases a lot and the curse of dimensionality occurs.
2.​ Name: every value is unique, so it is likewise difficult to extract meaning from it.
3.​ PassengerId: unique for every row as well. Age_cut was created only for analysis purposes.

We may try to remove or alter less important features in the future.
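A sketch of dropping the attributes judged redundant above, assuming the train and test DataFrames from the earlier steps; errors="ignore" simply keeps the call safe if a column is absent from one of the sets.

drop_cols = ["Ticket", "Name", "PassengerId"]
train = train.drop(columns=drop_cols, errors="ignore")
test = test.drop(columns=drop_cols, errors="ignore")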

➔​Filling in Missing Values

Missing values were checked: in the training data, 177 rows are missing in the Age column, 687 rows are missing in the Cabin column, and 2 rows are missing in the Embarked column.

In the test data, 86 rows are missing in the Age column, 1 row is missing in the Fare column, and 327 rows are missing in the Cabin column.

Deleting these rows might be harmful, so as a first approach we filled the missing values with mean values.
The Cabin column is deleted entirely, because too many of its rows are empty and the information it carries might not be impactful.
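A sketch of this fill/drop step, assuming the DataFrames from above. Age and Fare are filled with the column mean as described; Embarked is categorical, so filling it with the most frequent value is our assumption, since a mean is not defined for it.

for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())

# Embarked is categorical; using the mode here is an assumption on our part.
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

# Cabin is dropped entirely because most of its values are missing.
train = train.drop(columns=["Cabin"])
test = test.drop(columns=["Cabin"])
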
➔​Removing Outliers

To see the outliers better, we used boxplots.

There are several ways to handle outliers. We decided to start with capping them: each outlier is replaced with the upper or lower bound, depending on which side of the distribution it falls on. Depending on the results, the next techniques we will try are deleting the outliers and using the Z-score method.
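A sketch of the capping approach described above, assuming the usual 1.5 × IQR fences that boxplots use; the choice of which columns to cap is our assumption.

def cap_outliers(df, column):
    # Compute the interquartile range and the boxplot fences.
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values beyond the fences are replaced with the nearest bound.
    df[column] = df[column].clip(lower=lower, upper=upper)
    return df

train = cap_outliers(train, "Fare")
train = cap_outliers(train, "Age")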

➔​Data Transformation
◆​For Categorical:

“Sex” and “Embarked” will be treated as nominal categories because they do not indicate an ordinal relationship (unlike, for example, a distinction between upper and lower passenger classes).

“Fare” is normalized using Z-score.

“Age” is normalized using Z-score.
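A sketch of this transformation step: one-hot encoding is a common way to handle nominal categories (the report does not name the exact encoder, so this is an assumption), and StandardScaler performs the Z-score normalization of Fare and Age.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Nominal columns become indicator (one-hot) columns.
train = pd.get_dummies(train, columns=["Sex", "Embarked"])

# Z-score normalization: subtract the mean and divide by the standard deviation.
scaler = StandardScaler()
train[["Fare", "Age"]] = scaler.fit_transform(train[["Fare", "Age"]])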


Results after the first iteration of data preparation:

Model used: RandomForestClassifier


Parameters: n_estimators=200, random_state=42

After optimizing the parameters:


n_estimators: 520, max_depth: 13, min_samples_split: 6,
min_samples_leaf: 5, max_features: None
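A sketch of the model with the optimized parameters listed above; X_train and y_train are placeholders for the prepared features and the Survived labels, and carrying random_state=42 over from the first run is our assumption.

from sklearn.ensemble import RandomForestClassifier

X_train = train.drop(columns=["Survived"])
y_train = train["Survived"]

model = RandomForestClassifier(
    n_estimators=520,
    max_depth=13,
    min_samples_split=6,
    min_samples_leaf=5,
    max_features=None,
    random_state=42,
)
model.fit(X_train, y_train)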

We are still working on the data and the model to obtain even better results.
