Data Mining - Data Preparation Report

Alper Başdoğan 19155035


Uğurcan Doğan 20155071
Ömer Faruk Sanlı 20155806

Before the trial-and-error stage, some data preparation steps were completed once, because we do not plan to change them between runs. The steps we do plan to vary are the following:
●​ How the missing values are filled.
●​ How the outliers are handled.
●​ Which normalization technique is used.
Each of these steps can be carried out in several ways, and every choice can alter the performance of the model. A full iteration over these alternatives has not been completed yet; in this report, the first version of each step is used to obtain the initial results.

Steps

➔​Data Selection
Our first step was to merge the test set with ground truth survival labels for evaluation.
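A minimal sketch of this merge step is shown below. The file names and the ground-truth file are assumptions based on the standard Titanic layout (the test set has no Survived column, and a separate file maps PassengerId to Survived); they are placeholders, not the exact files used.

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
labels = pd.read_csv("ground_truth.csv")  # hypothetical file with PassengerId and Survived

# Attach the survival labels to the test rows so they can be evaluated later.
test = test.merge(labels[["PassengerId", "Survived"]], on="PassengerId", how="left")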

➔​Binning
Ages are binned into categories using the edges [0, 18, 25, 40, 60, 75, 90]. The bin ranges might be changed in the future to see whether they have any effect on the end results.
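A short sketch of the binning, assuming the train DataFrame from the previous step; "Age_cut" matches the column name mentioned later in the report.

import pandas as pd

age_bins = [0, 18, 25, 40, 60, 75, 90]
# Each Age value is mapped to the interval it falls into.
train["Age_cut"] = pd.cut(train["Age"], bins=age_bins)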

➔​Data Reduction
◆​Remove redundant attributes:

At this step we tried to determine which attributes are redundant.

1.​ Ticket: the training data has 891 rows and Ticket has 681 unique values, so it is very difficult to extract meaning from it. When it is converted to categorical data, the number of features increases a lot and the curse of dimensionality occurs.
2.​ Name: every value is unique, so it is likewise difficult to extract meaning from it.
3.​ PassengerId: unique for every row as well. Age_cut was created only for analysis purposes.

We may try to remove or alter less important features in the future.
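A sketch of dropping the attributes judged redundant above, assuming the train and test DataFrames from the earlier steps; errors="ignore" simply keeps the call safe if a column is absent from one of the sets.

drop_cols = ["Ticket", "Name", "PassengerId"]
train = train.drop(columns=drop_cols, errors="ignore")
test = test.drop(columns=drop_cols, errors="ignore")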

➔​Filling in Missing Values

Missing values were checked: in the training data, 177 rows are missing in the Age column, 687 rows are missing in the Cabin column, and 2 rows are missing in the Embarked column.

In the test data, 86 rows are missing in the Age column, 1 row is missing in the Fare column, and 327 rows are missing in the Cabin column.

Deleting these rows might be harmful, so as a first approach we filled the missing values with mean values.
The Cabin column is deleted entirely, because too many of its rows are empty and the information it carries might not be impactful.
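A sketch of this fill/drop step, assuming the DataFrames from above. Age and Fare are filled with the column mean as described; Embarked is categorical, so filling it with the most frequent value is our assumption, since a mean is not defined for it.

for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())

# Embarked is categorical; using the mode here is an assumption on our part.
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

# Cabin is dropped entirely because most of its values are missing.
train = train.drop(columns=["Cabin"])
test = test.drop(columns=["Cabin"])
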
➔​Removing Outliers

To see the outliers better, we used boxplots.

There are several ways to handle outliers. We decided to start with capping them: each outlier is replaced with the upper or lower bound, depending on which side of the distribution it falls on. Depending on the results, the next techniques we will try are deleting the outliers and using the Z-score method.
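A sketch of the capping approach described above, assuming the usual 1.5 × IQR fences that boxplots use; the choice of which columns to cap is our assumption.

def cap_outliers(df, column):
    # Compute the interquartile range and the boxplot fences.
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values beyond the fences are replaced with the nearest bound.
    df[column] = df[column].clip(lower=lower, upper=upper)
    return df

train = cap_outliers(train, "Fare")
train = cap_outliers(train, "Age")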

➔​Data Transformation
◆​For Categorical:

“Sex” and “Embarked” will be treated as nominal categories because they do not indicate an ordinal relationship (unlike, for example, a distinction between upper and lower passenger classes).

“Fare” is normalized using Z-score.

“Age” is normalized using Z-score.
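A sketch of this transformation step: one-hot encoding is a common way to handle nominal categories (the report does not name the exact encoder, so this is an assumption), and StandardScaler performs the Z-score normalization of Fare and Age.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Nominal columns become indicator (one-hot) columns.
train = pd.get_dummies(train, columns=["Sex", "Embarked"])

# Z-score normalization: subtract the mean and divide by the standard deviation.
scaler = StandardScaler()
train[["Fare", "Age"]] = scaler.fit_transform(train[["Fare", "Age"]])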


Results after the first iteration of data preparation:

Model used: RandomForestClassifier


Parameters: n_estimators=200, random_state=42

After optimizing the parameters:


n_estimators: 520, max_depth: 13, min_samples_split: 6,
min_samples_leaf: 5, max_features: None
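A sketch of the model with the optimized parameters listed above; X_train and y_train are placeholders for the prepared features and the Survived labels, and carrying random_state=42 over from the first run is our assumption.

from sklearn.ensemble import RandomForestClassifier

X_train = train.drop(columns=["Survived"])
y_train = train["Survived"]

model = RandomForestClassifier(
    n_estimators=520,
    max_depth=13,
    min_samples_split=6,
    min_samples_leaf=5,
    max_features=None,
    random_state=42,
)
model.fit(X_train, y_train)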

We are still working on the data and the model to obtain even better results.
