Understanding Data
Understanding Data
If you want to be the first to be informed about new projects, please do not forget
to follow us - by Fatma Nur AZMAN
Fatmanurazman.com | Linkedin | Github | Kaggle | Tableau
Project Description:
Adult Income Prediction This dataset was obtained from UCI Machine Learning
Repository. The aim of this problem is to classify adults in two different groups
based on their income where group 1 has an income less than USD 50k and group
2 has an income of more than or equal to USD 50k. The data available at hand
comes from Census 1994.
Domain Knowledge:
Economic Conditions
Technological Revolution:
At the beginning of the 1990s, the widespread adoption of the internet and the rapid
development of computer technology led to significant changes in the labor market.
Information technology and service sectors grew rapidly, creating many new job
opportunities.
Economic Growth:
The US economy entered a significant growth period from the mid-1990s. This growth
was supported by low inflation and low unemployment rates. However, economic
opportunities were not equally distributed across all regions and groups.
The increasing importance of education levels in the labor market directly affected
individuals' income levels. Higher-educated individuals generally worked in higher-
paying jobs, while lower-educated individuals had to work in low-wage jobs.
Demographic Changes
Aging Population:
The aging of the baby boomer generation began to put pressure on social security
systems and healthcare services. The increasing number of individuals reaching
retirement age also led to changes in the labor market.
Women's participation in the workforce increased significantly in the 1990s. This led to
an increase in household incomes and changes in gender roles in society.
Sectoral Changes
In the 1990s, while the manufacturing industry declined in some regions, the service
and technology-based sectors grew. This transformation led to increased
unemployment rates in some areas and economic imbalances.
Globalization:
Globalization led to increased trade and investments. Many US companies moved their
production facilities abroad while gaining access to global markets. This caused some
uncertainties and changes in the labor market.
In this context, the data obtained from the 1994 Census reflects the aforementioned
economic, social, and demographic changes. By examining the impact of education
levels, gender, race, and occupations on income in the labor market, we can better
understand the social dynamics of that period. These analyses can also contribute to
understanding the changes and continuities in comparison with today's conditions.
Rows: 32561
Columns: 15
Attribute
STT Unique Values
Name
education-
5 Number of years spent in education. Continuous.
num
Represents the profit an individual makes from the sale of assets (e.g.,
11 capital-gain
stocks or real estate). Continuous.
Represents the loss an individual incurs from the sale of assets (e.g.,
12 capital-loss
stocks or real estate). Continuous.
hours-per-
13 Continuous.
week
Table of CONTENTS
localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 3/64
13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
In [3]: df.shape
In [4]: df.head()
Exec-
1 82 Private 132870 HS-grad 9 Widowed
managerial
Some-
2 66 ? 186061 10 Widowed ? U
college
Machine-
3 54 Private 140359 7th-8th 4 Divorced U
op-inspct
Some- Prof-
4 41 Private 264663 10 Separated O
college specialty
In [6]: df.tail()
Married-civ- Machine-
32558 40 Private 154374 HS-grad 9
spouse op-inspct
Adm-
32559 58 Private 151910 HS-grad 9 Widowed
clerical
Never- Adm-
32560 22 Private 201490 HS-grad 9
married clerical
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 30725 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 30718 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 31978 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
In [9]: df.describe().T
In [10]: df.describe(include="object").T
In [6]: df.duplicated().sum()
Out[6]: 24
In [62]: duplicate_values(df)
Duplicate check...
There are 24 duplicated observations in the dataset.
24 duplicates were dropped!
No more duplicate rows!
In [9]: df.isnull().sum().sum()
Out[9]: 4261
Features Summary
In [15]: # !pip install ipywidgets ydata-profiling
#from ydata_profiling import ProfileReport
#profile = ProfileReport(df, title="Profiling Report")
#profile.to_file("profiling_report.html")
Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)
1. Private
22,673 (69.7%)
2. Self-emp-not-
2,540 (7.8%)
inc
2,093 (6.4%)
3. Local-gov
1,836 (5.6%)
workclass 4. nan 1,836
2 1,298 (4.0%)
[object] 5. State-gov (5.6%)
1,116 (3.4%)
6. Self-emp-inc
960 (3.0%)
7. Federal-gov
14 (0.0%)
8. Without-pay
7 (0.0%)
9. Never-worked
Mean (sd) :
189780.8
(105556.5)
min < med < max:
fnlwgt 21,648 distinct 0
3 12285.0 <
[int64] values (0.0%)
178356.0 <
1484705.0
IQR (CV) :
119166.0 (1.8)
Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)
1. Married-civ-
spouse
14,970 (46.0%)
2. Never-married
10,667 (32.8%)
3. Divorced
4,441 (13.6%)
marital.status 4. Separated 0
6 1,025 (3.2%)
[object] 5. Widowed (0.0%)
993 (3.1%)
6. Married-
418 (1.3%)
spouse-absent
23 (0.1%)
7. Married-AF-
spouse
1. Prof-specialty
2. Craft-repair
4,136 (12.7%)
3. Exec-managerial
4,094 (12.6%)
4. Adm-clerical
4,065 (12.5%)
5. Sales
3,768 (11.6%)
6. Other-service
3,650 (11.2%)
occupation 7. Machine-op- 1,843
7 3,291 (10.1%)
[object] inspct (5.7%)
2,000 (6.1%)
8. nan
1,843 (5.7%)
9. Transport-
1,597 (4.9%)
moving
1,369 (4.2%)
10. Handlers-
2,724 (8.4%)
cleaners
11. other
1. White
2. Black 27,795 (85.4%)
3. Asian-Pac- 3,122 (9.6%)
race 0
9 Islander 1,038 (3.2%)
[object] (0.0%)
4. Amer-Indian- 311 (1.0%)
Eskimo 271 (0.8%)
5. Other
Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)
plt.figure(figsize=(15, 5 * num_rows))
for i, col in enumerate(df.iloc[:, :-1].columns, 1):
plt.subplot(num_rows, 3, i)
plt.title(f"Distribution of {col} Data")
sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
Out[10]: Skew
capital.gain 11.949403
capital.loss 4.592702
fnlwgt 1.447703
Out[63]: ['workclass',
'education',
'marital.status',
'occupation',
'relationship',
'race',
'sex',
'native.country']
Out[24]: 4261
age 0 0.00
fnlwgt 0 0.00
education 0 0.00
education.num 0 0.00
marital.status 0 0.00
relationship 0 0.00
race 0 0.00
sex 0 0.00
capital.gain 0 0.00
capital.loss 0 0.00
hours.per.week 0 0.00
income 0 0.00
msno.matrix(df);
def get_unique_values(df):
output_data = []
return output_df
In [28]: get_unique_values(df)
0 age 73 - float64
3 education 16 - object
4 education.num 16 - float64
6 occupation 14 - object
11 capital.loss 92 - float64
12 hours.per.week 94 - float64
13 native.country 41 - object
fig.show()
Categorical Features
In [11]: df[cat_features].columns
counts = df['workclass'].value_counts().reindex(sorted_workclass[::-1])
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('Workclass', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=12)
ax1.tick_params(axis='y', labelsize=16)
General Insights
The private sector is the most dominant category among the work classes and
creates a significant disparity in income distribution.
For local, state, and federal government jobs, the low-income category is
dominant; however, a significant portion also falls into the high-income category.
Individuals who work without pay and those who have never worked are generally
found in the low-income category.
plt.tight_layout()
plt.show()
Category Merging: Dividing education levels into too many categories can complicate
data analysis and modeling processes. Therefore, similar levels have been combined to
form larger and more meaningful categories.
plt.tight_layout()
plt.show()
In [69]: df['marital.status'].replace(
['Never-married'], 'NotMarried', inplace=True
)
df['marital.status'].replace(
['Married-AF-spouse', 'Married-civ-spouse'], 'Married', inplace=True
)
df['marital.status'].replace(
['Married-spouse-absent', 'Separated'], 'Separated', inplace=True
)
df['marital.status'].replace(
['Divorced', 'Widowed'], 'Widowed', inplace=True
)
Marital Status Categories Merging In order to simplify the analysis and improve
model performance, we combined similar marital status categories. This helps in
reducing the number of distinct categories, making the data more manageable and the
results more interpretable.
General Insights
Marriage and Age: Marriage rates are low among young adults, peak in middle age,
and decline again in older age. This indicates that focusing on education and career is
common in early life, marriage and family building are more prevalent in middle age,
and loss of a spouse increases in older age.
Tendency Not to Marry: The non-marriage rates are higher among younger age
groups, suggesting that education and career-oriented lifestyles are more common
in modern societies.
General Insights
General Insights:
Data Imbalance:
The data is heavily skewed towards individuals from the United States, which could
impact the generalizability of any models or analyses performed.
The dataset is predominantly composed of individuals from the United States, with a
minor but noticeable representation from Mexico and a variety of other countries. This
heavy imbalance towards the US population suggests the need for careful handling of
data to avoid biases.
Given the significant representation from Mexico, segmented analyses (e.g., comparing
outcomes between US natives and Mexican immigrants) might be feasible and
insightful.
For other countries with smaller representations, aggregated analyses might be more
appropriate.
Numerical Features
In [54]: df['age_bin'] = pd.cut(df['age'], bins=20)
Purpose: To combine the capital.gain (capital gain) and capital.loss (capital loss)
columns into a single column to calculate the net capital gain.
Out[44]: income
0 145
1 63
Name: count, dtype: int64
Out[45]: income
0 892
1 81
Name: count, dtype: int64
The code segments are used to analyze the income status of individuals with extremely
high or low weekly working hours and to remove these outliers from the dataset.
fnlwgt: As a result of the analysis, the effect of fnlwgt on the model is almost
negligible. Therefore, it was excluded from the data.
In [48]: df.shape
In [49]: df.columns
Correlation
In [79]: numeric_df = df.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(2, 7))
sns.heatmap(df_corr_target[[target_variable]], annot=True, vmin=-1, vmax=1,
plt.title(f'Correlation with {target_variable}')
plt.show()
plot_target_correlation_heatmap(df, 'income')
Multicollinearity
In [52]: def color_correlation1(val):
"""
Takes a scalar and returns a string with
the css property in a variety of color scales
for different correlations.
"""
if val >= 0.6 and val < 0.99999 or val <= -0.6 and val > -0.99999:
color = 'red'
elif val < 0.6 and val >= 0.3 or val > -0.6 and val <= -0.3:
color = 'blue'
elif val == 1:
color = 'green'
else:
color = 'black'
return 'color: %s' % color
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.corr().style.applymap(color_correlation1)
Models
make_column_transformer
In [24]: df.columns
In [82]: cat_onehot = [
'workclass', 'occupation', 'relationship', 'race', 'sex', 'native.country',
'marital.status'
]
cat_ordinal = ['education', 'capital_diff']
cat_for_edu = [
'Preschool', 'elementary_school', 'secondary_school', 'HS-grad',
'Some-college', 'Assoc', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate
]
cat_for_capdiff = ['Low', 'High']
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)
Out[84]: ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ LogisticRegression ?
In [85]: ConfusionMatrixDisplay.from_estimator(pipe_model,
X_test,
y_test,
normalize='true');
logistic Test_Set
[[4443 290]
[ 636 903]]
precision recall f1-score support
logistic Train_Set
[[17632 1296]
[ 2528 3628]]
precision recall f1-score support
Cross Validate
In [113… operations = [("transformer", column_trans), ("logistic", LogisticRegression(ma
pipecv_model = Pipeline(steps=operations)
cv = StratifiedKFold(n_splits=10)
scores = cross_validate(pipecv_model,
X_train,
y_train,
scoring=["accuracy", "precision", "recall", "f1"],
cv=cv,
return_train_score = True)
df_scores = pd.DataFrame(scores, index=range(1,11))
df_scores.mean()[2:]
GridSearchCV
param_grid = [ { "logistic__penalty" : ['l1', 'l2'], "logistic__C" : [0.01, 0.05,0.03, 0.1, 1],
"logistic__class_weight": ["balanced", None] , "logistic__solver": ['liblinear', 'saga', 'lbfgs'],
"logistic__max_iter": [1000, 2000] } ]
Many grids of money have been tried. Finally, the following features were identified.
log_model = Pipeline(steps=operations)
param_grid = [
{
"logistic__penalty" : ['l1'],
"logistic__C" : [0.03],
"logistic__class_weight": ["balanced"] ,
"logistic__solver": ['saga'],
"logistic__max_iter": [1000]
}
]
cv = StratifiedKFold(n_splits = 10)
grid_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",
n_jobs = -1,
return_train_score=True).fit(X_train, y_train)
In [90]: grid_model.best_estimator_
Out[90]: ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ LogisticRegression ?
In [91]: grid_model.best_score_
Out[91]: 0.683264633932356
In [92]: grid_model.best_index_
Out[92]: 0
logisticgrid Test_Set
[[3783 950]
[ 253 1286]]
precision recall f1-score support
logisticgrid Train_Set
[[15047 3881]
[ 952 5204]]
precision recall f1-score support
KNN Model
In [39]: operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)
Out[39]: ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ KNeighborsClassifier ?
knn Test_Set
[[4286 447]
[ 655 884]]
precision recall f1-score support
knn Train_Set
[[17746 1182]
[ 1881 4275]]
precision recall f1-score support
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)
Out[98]: ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ KNeighborsClassifier ?
In [172… test_error_rates = []
knn_pipe_model = Pipeline(steps=operations)
f1_mean = scores["test_f1"].mean()
test_error = 1 - f1_mean
test_error_rates.append(test_error)
knn_pipe_model = Pipeline(steps=operations)
knn_pipe_model.fit(X_train, y_train)
f1_test_mean = scores["test_f1"].mean()
f1_train_mean = scores["train_f1"].mean()
test_error = 1 - f1_test_mean
train_error = 1 -f1_train_mean
test_error_rates.append(test_error)
train_error_rates.append(train_error)
plt.plot(range(1, 10),
train_error_rates,
color='red',
marker='o',
markerfacecolor='green',
markersize=10)
for i in k_list:
operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n
knn = Pipeline(steps=operations)
knn.fit(X_train, y_train)
print(f'WITH K={i}\n')
eval_metric(knn, X_train, y_train, X_test, y_test, "knn_elbow")
WITH K=3
knn_elbow Test_Set
[[4236 497]
[ 684 855]]
precision recall f1-score support
knn_elbow Train_Set
[[17848 1080]
[ 1568 4588]]
precision recall f1-score support
WITH K=5
knn_elbow Test_Set
[[4286 447]
[ 655 884]]
precision recall f1-score support
knn_elbow Train_Set
[[17746 1182]
[ 1881 4275]]
precision recall f1-score support
WITH K=7
knn_elbow Test_Set
[[4315 418]
[ 647 892]]
precision recall f1-score support
knn_elbow Train_Set
[[17663 1265]
[ 2017 4139]]
precision recall f1-score support
model = Pipeline(steps=operations)
scores = cross_validate(model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=10,
return_train_score=True)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]
Many grids of money have been tried. Finally, the following features were
identified.Tried values up to k_values = 30.
In [101… param_grid = [
{
"knn__n_neighbors": [19],
"knn__metric": ['euclidean'],
"knn__weights": ['uniform']
}
]
knn_grid_model = GridSearchCV(knn_model,
param_grid,
scoring='f1',
cv=5,
return_train_score=True,
n_jobs=-1).fit(X_train, y_train)
In [102… knn_grid_model.best_estimator_
Out[102… ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ KNeighborsClassifier ?
In [103… knn_grid_model.best_index_
Out[103… 0
In [104… pd.DataFrame(
knn_grid_model.cv_results_).loc[0,["mean_test_score", "mean_train_score"]]
In [105… knn_grid_model.best_score_
Out[105… 0.6396999259520801
knn_grid Test_Set
[[4367 366]
[ 628 911]]
precision recall f1-score support
knn_grid Train_Set
[[17492 1436]
[ 2269 3887]]
precision recall f1-score support
As a result of the values we gave to K, the tests did not improve, but we
prevented overfitting and found more reliable results
Out[191… 0.859903581601922
SVM Model
In [107… operations = [("transformer", column_trans),("SVC", SVC(random_state=42))]
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)
Out[107… ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ SVC ?
Model Performance
In [194… eval_metric(pipe_model, X_train, y_train, X_test, y_test, "svm")
svm Test_Set
[[4498 235]
[ 692 847]]
precision recall f1-score support
svm Train_Set
[[17901 1027]
[ 2756 3400]]
precision recall f1-score support
pipe_model = Pipeline(steps=operations)
cv = StratifiedKFold(n_splits=5)
scores = cross_validate(pipe_model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=cv,
return_train_score=True,
n_jobs=-1)
GridsearchCV
param_grid = {'SVC__C': [0.01, 0.1, 1, 10, 100], 'SVC__gamma': ["scale", "auto", 0.001,
0.01, 0.1, 0.5], 'SVC__kernel': ['rbf', 'linear'],}
Many grids of money have been tried. Finally, the following features were identified.
svm_model_grid = GridSearchCV(pipe_model,
param_grid,
scoring="recall_macro",
cv=5,
return_train_score=True,
n_jobs=2,
verbose=2).fit(X_train, y_train)
In [109… svm_model_grid.best_estimator_
Out[109… ▸ Pipeline i ?
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ SVC ?
In [110… svm_model_grid.best_index_
Out[110… 0
In [111… pd.DataFrame(
svm_model_grid.cv_results_).loc[0,
["mean_test_score", "mean_train_score"]]
In [112… svm_model_grid.best_score_
Out[112… 0.7731760615298443
svm_grid Test_Set
[[4409 324]
[ 607 932]]
precision recall f1-score support
svm_grid Train_Set
[[17835 1093]
[ 2040 4116]]
precision recall f1-score support
Out[114… 0.731964885295869
plt.subplot(411)
compare = compare.sort_values(by="F1", ascending=False)
ax=sns.barplot(x="F1", y="Model", data=compare, palette="magma")
labels(ax)
plt.subplot(412)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="magma")
labels(ax)
plt.subplot(413)
compare = compare.sort_values(by="ROC_AUC", ascending=False)
ax=sns.barplot(x="ROC_AUC", y="Model", data=compare, palette="magma")
labels(ax)
plt.subplot(414)
compare = compare.sort_values(by="PRC", ascending=False)
ax=sns.barplot(x="PRC", y="Model", data=compare, palette="magma")
labels(ax)
plt.show()
log_model = Pipeline(steps=operations)
param_grid = [
{
"logistic__penalty" : ['l1'],
"logistic__C" : [0.03],
"logistic__class_weight": ["balanced"] ,
"logistic__solver": ['saga'],
"logistic__max_iter": [1000]
}
]
cv = StratifiedKFold(n_splits = 10)
final_pipe_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",
n_jobs = -1,
return_train_score=True).fit(X, y)
Out[120… ▸ GridSearchCV i ?
▸ best_estimator_: Pipeline
▸ transformer: ColumnTransformer ?
▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?
▸ LogisticRegression ?
Prediction
In [126… my_dict= {
'age': [44.0, 32.0, 30.0],
'workclass': ['Federal-gov', 'Private', 'Self-emp-not-inc'],
'education': ['Bachelors', 'Bachelors', 'Some-college'],
'education.num': [13.0, 13.0, 10.0],
'marital.status': ['Widowed', 'Married', 'NotMarried'],
'occupation': ['Tech-support', 'Sales', 'Sales'],
'relationship': ['Not-in-family', 'Husband', 'Other-relative'],
'race': ['White', 'White', 'Others'],
'sex': ['Male', 'Male', 'Male'],
'hours.per.week': [40.0, 40.0, 40.0],
'native.country': ['United-States', 'United-States', 'Other'],
'capital_diff': ['Low', 'Low', 'Low']
}
In [129… new_model.predict(sample)
In [130… new_model.decision_function(sample)
📂 Conclusion
In [ ]: # Logistic grid recall: 83, f1 : 0.68 prc=0.75
In an unbalanced dataset, F1-Score and Recall metrics are indeed very important.
These metrics play a critical role in evaluating model performance in unbalanced
datasets, as they measure the model's ability to correctly predict the minority class.
The Logistic Regression model stands out with a Recall of 0.83 and an F1-Score of
0.68. This model demonstrates balanced performance across the classes in the
unbalanced dataset, effectively capturing the minority class while also performing
well in overall classification.
The KNN Model, although it performs well in terms of accuracy, lags behind
Logistic Regression with a Recall of 0.59 and an F1-Score of 0.64. This indicates that
the model is less effective at capturing the minority class in the unbalanced
dataset.
The SVM Model, despite excelling in accuracy, also falls behind Logistic Regression
in these two metrics with a Recall of 0.60 and an F1-Score of 0.66. It is evident that
SVM is not sufficiently successful in capturing the minority class.
Based on these results, I can say that the Logistic Regression model offers the best
performance in terms of Recall and F1-Score for unbalanced datasets and should
therefore be preferred. Especially in unbalanced datasets, it is critical that the
model correctly identifies the minority class, making Logistic Regression the most
suitable choice.
THANK YOU
If you want to be the first to be informed about new projects, please do not
forget to follow us - by Fatma Nur AZMAN
Fatmanurazman.com | Linkedin | Github | Kaggle | Tableau