Understanding Data


13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Adult Income Prediction

If you want to be the first to be informed about new projects, please do not forget
to follow us - by Fatma Nur AZMAN
Fatmanurazman.com | Linkedin | Github | Kaggle | Tableau

Understanding The Data

Project Description:
This dataset was obtained from the UCI Machine Learning Repository. The aim of this
problem is to classify adults into two groups based on their income: group 1 has an
income of at most USD 50K (<=50K) and group 2 has an income of more than USD 50K
(>50K). The data comes from the 1994 US Census.

Domain Knowledge:
Economic Conditions

Technological Revolution:

During the 1990s, the growing adoption of the internet and the rapid development of
computer technology led to significant changes in the labor market. Information
technology and service sectors grew rapidly, creating many new job opportunities.

Economic Growth:

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 1/64



The US economy entered a significant growth period from the mid-1990s. This growth
was supported by low inflation and low unemployment rates. However, economic
opportunities were not equally distributed across all regions and groups.

Social and Political Situation

Diversity and Immigration:

In the 1990s, the number of people immigrating to the US increased. Immigrants
played a crucial role in the labor market and met labor demands in many sectors. This
situation also led to some social tensions and debates.

Education and Workforce:

The increasing importance of education levels in the labor market directly affected
individuals' income levels. Higher-educated individuals generally worked in higher-
paying jobs, while lower-educated individuals had to work in low-wage jobs.

Demographic Changes

Aging Population:

The aging of the baby boomer generation began to put pressure on social security
systems and healthcare services. The increasing number of individuals reaching
retirement age also led to changes in the labor market.

Women's Participation in the Workforce:

Women's participation in the workforce increased significantly in the 1990s. This led to
an increase in household incomes and changes in gender roles in society.

Sectoral Changes

Transformation of the Manufacturing Industry:

In the 1990s, while the manufacturing industry declined in some regions, the service
and technology-based sectors grew. This transformation led to increased
unemployment rates in some areas and economic imbalances.

Globalization:

Globalization led to increased trade and investments. Many US companies moved their
production facilities abroad while gaining access to global markets. This caused some
uncertainties and changes in the labor market.

In this context, the data obtained from the 1994 Census reflects the aforementioned
economic, social, and demographic changes. By examining the impact of education
levels, gender, race, and occupations on income in the labor market, we can better
understand the social dynamics of that period. These analyses can also contribute to
understanding the changes and continuities in comparison with today's conditions.


About the Dataset


Dataset Descriptions:

Rows: 32561
Columns: 15

No.  Attribute Name   Unique Values
1    Age              Describes the age of individuals. Continuous.
2    Workclass        Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
                      Without-pay, Never-worked.
3    fnlwgt           Continuous. This is a weighting factor created by the US Census Bureau and
                      indicates the number of people represented by each data entry.
4    education        Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc,
                      9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5    education-num    Number of years spent in education. Continuous.
6    marital-status   Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
                      Married-spouse-absent, Married-AF-spouse.
7    occupation       Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
                      Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical,
                      Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv,
                      Armed-Forces.
8    relationship     Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9    race             White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10   sex              Female, Male.
11   capital-gain     Represents the profit an individual makes from the sale of assets (e.g.,
                      stocks or real estate). Continuous.
12   capital-loss     Represents the loss an individual incurs from the sale of assets (e.g.,
                      stocks or real estate). Continuous.
13   hours-per-week   Continuous.
14   native-country   United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
                      Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran,
                      Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal,
                      Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
                      Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador,
                      Trinidad & Tobago, Peru, Hong, Netherlands.
15   salary           >50K, <=50K.
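
Because fnlwgt is described as the number of people each row stands for, a population-level share would weight rows by it. The notebook does not do this in this section. The sketch below uses made-up numbers and a hypothetical `income_high` column, and only illustrates how an unweighted and a weighted share can differ.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values only, not taken from the dataset).
# income_high is a hypothetical 0/1 flag for ">50K".
toy = pd.DataFrame({"income_high": [1, 0, 0], "fnlwgt": [100, 300, 600]})

# Plain share: every row counts once.
unweighted = toy["income_high"].mean()

# Weighted share: every row counts as many times as its fnlwgt says.
weighted = np.average(toy["income_high"], weights=toy["fnlwgt"])

print(unweighted, weighted)
```

Whether to weight depends on the goal: a classifier trained on rows usually ignores fnlwgt, while a population estimate would not.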

Table of CONTENTS
Understanding The Data


Exploratory Data Analysis (EDA)
Feature Engineering and Outliers
Correlation
Models
Logistic Regression Model
KNN Model
SVM Model
Compare Models Performance
Final Model and Model Deployment
Prediction
Conclusion

Import Libraries and Data Review


In [58]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

%matplotlib inline

from sklearn.impute import SimpleImputer

from scipy import stats

# Lines marked "cut off" were truncated in the export: only the last visible
# name is completed, and any names that followed it are not recoverable.
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate  # cut off
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.compose import make_column_transformer

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score  # cut off
from sklearn.metrics import PrecisionRecallDisplay, roc_curve, average_precision_score  # cut off
from sklearn.metrics import RocCurveDisplay, roc_auc_score, auc
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay  # cut off

from yellowbrick.regressor import ResidualsPlot, PredictionError

import warnings
warnings.filterwarnings("ignore")

In [59]: df0 = pd.read_csv('adult.csv')


df = df0.copy()

In [3]: df.shape


Out[3]: (32561, 15)

In [4]: df.head()

Out[4]:
   age workclass  fnlwgt     education  education.num marital.status         occupation  rela…
0   90         ?   77053       HS-grad              9        Widowed                  ?
1   82   Private  132870       HS-grad              9        Widowed    Exec-managerial
2   66         ?  186061  Some-college             10        Widowed                  ?  U…
3   54   Private  140359       7th-8th              4       Divorced  Machine-op-inspct  U…
4   41   Private  264663  Some-college             10      Separated     Prof-specialty  O…

In [6]: df.tail()

Out[6]:
       age workclass  fnlwgt     education  education.num      marital.status         occupation
32556   22   Private  310152  Some-college             10       Never-married    Protective-serv
32557   27   Private  257302    Assoc-acdm             12  Married-civ-spouse       Tech-support
32558   40   Private  154374       HS-grad              9  Married-civ-spouse  Machine-op-inspct
32559   58   Private  151910       HS-grad              9             Widowed       Adm-clerical
32560   22   Private  201490       HS-grad              9       Never-married       Adm-clerical

Exploratory Data Analysis (EDA)


In [7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

In [60]: df[df == '?'] = np.nan  # '?' marks missing values; convert to NaN
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 30725 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 30718 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 31978 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

In [9]: df.describe().T


Out[9]:
                  count           mean            std      min       25%       50%  …
age             32561.0      38.581647      13.640433     17.0      28.0      37.0
fnlwgt          32561.0  189778.366512  105549.977697  12285.0  117827.0  178356.0  23…
education.num   32561.0      10.080679       2.572720      1.0       9.0      10.0
capital.gain    32561.0    1077.648844    7385.292085      0.0       0.0       0.0
capital.loss    32561.0      87.303830     402.960219      0.0       0.0       0.0
hours.per.week  32561.0      40.437456      12.347429      1.0      40.0      40.0

In [10]: df.describe(include="object").T

Out[10]: count unique top freq

workclass 30725 8 Private 22696

education 32561 16 HS-grad 10501

marital.status 32561 7 Married-civ-spouse 14976

occupation 30718 14 Prof-specialty 4140

relationship 32561 6 Husband 13193

race 32561 5 White 27816

sex 32561 2 Male 21790

native.country 31978 41 United-States 29170

income 32561 2 <=50K 24720

In [6]: df.duplicated().sum()

Out[6]: 24

In [61]: def duplicate_values(df):
    """Report duplicated rows and drop them in place, keeping the first occurrence."""
    print("Duplicate check...")
    num_duplicates = df.duplicated(subset=None, keep='first').sum()
    if num_duplicates > 0:
        print("There are", num_duplicates, "duplicated observations in the dataset.")
        df.drop_duplicates(keep='first', inplace=True)
        print(num_duplicates, "duplicates were dropped!")
        print("No more duplicate rows!")
    else:
        print("There are no duplicated observations in the dataset.")

In [62]: duplicate_values(df)

Duplicate check...
There are 24 duplicated observations in the dataset.
24 duplicates were dropped!
No more duplicate rows!

In [9]: df.isnull().sum().sum()


Out[9]: 4261

In [15]: ax = sns.countplot(x="income", data=df)


ax.bar_label(ax.containers[0]);

The target is imbalanced: roughly three quarters of the rows are <=50K and one quarter
are >50K.
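
To put a number on the imbalance rather than read it off the bar chart, class shares can be computed directly. This is a minimal sketch on a toy frame with made-up values; on the notebook's data the same call would be applied to `df["income"]`.

```python
import pandas as pd

# Toy stand-in for the income column (illustrative values only).
toy = pd.DataFrame({"income": ["<=50K"] * 3 + [">50K"]})

# normalize=True returns proportions instead of raw counts.
class_share = toy["income"].value_counts(normalize=True)
print(class_share)
```

With an imbalanced target, accuracy alone can look good for a model that mostly predicts the majority class, which is one reason to also look at precision, recall, and F1.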

Features Summary
In [15]: # !pip install ipywidgets ydata-profiling
#from ydata_profiling import ProfileReport
#profile = ProfileReport(df, title="Profiling Report")
#profile.to_file("profiling_report.html")

In [16]: #!pip install summarytools


from summarytools import dfSummary
dfSummary(df)


Out[16]: Data Frame Summary
df
Dimensions: 32,537 x 15
Duplicates: 0

Columns: No, Variable, Stats / Values, Freqs / (% of Valid), Missing (the Graph column is omitted)

1 age [int64], Missing: 0 (0.0%)
  Mean (sd): 38.6 (13.6); min < med < max: 17.0 < 37.0 < 90.0; IQR (CV): 20.0 (2.8); 73 distinct values

2 workclass [object], Missing: 1,836 (5.6%)
  1. Private 22,673 (69.7%); 2. Self-emp-not-inc 2,540 (7.8%); 3. Local-gov 2,093 (6.4%);
  4. nan 1,836 (5.6%); 5. State-gov 1,298 (4.0%); 6. Self-emp-inc 1,116 (3.4%);
  7. Federal-gov 960 (3.0%); 8. Without-pay 14 (0.0%); 9. Never-worked 7 (0.0%)

3 fnlwgt [int64], Missing: 0 (0.0%)
  Mean (sd): 189780.8 (105556.5); min < med < max: 12285.0 < 178356.0 < 1484705.0;
  IQR (CV): 119166.0 (1.8); 21,648 distinct values

4 education [object], Missing: 0 (0.0%)
  1. HS-grad 10,494 (32.3%); 2. Some-college 7,282 (22.4%); 3. Bachelors 5,353 (16.5%);
  4. Masters 1,722 (5.3%); 5. Assoc-voc 1,382 (4.2%); 6. 11th 1,175 (3.6%);
  7. Assoc-acdm 1,067 (3.3%); 8. 10th 933 (2.9%); 9. 7th-8th 645 (2.0%);
  10. Prof-school 576 (1.8%); 11. other 1,908 (5.9%)

5 education.num [int64], Missing: 0 (0.0%)
  Mean (sd): 10.1 (2.6); min < med < max: 1.0 < 10.0 < 16.0; IQR (CV): 3.0 (3.9); 16 distinct values

6 marital.status [object], Missing: 0 (0.0%)
  1. Married-civ-spouse 14,970 (46.0%); 2. Never-married 10,667 (32.8%); 3. Divorced 4,441 (13.6%);
  4. Separated 1,025 (3.2%); 5. Widowed 993 (3.1%); 6. Married-spouse-absent 418 (1.3%);
  7. Married-AF-spouse 23 (0.1%)

7 occupation [object], Missing: 1,843 (5.7%)
  1. Prof-specialty 4,136 (12.7%); 2. Craft-repair 4,094 (12.6%); 3. Exec-managerial 4,065 (12.5%);
  4. Adm-clerical 3,768 (11.6%); 5. Sales 3,650 (11.2%); 6. Other-service 3,291 (10.1%);
  7. Machine-op-inspct 2,000 (6.1%); 8. nan 1,843 (5.7%); 9. Transport-moving 1,597 (4.9%);
  10. Handlers-cleaners 1,369 (4.2%); 11. other 2,724 (8.4%)

8 relationship [object], Missing: 0 (0.0%)
  1. Husband 13,187 (40.5%); 2. Not-in-family 8,292 (25.5%); 3. Own-child 5,064 (15.6%);
  4. Unmarried 3,445 (10.6%); 5. Wife 1,568 (4.8%); 6. Other-relative 981 (3.0%)

9 race [object], Missing: 0 (0.0%)
  1. White 27,795 (85.4%); 2. Black 3,122 (9.6%); 3. Asian-Pac-Islander 1,038 (3.2%);
  4. Amer-Indian-Eskimo 311 (1.0%); 5. Other 271 (0.8%)

10 sex [object], Missing: 0 (0.0%)
  1. Male 21,775 (66.9%); 2. Female 10,762 (33.1%)

11 capital.gain [int64], Missing: 0 (0.0%)
  Mean (sd): 1078.4 (7388.0); min < med < max: 0.0 < 0.0 < 99999.0; IQR (CV): 0.0 (0.1); 119 distinct values

12 capital.loss [int64], Missing: 0 (0.0%)
  Mean (sd): 87.4 (403.1); min < med < max: 0.0 < 0.0 < 4356.0; IQR (CV): 0.0 (0.2); 92 distinct values

13 hours.per.week [int64], Missing: 0 (0.0%)
  Mean (sd): 40.4 (12.3); min < med < max: 1.0 < 40.0 < 99.0; IQR (CV): 5.0 (3.3); 94 distinct values

14 native.country [object], Missing: 582 (1.8%)
  1. United-States 29,153 (89.6%); 2. Mexico 639 (2.0%); 3. nan 582 (1.8%); 4. Philippines 198 (0.6%);
  5. Germany 137 (0.4%); 6. Canada 121 (0.4%); 7. Puerto-Rico 114 (0.4%); 8. El-Salvador 106 (0.3%);
  9. India 100 (0.3%); 10. Cuba 95 (0.3%); 11. other 1,292 (4.0%)

15 income [object], Missing: 0 (0.0%)
  1. <=50K 24,698 (75.9%); 2. >50K 7,839 (24.1%)

In [17]: import math

# Plot every column except the target (last column) on a grid with 3 plots per row
num_cols = df.iloc[:, :-1].shape[1]
num_rows = math.ceil(num_cols / 3)

plt.figure(figsize=(15, 5 * num_rows))
for i, col in enumerate(df.iloc[:, :-1].columns, 1):
    plt.subplot(num_rows, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()


In [10]: num_cols = df.select_dtypes('number').columns

skew_limit = 0.75  # define a limit above which we will log transform
skew_vals = df[num_cols].skew()

# Showing the skewed columns
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0: 'Skew'})
             .query('abs(Skew) > {}'.format(skew_limit)))
skew_cols

Out[10]: Skew

capital.gain 11.949403

capital.loss 4.592702

fnlwgt 1.447703
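
The comment on `skew_limit` mentions a log transform for columns above the limit, but that step is not shown in this part of the notebook. As a rough sketch of what it could look like, assuming `np.log1p` is an acceptable transform and using made-up values:

```python
import numpy as np
import pandas as pd

# Toy right-skewed column (illustrative values, not taken from the dataset).
toy = pd.DataFrame({"capital.gain": [0, 0, 0, 0, 10, 100, 99999]})

skew_before = toy["capital.gain"].skew()

# log1p computes log(1 + x), so zeros stay at zero instead of becoming -inf.
toy["capital.gain_log"] = np.log1p(toy["capital.gain"])
skew_after = toy["capital.gain_log"].skew()

print(round(skew_before, 2), round(skew_after, 2))
```

`log1p` only applies to non-negative values, which holds for the three columns listed above.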

In [20]: sns.pairplot(df, hue= "income", corner=True);

In [63]: cat_features = df.select_dtypes(include=['object']).columns.tolist()


cat_features = [col for col in cat_features if col != 'income']
cat_features


Out[63]: ['workclass',
'education',
'marital.status',
'occupation',
'relationship',
'race',
'sex',
'native.country']

In [64]: num_features = df.select_dtypes(include=['number']).columns.tolist()

In [65]: df['income'] = df['income'].apply(lambda x: 0 if x == '<=50K' else 1)

Handling Missing Values


In [24]: df.isnull().sum().sum()

Out[24]: 4261

In [25]: missing_count = df.isnull().sum()
value_count = df.isnull().count()
missing_percentage = round(missing_count / value_count * 100, 2)
missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
missing_df

Out[25]: count percentage

age 0 0.00

workclass 1836 5.64

fnlwgt 0 0.00

education 0 0.00

education.num 0 0.00

marital.status 0 0.00

occupation 1843 5.66

relationship 0 0.00

race 0 0.00

sex 0 0.00

capital.gain 0 0.00

capital.loss 0 0.00

hours.per.week 0 0.00

native.country 582 1.79

income 0 0.00

In [30]: # !pip install missingno


import missingno as msno

msno.matrix(df);

In [66]: num_imputer = SimpleImputer(strategy='median')


cat_imputer = SimpleImputer(strategy='most_frequent')

# Impute numerical columns


df[num_features] = num_imputer.fit_transform(df[num_features])

# Impute categorical columns


df[cat_features] = cat_imputer.fit_transform(df[cat_features])

In [27]: # Let's observe our data in a table
def get_unique_values(df):
    output_data = []
    for col in df.columns:
        # If the number of unique values in the column is less than or equal to 10
        if df.loc[:, col].nunique() <= 10:
            # Get the unique values in the column
            unique_values = df.loc[:, col].unique()
            # Append the column name, number of unique values, unique values, and data type
            output_data.append([col, df.loc[:, col].nunique(), unique_values, df.loc[:, col].dtype])
        else:
            # Otherwise, append only the column name, number of unique values, and data type
            output_data.append([col, df.loc[:, col].nunique(), "-", df.loc[:, col].dtype])
    output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Unique Values', 'Unique Values', 'Data Type'])
    return output_df

In [28]: get_unique_values(df)


Out[28]:
    Column Name     Number of Unique Values  Unique Values                                       Data Type
0   age             73                       -                                                   float64
1   workclass       8                        [Private, State-gov, Federal-gov, Self-emp-not...   object
2   fnlwgt          21648                    -                                                   float64
3   education       16                       -                                                   object
4   education.num   16                       -                                                   float64
5   marital.status  7                        [Widowed, Divorced, Separated, Never-married, ...   object
6   occupation      14                       -                                                   object
7   relationship    6                        [Not-in-family, Unmarried, Own-child, Other-re...   object
8   race            5                        [White, Black, Asian-Pac-Islander, Other, Amer...   object
9   sex             2                        [Female, Male]                                      object
10  capital.gain    119                      -                                                   float64
11  capital.loss    92                       -                                                   float64
12  hours.per.week  94                       -                                                   float64
13  native.country  41                       -                                                   object
14  income          2                        [0, 1]                                              int64

In [67]: import plotly.graph_objects as go


from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2,


subplot_titles=("Unique values per Categorical feature", "U

for col_type, col, color in [("exclude", 1, '#016CC9'), ("include", 2, '#DEB078


temp_data = df.select_dtypes(**{col_type: "number"}).nunique().sort_values(
fig.add_trace(go.Bar(x=temp_data.index, y=temp_data.values, marker=dict(col

fig.show()


Feature Engineering and Outliers

Categorical Features
In [11]: df[cat_features].columns

Out[11]: Index(['workclass', 'education', 'marital.status', 'occupation',


'relationship', 'race', 'sex', 'native.country'],
dtype='object')

In [31]: sorted_workclass = ['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov', 'S

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

counts = df['workclass'].value_counts().reindex(sorted_workclass[::-1])
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('Workclass', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=12)
ax1.tick_params(axis='y', labelsize=16)

# Second plot: Workclass Distribution by Income
sns.countplot(y=df["workclass"], hue=df['income'].astype(str), ax=ax2, palette=
ax2.set_title('Workclass Distribution by Income', fontsize=16)
ax2.legend(title='Income', loc='lower right', fontsize=12, title_fontsize='18')
ax2.tick_params(axis='y', labelsize=16);

General Insights

The private sector is by far the largest work class, and within it the low-income
group clearly outnumbers the high-income group.

Among self-employed individuals, those who are incorporated earn higher incomes
compared to those who are not incorporated.

For local, state, and federal government jobs, the low-income category is
dominant; however, a significant portion also falls into the high-income category.

Individuals who work without pay and those who have never worked are generally
found in the low-income category.
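
Comparisons like these are easier to check as shares than as raw bar heights, because the work classes differ so much in size. This is a small sketch on made-up rows, assuming income is already encoded as 0 (<=50K) and 1 (>50K):

```python
import pandas as pd

# Toy rows (illustrative only); income is 0 for <=50K and 1 for >50K.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp-inc", "Self-emp-inc",
                  "Self-emp-not-inc", "Self-emp-not-inc"],
    "income": [0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of rows with income == 1.
high_income_share = (toy.groupby("workclass")["income"]
                        .mean()
                        .sort_values(ascending=False))
print(high_income_share)
```

On the real data the same `groupby` on `df` would give the high-income share per work class.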

In [32]: sorted_education = df['education'].value_counts().index[::-1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# First plot: Top Education Levels


counts = df['education'].value_counts().reindex(sorted_education)
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('Education Levels', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=16)
ax1.tick_params(axis='y', labelsize=16)

# Second plot: Education Distribution by Income


sns.countplot(y=df["education"], hue=df['income'].astype(str), ax=ax2, palette=
ax2.set_title('Education Distribution by Income', fontsize=16)
ax2.legend(title='Income', loc='lower right', fontsize=16, title_fontsize='18')
ax2.tick_params(axis='y', labelsize=16)

plt.tight_layout()
plt.show()


In [68]: df['education'].replace(['1st-4th', '5th-6th'], 'elementary_school', inplace=True)
df['education'].replace(['7th-8th', '9th', '10th', '11th', '12th'], 'secondary_
df['education'].replace(['Assoc-acdm', 'Assoc-voc'], 'Assoc', inplace=True)

Category Merging: Dividing education levels into too many categories can complicate
data analysis and modeling processes. Therefore, similar levels have been combined to
form larger and more meaningful categories.
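
One way to express such a merge is a single mapping that is assigned back to the column, which avoids relying on `inplace=True` on a column selection. The sketch below uses a toy Series. The label `secondary_school` is a hypothetical name, because the notebook's own label is cut off in the export.

```python
import pandas as pd

# Toy column (illustrative values only).
education = pd.Series(["1st-4th", "5th-6th", "9th", "Assoc-voc", "Bachelors"])

merge_map = {
    "1st-4th": "elementary_school",
    "5th-6th": "elementary_school",
    "9th": "secondary_school",   # hypothetical label, see note above
    "Assoc-voc": "Assoc",
}

# replace() with a dict leaves unmapped values (e.g. "Bachelors") untouched.
merged = education.replace(merge_map)
print(merged.tolist())
```

Keeping the mapping in one dict also makes it easy to review which original levels end up in which merged category.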

In [34]: fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# First plot: Marital Status Counts


counts = df['marital.status'].value_counts()
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('marital.status', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=16)
ax1.tick_params(axis='y', labelsize=16)

# Second plot: Marital Status Distribution by Income


sns.countplot(y=df["marital.status"], hue=df['income'].astype(str), ax=ax2, pal
ax2.set_title('marital.status Distribution by Income', fontsize=16)
ax2.legend(title='Income', loc='lower right', fontsize=16, title_fontsize='18')
ax2.tick_params(axis='y', labelsize=16)

plt.tight_layout()
plt.show()


In [69]: # Merge marital-status categories with one mapping, assigned back to the column
df['marital.status'] = df['marital.status'].replace({
    'Never-married': 'NotMarried',
    'Married-AF-spouse': 'Married',
    'Married-civ-spouse': 'Married',
    'Married-spouse-absent': 'Separated',
    'Divorced': 'Widowed',
})

Marital Status Categories Merging: To simplify the analysis, similar marital status
categories are combined into four groups: NotMarried (Never-married), Married
(Married-civ-spouse, Married-AF-spouse), Separated (Separated, Married-spouse-absent),
and Widowed (Widowed, Divorced). Fewer distinct categories make the data more
manageable and the results easier to interpret, and may help model performance.

In [41]: sns.histplot(data=df, x='age', hue='marital.status', multiple='stack', palette=


General Insights

Marriage and Age: Marriage rates are low among young adults, peak in middle age,
and decline again in older age. This indicates that focusing on education and career is
common in early life, marriage and family building are more prevalent in middle age,
and loss of a spouse increases in older age.

Tendency Not to Marry: The non-marriage rates are higher among younger age
groups, suggesting that education and career-oriented lifestyles are more common
in modern societies.

Loss of Spouse and Separations: Widowhood is more common in older age,
while separations are more concentrated in middle age. This suggests that both
increased rates of spouse loss due to health reasons and midlife crises or marital
problems are more frequent in these age groups.

In [42]: sns.countplot(y='occupation', hue='income', data=df, order=df['occupation'].val


In [43]: sns.boxplot(y='occupation', x='age', data=df, order=df['occupation'].value_coun

In [44]: sns.countplot(y='occupation', hue='sex', data=df, order=df['occupation'].value_


plt.title('Occupation Distribution by Sex', fontsize=16);


In [46]: pivot_table = df.pivot_table(index='education', columns='occupation', aggfunc=


sns.heatmap(pivot_table, annot=True, fmt='d', cmap='viridis');

General Insights

Income and Occupation: Professional and managerial roles yield higher incomes,
while service and manual labor roles are lower-income.
Age and Occupation: Older individuals are more prevalent in high-responsibility
roles, whereas younger individuals occupy more entry-level or physically
demanding jobs.
Gender and Occupation: There are significant gender disparities, with males
dominating technical and managerial fields and females more present in clerical
and service roles.
Education and Occupation: Higher education levels correlate with higher-level
occupations, whereas lower education levels are sufficient for service and manual
jobs.
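
Claims such as the gender split per occupation can be checked with a row-normalized cross-tabulation rather than by eye. This is a minimal sketch on made-up rows; on the notebook's data the same call would take `df["occupation"]` and `df["sex"]`.

```python
import pandas as pd

# Toy rows (illustrative only).
toy = pd.DataFrame({
    "occupation": ["Adm-clerical", "Adm-clerical", "Craft-repair", "Craft-repair"],
    "sex": ["Female", "Female", "Male", "Female"],
})

# normalize="index" turns each row into proportions that sum to 1.
share = pd.crosstab(toy["occupation"], toy["sex"], normalize="index")
print(share)
```

The same pattern works for occupation against income or education, which are the other comparisons listed above.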

In [47]: sns.countplot(y='relationship', hue='income', data=df, order=df['relationship']

In [48]: sns.countplot(y='race', hue='income', data=df, order=df['race'].value_counts()


In [71]: df['race'].replace(['Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other'],' Othe

In [15]: fig = plt.figure(figsize = (10,6))


ax = fig.add_axes([0,0,1,1])
counts = df["native.country"].value_counts().sort_values(ascending=False).head(
counts.plot(kind = "bar")
plt.title('Top 20 Native Countries')
plt.xlabel('native.country')
plt.xticks(rotation = 90)
ax.bar_label(ax.containers[0], labels=counts.values, fontsize=12);


In [70]: df['native.country'] = df['native.country'].replace({


"USA": "United-States"
}).apply(lambda x: "United-States" if x == "United-States" else ("Mexico" if x

General Insights:

Data Imbalance:

The dataset is predominantly composed of individuals from the United States, with a
minor but noticeable representation from Mexico and a variety of other countries. This
heavy skew could limit how well any model or analysis generalizes to other groups, so
the feature needs careful handling to avoid biases.

Given the comparatively large number of rows from Mexico, segmented analyses (e.g.,
comparing outcomes between US natives and Mexican immigrants) might be feasible and
insightful.

For other countries with smaller representations, aggregated analyses might be more
appropriate.
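
Pooling the small countries into one group, as suggested above, could look roughly like this. The sketch uses a toy Series, and the group name "Other" is an assumption, since the notebook's own replacement line is cut off in the export.

```python
import pandas as pd

# Toy column (illustrative values only).
country = pd.Series(["United-States", "Mexico", "India", "United-States", "Cuba"])

# Countries kept as their own category; everything else is pooled.
keep = {"United-States", "Mexico"}

# where() keeps values that satisfy the condition and substitutes "Other" elsewhere.
grouped = country.where(country.isin(keep), "Other")
print(grouped.value_counts())
```

This reduces the number of one-hot columns the feature produces later, at the cost of losing distinctions between the pooled countries.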

Numerical Features
In [54]: df['age_bin'] = pd.cut(df['age'], bins=20)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))


sns.countplot(y='age_bin', data=df, palette='viridis', ax=ax1)
sns.histplot(data=df, x='age', hue='income', kde=True, palette='viridis', ax=ax

In [72]: px.histogram(df, x='capital.gain', color="income", barmode='group', title='Inco


In [73]: px.histogram(df, x='capital.loss', color="income", barmode='group', title='Inco

In [74]: df['capital_diff'] = df['capital.gain'] - df['capital.loss']


df['capital_diff'] = pd.cut(df['capital_diff'], bins = [-5000, 5000, 100000], l
df['capital_diff'] = df['capital_diff'].astype('object')
df.drop(['capital.gain'], axis = 1, inplace = True)
df.drop(['capital.loss'], axis = 1, inplace = True)

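Purpose: To combine the capital.gain and capital.loss columns into a single feature,
the net capital gain (capital.gain - capital.loss).

Result: A new column named capital_diff is created and binned into two categories at
a threshold of 5,000; the original capital.gain and capital.loss columns are dropped.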
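As a minimal sketch of that binning rule, the function below reproduces the bin edges [-5000, 5000, 100000] in plain Python. The labels 'Low' and 'High' are an assumption taken from the category order used later for the ordinal encoder, and `capital_diff_label` is a hypothetical helper, not part of the notebook.

```python
def capital_diff_label(gain, loss, lower=-5000, threshold=5000, upper=100000):
    """Bin net capital (gain - loss) into 'Low' or 'High'.

    Uses right-closed intervals (lower, threshold] and (threshold, upper],
    which is how pd.cut treats bin edges by default. Values outside both
    intervals return None, where pd.cut would produce NaN.
    """
    diff = gain - loss
    if lower < diff <= threshold:
        return "Low"
    if threshold < diff <= upper:
        return "High"
    return None


# (gain, loss) pairs chosen only to exercise each branch.
examples = [(0, 0), (7688, 0), (0, 1902), (5000, 0), (99999, 0)]
labels = [capital_diff_label(g, l) for g, l in examples]
print(labels)
```

A net value of exactly 5000 falls in the lower bin because the intervals are closed on the right.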

In [75]: px.histogram(df, x='capital_diff', color="income", barmode='group', title='Inco

In [76]: px.histogram(df, x='hours.per.week', color="income", barmode='group', title='In

In [43]: sns.boxplot(data=df,y="hours.per.week",x='income', whis=3);

In [44]: outliers = df[df['hours.per.week'] > 80]


outliers_income_counts = outliers['income'].value_counts()
outliers_income_counts

Out[44]: income
0 145
1 63
Name: count, dtype: int64

In [45]: outliers = df[df['hours.per.week'] < 15]


outliers_income_counts = outliers['income'].value_counts()
outliers_income_counts

Out[45]: income
0 892
1 81
Name: count, dtype: int64

In [77]: df = df[~((df["hours.per.week"] > 80) | (df["hours.per.week"] < 15))]

The cells above examine the income distribution of individuals with extremely high
(more than 80) or extremely low (fewer than 15) weekly working hours, and then remove
these outliers from the dataset.
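The two value_counts outputs can be turned into rates with a few lines of arithmetic. The sketch assumes that income == 1 marks the higher-income group; the counts themselves are copied from Out[44] and Out[45].

```python
# Income counts among the two outlier groups, copied from the outputs above.
counts = {
    "over_80_hours": {0: 145, 1: 63},
    "under_15_hours": {0: 892, 1: 81},
}

# Share of income == 1 within each group (assumed to be the higher-income class).
rates = {name: c[1] / (c[0] + c[1]) for name, c in counts.items()}

# Total number of rows dropped by the filter on hours.per.week.
removed = sum(c[0] + c[1] for c in counts.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} with income == 1")
print("rows removed:", removed)
```

Under that assumption, the very long working weeks carry a much larger share of the positive class than the very short ones, so the filter removes two quite different groups.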

In [78]: df.drop(['fnlwgt'], axis = 1, inplace = True)

fnlwgt: The analysis showed that fnlwgt has an almost negligible effect on the
model, so it was dropped from the data.

In [48]: df.shape

Out[48]: (31356, 13)

In [49]: df.columns

Out[49]: Index(['age', 'workclass', 'education', 'education.num', 'marital.status',
                'occupation', 'relationship', 'race', 'sex', 'hours.per.week',
                'native.country', 'income', 'capital_diff'],
               dtype='object')

Correlation
In [79]: numeric_df = df.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()

In [86]: sns.heatmap(corr_matrix, annot=True, cmap="Blues");

In [87]: def plot_target_correlation_heatmap(df, target_variable):
    df_numeric = df.select_dtypes(include=[np.number])
    df_corr_target = df_numeric.corr()

    plt.figure(figsize=(2, 7))
    sns.heatmap(df_corr_target[[target_variable]], annot=True, vmin=-1, vmax=1,
    plt.title(f'Correlation with {target_variable}')
    plt.show()

plot_target_correlation_heatmap(df, 'income')

Multicollinearity
In [52]: def color_correlation1(val):
    """
    Takes a scalar and returns a string with
    the css property in a variety of color scales
    for different correlations.
    """
    if val >= 0.6 and val < 0.99999 or val <= -0.6 and val > -0.99999:
        color = 'red'
    elif val < 0.6 and val >= 0.3 or val > -0.6 and val <= -0.3:
        color = 'blue'
    elif val == 1:
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color

numeric_df = df.select_dtypes(include=[np.number])

numeric_df.corr().style.applymap(color_correlation1)

Out[52]:                      age  education.num  hours.per.week    income
         age             1.000000       0.035414        0.110759  0.244210
         education.num   0.035414       1.000000        0.163208  0.336660
         hours.per.week  0.110759       0.163208        1.000000  0.241994
         income          0.244210       0.336660        0.241994  1.000000

In [80]: X = df.drop("income", axis=1)


y = df['income']

Models

Train | Test Split


In [81]: X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2,
stratify=y,
random_state=42)
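A quick way to confirm that `stratify=y` did its job is to compare the class shares in the two splits. The sketch below uses the per-class support counts that appear in the classification reports later in the notebook; the helper name `positive_rate` is an assumption.

```python
# Per-class row counts of each split, as reported in the "support" columns.
train_counts = {0: 18928, 1: 6156}
test_counts = {0: 4733, 1: 1539}


def positive_rate(counts):
    """Fraction of rows that belong to class 1."""
    return counts[1] / (counts[0] + counts[1])


train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)

# Fraction of all rows that ended up in the test set (should be close to 0.2).
n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_fraction = n_test / (n_train + n_test)

print(f"class-1 share: train={train_rate:.4f}, test={test_rate:.4f}")
print(f"test fraction: {test_fraction:.4f}")
```

Both splits keep roughly the same minority-class share, which makes train and test metrics directly comparable.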

make_column_transformer
In [24]: df.columns

Out[24]: Index(['age', 'workclass', 'education', 'education.num', 'marital.status',
                'occupation', 'relationship', 'race', 'sex', 'hours.per.week',
                'native.country', 'income', 'capital_diff'],
               dtype='object')

In [82]: cat_onehot = [
'workclass', 'occupation', 'relationship', 'race', 'sex', 'native.country',
'marital.status'
]
cat_ordinal = ['education', 'capital_diff']

cat_for_edu = [
'Preschool', 'elementary_school', 'secondary_school', 'HS-grad',
'Some-college', 'Assoc', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate
]
cat_for_capdiff = ['Low', 'High']

In [83]: column_trans = make_column_transformer(


(OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_onehot),
(OrdinalEncoder(categories=[cat_for_edu, cat_for_capdiff]), cat_ordinal),
remainder=StandardScaler())
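To make the two encodings concrete, here is a plain-Python sketch of what they do conceptually. It is not the scikit-learn implementation; the helper names and the 'Female'/'Male' category list are assumptions for illustration.

```python
def ordinal_encode(value, ordered_categories):
    """Map a category to its position in an explicit order,
    similar in spirit to OrdinalEncoder(categories=[...])."""
    return ordered_categories.index(value)


def one_hot_encode(value, known_categories):
    """One indicator per known category. An unseen value yields all zeros,
    which mirrors the effect of handle_unknown='ignore'."""
    return [1 if value == c else 0 for c in known_categories]


capdiff_order = ["Low", "High"]       # same order as cat_for_capdiff
sex_categories = ["Female", "Male"]   # assumed category list for illustration

print(ordinal_encode("High", capdiff_order))      # position in the given order
print(one_hot_encode("Male", sex_categories))     # indicator vector
print(one_hot_encode("Unknown", sex_categories))  # unseen category
```

The ordinal version keeps the ranking information in a single column, while the one-hot version avoids implying any order between categories.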

Logistic Regression Model


In [84]: operations = [("transformer", column_trans), ("logistic", LogisticRegression(ma

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[84]: Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> LogisticRegression)

In [85]: ConfusionMatrixDisplay.from_estimator(pipe_model,
X_test,
y_test,
normalize='true');

In [86]: from yellowbrick.classifier import ClassPredictionError


visualizer = ClassPredictionError(pipe_model)
# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)
# Evaluate the model on the test data
visualizer.score(X_test, y_test)
# Draw visualization
visualizer.show();

In [87]: def eval_metric(model, X_train, y_train, X_test, y_test, i):
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    print(f"{i} Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print(f"{i} Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [88]: eval_metric(pipe_model, X_train, y_train, X_test, y_test, "logistic")

logistic Test_Set
[[4443  290]
 [ 636  903]]
              precision    recall  f1-score   support

           0       0.87      0.94      0.91      4733
           1       0.76      0.59      0.66      1539

    accuracy                           0.85      6272
   macro avg       0.82      0.76      0.78      6272
weighted avg       0.85      0.85      0.85      6272

logistic Train_Set
[[17632  1296]
 [ 2528  3628]]
              precision    recall  f1-score   support

           0       0.87      0.93      0.90     18928
           1       0.74      0.59      0.65      6156

    accuracy                           0.85     25084
   macro avg       0.81      0.76      0.78     25084
weighted avg       0.84      0.85      0.84     25084
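The class-1 figures in the report follow directly from the confusion matrix. As a minimal sketch (the helper `binary_metrics` is hypothetical), the test-set matrix of the baseline logistic model gives:

```python
def binary_metrics(cm):
    """Compute class-1 metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows = true labels, columns = predictions)."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Test-set confusion matrix of the baseline logistic regression pipeline.
metrics = binary_metrics([[4443, 290], [636, 903]])

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The high accuracy mostly reflects the majority class; the recall shows that a large share of the class-1 rows is still missed.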

Cross Validate
In [113… operations = [("transformer", column_trans), ("logistic", LogisticRegression(ma

pipecv_model = Pipeline(steps=operations)

cv = StratifiedKFold(n_splits=10)

scores = cross_validate(pipecv_model,
X_train,
y_train,
scoring=["accuracy", "precision", "recall", "f1"],
cv=cv,
return_train_score = True)
df_scores = pd.DataFrame(scores, index=range(1,11))
df_scores.mean()[2:]

Out[113… test_accuracy 0.846596
train_accuracy 0.847543
test_precision 0.734494
train_precision 0.736169
test_recall 0.588371
train_recall 0.590355
test_f1 0.653002
train_f1 0.655246
dtype: float64

Precision Recall Curve and Roc Curve Display


In [30]: RocCurveDisplay.from_estimator(pipe_model, X_test, y_test);

In [31]: PrecisionRecallDisplay.from_estimator(pipe_model, X_test, y_test);

GridSearchCV
param_grid = [ { "logistic__penalty" : ['l1', 'l2'], "logistic__C" : [0.01, 0.05,0.03, 0.1, 1],
"logistic__class_weight": ["balanced", None] , "logistic__solver": ['liblinear', 'saga', 'lbfgs'],
"logistic__max_iter": [1000, 2000] } ]

Many parameter grids were tried. Finally, the following hyperparameters were selected.
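To get a feel for the cost of the full grid listed above, the combinations can be enumerated with itertools. This is a rough sketch: the fold count of 10 matches the StratifiedKFold used in this notebook, and the check for unsupported pairs relies on the fact that the lbfgs solver does not support the l1 penalty.

```python
from itertools import product

# The full search space listed above, without the "logistic__" prefix.
grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.05, 0.03, 0.1, 1],
    "class_weight": ["balanced", None],
    "solver": ["liblinear", "saga", "lbfgs"],
    "max_iter": [1000, 2000],
}

# Every combination of one value per hyperparameter.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]

# lbfgs cannot fit an l1-penalized model, so these candidates would fail.
unsupported = [c for c in candidates if c["penalty"] == "l1" and c["solver"] == "lbfgs"]

n_folds = 10
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", len(unsupported), "unsupported,", total_fits, "fits")
```

Splitting the grid into one dictionary per solver family would avoid fitting combinations that cannot succeed.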

In [89]: operations = [("transformer", column_trans), ("logistic", LogisticRegression(ra

log_model = Pipeline(steps=operations)

param_grid = [
{
"logistic__penalty" : ['l1'],
"logistic__C" : [0.03],
"logistic__class_weight": ["balanced"] ,
"logistic__solver": ['saga'],
"logistic__max_iter": [1000]
}
]
cv = StratifiedKFold(n_splits = 10)

grid_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",

n_jobs = -1,
return_train_score=True).fit(X_train, y_train)
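The `class_weight="balanced"` setting reweights each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the training-set class counts shown in the reports; during cross-validation the weights are recomputed on each training fold, so the exact values differ slightly. `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weights proportional to the inverse class frequency:
    n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Class counts of the full training split.
weights = balanced_class_weights({0: 18928, 1: 6156})
print(weights)
```

With these weights both classes contribute the same total weight to the loss, which pushes the model toward higher recall on the minority class at the expense of precision.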

In [90]: grid_model.best_estimator_

Out[90]: Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> LogisticRegression)

In [91]: grid_model.best_score_

Out[91]: 0.683264633932356

In [92]: grid_model.best_index_

Out[92]: 0

In [93]: pd.DataFrame(grid_model.cv_results_).loc[0, ["mean_test_score", "mean_train_sco

Out[93]: mean_test_score 0.683265
mean_train_score 0.682596
Name: 0, dtype: object

In [94]: y_pred = grid_model.predict(X_test)


y_pred_proba = grid_model.predict_proba(X_test)

log_f1 = f1_score(y_test, y_pred)

log_recall = recall_score(y_test, y_pred)

log_auc = roc_auc_score(y_test, y_pred_proba[:, 1])  # AUC from class-1 probabilities, not hard labels

precision, recall, _ = precision_recall_curve(y_test, grid_model.predict_proba(


log_prc = auc(recall, precision)

log_grid_model = eval_metric(grid_model, X_train, y_train, X_test, y_test,"logi


log_grid_model
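ROC AUC is a ranking metric, so it is normally computed from continuous scores such as the output of predict_proba or decision_function. The sketch below uses made-up toy data and a hypothetical pairwise helper to show that hard 0/1 predictions collapse the value to the average of sensitivity and specificity.

```python
def auc_from_scores(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half). Equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
hard = [int(s >= 0.5) for s in scores]  # thresholded predictions

auc_scores = auc_from_scores(y_true, scores)
auc_hard = auc_from_scores(y_true, hard)

# Balanced accuracy of the hard predictions: mean of TPR and TNR.
tpr = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1) / y_true.count(1)
tnr = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0) / y_true.count(0)

print(auc_scores, auc_hard, (tpr + tnr) / 2)
```

Passing scores keeps the full ranking information, whereas thresholded labels only describe a single operating point on the curve.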

logisticgrid Test_Set
[[3783  950]
 [ 253 1286]]
              precision    recall  f1-score   support

           0       0.94      0.80      0.86      4733
           1       0.58      0.84      0.68      1539

    accuracy                           0.81      6272
   macro avg       0.76      0.82      0.77      6272
weighted avg       0.85      0.81      0.82      6272

logisticgrid Train_Set
[[15047  3881]
 [  952  5204]]
              precision    recall  f1-score   support

           0       0.94      0.79      0.86     18928
           1       0.57      0.85      0.68      6156

    accuracy                           0.81     25084
   macro avg       0.76      0.82      0.77     25084
weighted avg       0.85      0.81      0.82     25084

In [95]: log_grid_matrix = ConfusionMatrixDisplay.from_estimator(grid_model,


X_test,
y_test,
normalize='true');

In [96]: RocCurveDisplay.from_estimator(grid_model, X_test, y_test);

In [97]: PrecisionRecallDisplay.from_estimator(grid_model, X_test, y_test);

KNN Model
In [39]: operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[39]: Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> KNeighborsClassifier)

In [164… eval_metric(pipe_model, X_train, y_train, X_test, y_test, "knn")

knn Test_Set
[[4286  447]
 [ 655  884]]
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      4733
           1       0.66      0.57      0.62      1539

    accuracy                           0.82      6272
   macro avg       0.77      0.74      0.75      6272
weighted avg       0.82      0.82      0.82      6272

knn Train_Set
[[17746  1182]
 [ 1881  4275]]
              precision    recall  f1-score   support

           0       0.90      0.94      0.92     18928
           1       0.78      0.69      0.74      6156

    accuracy                           0.88     25084
   macro avg       0.84      0.82      0.83     25084
weighted avg       0.87      0.88      0.88     25084

In [168… RocCurveDisplay.from_estimator(pipe_model, X_test, y_test);

In [169… PrecisionRecallDisplay.from_estimator(pipe_model, X_test, y_test);

Elbow Method for Choosing Reasonable K Values


In [98]: operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[98]: Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> KNeighborsClassifier)

In [172… test_error_rates = []

for k in range(1, 10):
    operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n
    knn_pipe_model = Pipeline(steps=operations)
    scores = cross_validate(knn_pipe_model, X_train, y_train, scoring = ['f1'],
    f1_mean = scores["test_f1"].mean()
    test_error = 1 - f1_mean
    test_error_rates.append(test_error)

In [174… plt.figure(figsize=(15, 8))


plt.plot(range(1, 10),
test_error_rates,
color='red',
marker='o',
markerfacecolor='yellow',
markersize=10)

plt.title('Error Rate vs. K Value')


plt.xlabel('K_values')
plt.ylabel('Error Rate')
plt.hlines(y=0.25, xmin=0, xmax=20, colors='b', linestyles="--")
plt.hlines(y=0.65, xmin=0, xmax=20, colors='b', linestyles="--")

Out[174… <matplotlib.collections.LineCollection at 0x2604749d090>

Overfitting and Underfitting Check for K Values


In [175… test_error_rates = []
train_error_rates = []

for k in range(1, 10):
    operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n
    knn_pipe_model = Pipeline(steps=operations)
    knn_pipe_model.fit(X_train, y_train)
    scores = cross_validate(knn_pipe_model, X_train, y_train, scoring = ['f1'],
    f1_test_mean = scores["test_f1"].mean()
    f1_train_mean = scores["train_f1"].mean()
    test_error = 1 - f1_test_mean
    train_error = 1 - f1_train_mean
    test_error_rates.append(test_error)
    train_error_rates.append(train_error)

In [176… plt.figure(figsize=(15, 8))


plt.plot(range(1, 10),
test_error_rates,
color='red',
marker='o',
markerfacecolor='yellow',
markersize=10)

plt.plot(range(1, 10),
train_error_rates,
color='red',
marker='o',
markerfacecolor='green',
markersize=10)

plt.title('Error Rate vs. K Value')


plt.xlabel('K_values')
plt.ylabel('Error Rate')
plt.hlines(y=0.25, xmin=0, xmax=20, colors='b', linestyles="--")
plt.hlines(y=0.65, xmin=0, xmax=20, colors='b', linestyles="--")

Out[176… <matplotlib.collections.LineCollection at 0x260475d7150>

In [177… k_list = [3, 5, 7]

for i in k_list:
    operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n
    knn = Pipeline(steps=operations)
    knn.fit(X_train, y_train)
    print(f'WITH K={i}\n')
    eval_metric(knn, X_train, y_train, X_test, y_test, "knn_elbow")

WITH K=3

knn_elbow Test_Set
[[4236  497]
 [ 684  855]]
              precision    recall  f1-score   support

           0       0.86      0.89      0.88      4733
           1       0.63      0.56      0.59      1539

    accuracy                           0.81      6272
   macro avg       0.75      0.73      0.73      6272
weighted avg       0.80      0.81      0.81      6272

knn_elbow Train_Set
[[17848  1080]
 [ 1568  4588]]
              precision    recall  f1-score   support

           0       0.92      0.94      0.93     18928
           1       0.81      0.75      0.78      6156

    accuracy                           0.89     25084
   macro avg       0.86      0.84      0.85     25084
weighted avg       0.89      0.89      0.89     25084

WITH K=5

knn_elbow Test_Set
[[4286  447]
 [ 655  884]]
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      4733
           1       0.66      0.57      0.62      1539

    accuracy                           0.82      6272
   macro avg       0.77      0.74      0.75      6272
weighted avg       0.82      0.82      0.82      6272

knn_elbow Train_Set
[[17746  1182]
 [ 1881  4275]]
              precision    recall  f1-score   support

           0       0.90      0.94      0.92     18928
           1       0.78      0.69      0.74      6156

    accuracy                           0.88     25084
   macro avg       0.84      0.82      0.83     25084
weighted avg       0.87      0.88      0.88     25084

WITH K=7

knn_elbow Test_Set
[[4315  418]
 [ 647  892]]
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      4733
           1       0.68      0.58      0.63      1539

    accuracy                           0.83      6272
   macro avg       0.78      0.75      0.76      6272
weighted avg       0.82      0.83      0.83      6272

knn_elbow Train_Set
[[17663  1265]
 [ 2017  4139]]
              precision    recall  f1-score   support

           0       0.90      0.93      0.91     18928
           1       0.77      0.67      0.72      6156

    accuracy                           0.87     25084
   macro avg       0.83      0.80      0.82     25084
weighted avg       0.87      0.87      0.87     25084

Cross Validate For Optimal K Value


In [178… operations = [("transformer", column_trans), ("knn", KNeighborsC

model = Pipeline(steps=operations)

scores = cross_validate(model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=10,
return_train_score=True)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]

Out[178… test_accuracy 0.834077
train_accuracy 0.868296
test_precision 0.684224
train_precision 0.763656
test_recall 0.602503
train_recall 0.671017
test_f1 0.640533
train_f1 0.714341
dtype: float64

Gridsearch Method for Choosing Reasonable K Values
In [100… operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]
knn_model = Pipeline(steps=operations)

Many parameter grids were tried, with values of K up to 30. Finally, the following
hyperparameters were selected.

In [101… param_grid = [
{
"knn__n_neighbors": [19],
"knn__metric": ['euclidean'],
"knn__weights": ['uniform']
}
]

knn_grid_model = GridSearchCV(knn_model,
param_grid,
scoring='f1',
cv=5,
return_train_score=True,
n_jobs=-1).fit(X_train, y_train)

In [102… knn_grid_model.best_estimator_

Out[102… Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> KNeighborsClassifier)

In [103… knn_grid_model.best_index_

Out[103… 0

In [104… pd.DataFrame(
knn_grid_model.cv_results_).loc[0,["mean_test_score", "mean_train_score"]]

Out[104… mean_test_score 0.6397
mean_train_score 0.675013
Name: 0, dtype: object

In [105… knn_grid_model.best_score_

Out[105… 0.6396999259520801

In [106… y_pred = knn_grid_model.predict(X_test)


y_pred_proba = knn_grid_model.predict_proba(X_test)

knn_f1 = f1_score(y_test, y_pred)

knn_recall = recall_score(y_test, y_pred)

knn_auc = roc_auc_score(y_test, y_pred_proba[:, 1])  # AUC from class-1 probabilities, not hard labels

precision, recall, _ = precision_recall_curve(y_test, knn_grid_model.predict_pr


knn_prc = auc(recall, precision)

eval_metric(knn_grid_model, X_train, y_train, X_test, y_test, "knn_grid") #k=19

knn_grid Test_Set
[[4367  366]
 [ 628  911]]
              precision    recall  f1-score   support

           0       0.87      0.92      0.90      4733
           1       0.71      0.59      0.65      1539

    accuracy                           0.84      6272
   macro avg       0.79      0.76      0.77      6272
weighted avg       0.83      0.84      0.84      6272

knn_grid Train_Set
[[17492  1436]
 [ 2269  3887]]
              precision    recall  f1-score   support

           0       0.89      0.92      0.90     18928
           1       0.73      0.63      0.68      6156

    accuracy                           0.85     25084
   macro avg       0.81      0.78      0.79     25084
weighted avg       0.85      0.85      0.85     25084

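Raising K to 19 improved the test scores only marginally (class-1 F1 of 0.65 versus
0.62 at K=5), but it narrowed the gap between train and test performance (train F1
0.68 versus test F1 0.65), so overfitting is reduced and the results are more reliable.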
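The overfitting argument can be made explicit by comparing the class-1 F1 on the train and test sets for each K. The numbers below are copied from the classification reports above; the dictionary layout is just an illustrative choice.

```python
# Class-1 F1 scores as (train, test), taken from the reports for each K.
f1_scores = {
    3: (0.78, 0.59),
    5: (0.74, 0.62),
    7: (0.72, 0.63),
    19: (0.68, 0.65),
}

# Train-test gap per K; a smaller gap indicates less overfitting.
gaps = {k: round(train - test, 2) for k, (train, test) in f1_scores.items()}

smallest_gap_k = min(gaps, key=gaps.get)
best_test_k = max(f1_scores, key=lambda k: f1_scores[k][1])

print(gaps)
print("smallest gap at K =", smallest_gap_k, "| best test F1 at K =", best_test_k)
```

Among the values compared here, the same K gives both the smallest gap and the highest test F1.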

Precision Recall Curve and Roc Curve Display


In [190… RocCurveDisplay.from_estimator(knn_grid_model, X_test, y_test);

In [191… y_pred_proba = knn.predict_proba(X_test)


roc_auc_score(y_test, y_pred_proba[:,1])

Out[191… 0.859903581601922

In [192… PrecisionRecallDisplay.from_estimator(knn_grid_model, X_test, y_test)

Out[192… <sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x2604727de50>

SVM Model
In [107… operations = [("transformer", column_trans),("SVC", SVC(random_state=42))]
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)

Out[107… Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> SVC)

Model Performance
In [194… eval_metric(pipe_model, X_train, y_train, X_test, y_test, "svm")

svm Test_Set
[[4498  235]
 [ 692  847]]
              precision    recall  f1-score   support

           0       0.87      0.95      0.91      4733
           1       0.78      0.55      0.65      1539

    accuracy                           0.85      6272
   macro avg       0.82      0.75      0.78      6272
weighted avg       0.85      0.85      0.84      6272

svm Train_Set
[[17901  1027]
 [ 2756  3400]]
              precision    recall  f1-score   support

           0       0.87      0.95      0.90     18928
           1       0.77      0.55      0.64      6156

    accuracy                           0.85     25084
   macro avg       0.82      0.75      0.77     25084
weighted avg       0.84      0.85      0.84     25084

In [198… operations = [("transformer", column_trans), ("SVC", SVC(random_state=42))]

pipe_model = Pipeline(steps=operations)

cv = StratifiedKFold(n_splits=5)

scores = cross_validate(pipe_model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=cv,
return_train_score=True,
n_jobs=-1)

df_scores = pd.DataFrame(scores, index=range(1, 6))


df_scores.mean()[2:]

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 5 | elapsed: 2.5min remaining: 1.7min
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 6.7min finished
Out[198… test_accuracy 0.847353
train_accuracy 0.849097
test_precision 0.764130
train_precision 0.769085
test_recall 0.546947
train_recall 0.550357
test_f1 0.637474
train_f1 0.641591
dtype: float64

GridSearchCV

param_grid = {'SVC__C': [0.01, 0.1, 1, 10, 100], 'SVC__gamma': ["scale", "auto", 0.001,
0.01, 0.1, 0.5], 'SVC__kernel': ['rbf', 'linear'],}

Many parameter grids were tried. Finally, the following hyperparameters were selected.

In [108… param_grid = {"SVC__C":[1],


"SVC__gamma":[0.3],
"SVC__kernel":["rbf"]}

operations = [("transformer", column_trans), ("SVC", SVC(class_weight="balanced

svm_model_grid = GridSearchCV(pipe_model,
param_grid,
scoring="recall_macro",
cv=5,
return_train_score=True,
n_jobs=2,
verbose=2).fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits

In [109… svm_model_grid.best_estimator_

Out[109… Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> SVC)

In [110… svm_model_grid.best_index_

Out[110… 0

In [111… pd.DataFrame(
svm_model_grid.cv_results_).loc[0,
["mean_test_score", "mean_train_score"]]

Out[111… mean_test_score 0.773176
mean_train_score 0.807849
Name: 0, dtype: object

In [112… svm_model_grid.best_score_

Out[112… 0.7731760615298443

In [113… y_pred = svm_model_grid.predict(X_test)


y_pred_proba = svm_model_grid.decision_function(X_test)

svm_f1 = f1_score(y_test, y_pred)

svm_recall = recall_score(y_test, y_pred)

svm_auc = roc_auc_score(y_test, y_pred_proba)  # AUC from decision_function scores, not hard labels

precision, recall, _ = precision_recall_curve(y_test, svm_model_grid.decision_f


svm_prc = auc(recall, precision)

eval_metric(svm_model_grid, X_train, y_train, X_test, y_test, "svm_grid")

svm_grid Test_Set
[[4409  324]
 [ 607  932]]
              precision    recall  f1-score   support

           0       0.88      0.93      0.90      4733
           1       0.74      0.61      0.67      1539

    accuracy                           0.85      6272
   macro avg       0.81      0.77      0.79      6272
weighted avg       0.85      0.85      0.85      6272

svm_grid Train_Set
[[17835  1093]
 [ 2040  4116]]
              precision    recall  f1-score   support

           0       0.90      0.94      0.92     18928
           1       0.79      0.67      0.72      6156

    accuracy                           0.88     25084
   macro avg       0.84      0.81      0.82     25084
weighted avg       0.87      0.88      0.87     25084

In [114… decision_function = svm_model_grid.decision_function(X_test)


average_precision_score(y_test, decision_function)

Out[114… 0.731964885295869

Precision Recall Curve and Roc Curve Display


In [52]: RocCurveDisplay.from_estimator(svm_model_grid, X_test, y_test);

Out[52]: <sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x2b5463ec710>

In [53]: PrecisionRecallDisplay.from_estimator(svm_model_grid, X_test, y_test);

Out[53]: <sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x2b54625b950>

Compare Models Performance


In [115… compare = pd.DataFrame({"Model": ["Logistic Regression", "KNN", "SVM"],
"F1": [log_f1, knn_f1, svm_f1 ],
"Recall": [log_recall, knn_recall, svm_recall ],
"ROC_AUC": [log_auc, knn_auc, svm_auc],
"PRC" : [log_prc, knn_prc, svm_prc]})
def labels(ax):
    for p in ax.patches:
        width = p.get_width()  # bar length
        ax.text(width,  # x position: end of the bar
                p.get_y() + p.get_height() / 2,  # y position: vertical center of the bar
                '{:1.3f}'.format(width),  # text to display
                ha='left',  # horizontal alignment
                va='center')  # vertical alignment
plt.figure(figsize=(14,12))

plt.subplot(411)
compare = compare.sort_values(by="F1", ascending=False)
ax=sns.barplot(x="F1", y="Model", data=compare, palette="magma")
labels(ax)

plt.subplot(412)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="magma")
labels(ax)

plt.subplot(413)
compare = compare.sort_values(by="ROC_AUC", ascending=False)
ax=sns.barplot(x="ROC_AUC", y="Model", data=compare, palette="magma")
labels(ax)

plt.subplot(414)
compare = compare.sort_values(by="PRC", ascending=False)
ax=sns.barplot(x="PRC", y="Model", data=compare, palette="magma")
labels(ax)

plt.show()

Final Model and Model Deployment


In [118… operations = [("transformer", column_trans), ("logistic", LogisticRegression(ra

log_model = Pipeline(steps=operations)

param_grid = [
{
"logistic__penalty" : ['l1'],
"logistic__C" : [0.03],
"logistic__class_weight": ["balanced"] ,
"logistic__solver": ['saga'],
"logistic__max_iter": [1000]
}
]
cv = StratifiedKFold(n_splits = 10)

final_pipe_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",
n_jobs = -1,
return_train_score=True).fit(X, y)

In [119… import pickle


pickle.dump(final_pipe_model, open("final_pipe_model", "wb"))

In [120… new_model = pickle.load(open("final_pipe_model", "rb"))


new_model

Out[120… GridSearchCV(best_estimator_: Pipeline(transformer: ColumnTransformer[onehotencoder: OneHotEncoder, ordinalencoder: OrdinalEncoder, remainder: StandardScaler] -> LogisticRegression))

Prediction
In [126… my_dict= {
'age': [44.0, 32.0, 30.0],
'workclass': ['Federal-gov', 'Private', 'Self-emp-not-inc'],
'education': ['Bachelors', 'Bachelors', 'Some-college'],
'education.num': [13.0, 13.0, 10.0],
'marital.status': ['Widowed', 'Married', 'NotMarried'],
'occupation': ['Tech-support', 'Sales', 'Sales'],
'relationship': ['Not-in-family', 'Husband', 'Other-relative'],
'race': ['White', 'White', 'Others'],
'sex': ['Male', 'Male', 'Male'],
'hours.per.week': [40.0, 40.0, 40.0],
'native.country': ['United-States', 'United-States', 'Other'],
'capital_diff': ['Low', 'Low', 'Low']
}

In [128… sample = pd.DataFrame(my_dict)


sample

Out[128…     age  workclass         education     education.num  marital.status  occupation    relationship
         0  44.0  Federal-gov       Bachelors              13.0  Widowed         Tech-support  Not-in-family
         1  32.0  Private           Bachelors              13.0  Married         Sales         Husband
         2  30.0  Self-emp-not-inc  Some-college           10.0  NotMarried      Sales         Other-relative

In [129… new_model.predict(sample)

Out[129… array([1, 1, 0], dtype=int64)

In [130… new_model.decision_function(sample)

Out[130… array([ 0.27369739, 1.16079392, -2.56091314])
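The two outputs above are linked: for a binary logistic regression, `decision_function` returns the log-odds of the positive class, `predict` returns 1 exactly when that score is positive, and the logistic (sigmoid) function turns the score into a probability. A minimal sketch using only the standard library and the three scores printed above. The names `sigmoid`, `scores`, `labels`, and `probabilities` are illustrative and not part of the notebook.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Scores copied from the decision_function output above.
scores = [0.27369739, 1.16079392, -2.56091314]

# A positive score corresponds to a probability above 0.5, i.e. class 1.
labels = [int(z > 0) for z in scores]
probabilities = [sigmoid(z) for z in scores]

print(labels)
print([round(p, 3) for p in probabilities])
```

The first sample is classified as 1 with a score close to zero, so its probability sits only slightly above 0.5. Small changes in the inputs or in the decision threshold could flip that prediction.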

📂 Conclusion
In [ ]: # Logistic grid: recall = 0.83, f1 = 0.68, prc = 0.75

In an unbalanced dataset, F1-Score and Recall are key metrics because they measure
the model's ability to correctly predict the minority class, which accuracy alone
can hide.

When prioritizing F1-Score and Recall:

The Logistic Regression model stands out with a Recall of 0.83 and an F1-Score of
0.68, the highest value of the three models on both metrics. It captures the
minority class most effectively while keeping the best balance between precision
and recall.

The KNN Model, although it performs well in terms of accuracy, lags behind
Logistic Regression with a Recall of 0.59 and an F1-Score of 0.64. At that recall
it misses roughly 41% of the minority-class cases.

The SVM Model, despite excelling in accuracy, also falls behind Logistic Regression
in these two metrics with a Recall of 0.60 and an F1-Score of 0.66. It misses about
40% of the minority-class cases.

Based on these results, the Logistic Regression model offers the best Recall and
F1-Score on this unbalanced dataset. Since correctly identifying the minority class
is the priority here, it is the preferred model.
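Because F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), the reported Recall and F1-Score values imply an approximate precision for each model: P = F1 * R / (2R - F1). The sketch below applies that rearrangement to the figures quoted above. The inputs are rounded to two decimals, so the results are rough estimates only, and the helper name `precision_from_f1_and_recall` is hypothetical.

```python
def precision_from_f1_and_recall(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    return f1 * recall / (2 * recall - f1)


# (recall, f1) pairs as reported in the conclusion above.
reported = {
    "Logistic Regression": (0.83, 0.68),
    "KNN": (0.59, 0.64),
    "SVM": (0.60, 0.66),
}

implied = {
    name: precision_from_f1_and_recall(f1, recall)
    for name, (recall, f1) in reported.items()
}

for name, precision in implied.items():
    print(f"{name}: implied precision ~ {precision:.2f}")
```

Read this way, Logistic Regression reaches its higher recall at a lower implied precision than KNN or SVM, which is the trade-off accepted when recall on the minority class is the priority.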

THANK YOU

