0% found this document useful (0 votes)

57 views64 pages

Understanding Data

Uploaded by

sunil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

57 views64 pages

Understanding Data

Uploaded by

sunil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 64

13.08.

2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Adult Income Prediction

If you want to be the first to be informed about new projects, please do not forget
to follow us - by Fatma Nur AZMAN
Fatmanurazman.com | Linkedin | Github | Kaggle | Tableau

Understanding The Data

Project Description:
Adult Income Prediction This dataset was obtained from UCI Machine Learning
Repository. The aim of this problem is to classify adults in two different groups
based on their income where group 1 has an income less than USD 50k and group
2 has an income of more than or equal to USD 50k. The data available at hand
comes from Census 1994.

Domain Knowledge:
Economic Conditions

Technological Revolution:

At the beginning of the 1990s, the widespread adoption of the internet and the rapid
development of computer technology led to significant changes in the labor market.
Information technology and service sectors grew rapidly, creating many new job
opportunities.

Economic Growth:

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 1/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

The US economy entered a significant growth period from the mid-1990s. This growth
was supported by low inflation and low unemployment rates. However, economic
opportunities were not equally distributed across all regions and groups.

Social and Political Situation

Diversity and Immigration:

In the 1990s, the number of people immigrating to the US increased. Immigrants

played a crucial role in the labor market and met labor demands in many sectors. This
situation also led to some social tensions and debates.

Education and Workforce:

The increasing importance of education levels in the labor market directly affected
individuals' income levels. Higher-educated individuals generally worked in higher-
paying jobs, while lower-educated individuals had to work in low-wage jobs.

Demographic Changes

Aging Population:

The aging of the baby boomer generation began to put pressure on social security
systems and healthcare services. The increasing number of individuals reaching
retirement age also led to changes in the labor market.

Women's Participation in the Workforce:

Women's participation in the workforce increased significantly in the 1990s. This led to
an increase in household incomes and changes in gender roles in society.

Sectoral Changes

Transformation of the Manufacturing Industry:

In the 1990s, while the manufacturing industry declined in some regions, the service
and technology-based sectors grew. This transformation led to increased
unemployment rates in some areas and economic imbalances.

Globalization:

Globalization led to increased trade and investments. Many US companies moved their
production facilities abroad while gaining access to global markets. This caused some
uncertainties and changes in the labor market.

In this context, the data obtained from the 1994 Census reflects the aforementioned
economic, social, and demographic changes. By examining the impact of education
levels, gender, race, and occupations on income in the labor market, we can better
understand the social dynamics of that period. These analyses can also contribute to
understanding the changes and continuities in comparison with today's conditions.

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 2/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

About the Dataset

Dataset Descriptions:

Rows: 32561
Columns: 15

Attribute
STT Unique Values
Name

1 Age Describes the age of individuals. Continuous.

Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-

2 Workclass
gov, Without-pay, Never-worked.

Continuous. This is a weighting factor created by the US Census

3 fnlwgt Bureau and indicates the number of people represented by each data
entry.

Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,

4 education Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-
6th, Preschool.

education-
5 Number of years spent in education. Continuous.
num

marital- Married-civ-spouse, Divorced, Never-married, Separated, Widowed,

6
status Married-spouse-absent, Married-AF-spouse.

Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-

specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical,
7 occupation
Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv,
Armed-Forces.

8 relationship Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

9 race White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

10 sex Female, Male.

Represents the profit an individual makes from the sale of assets (e.g.,
11 capital-gain
stocks or real estate). Continuous.

Represents the loss an individual incurs from the sale of assets (e.g.,
12 capital-loss
stocks or real estate). Continuous.

hours-per-
13 Continuous.
week

United-States, Cambodia, England, Puerto-Rico, Canada, Germany,

Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba,
native- Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico,
14
country Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan,
Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand,
Yugoslavia, El-Salvador, Trinidad & Tobago, Peru, Hong, Netherlands.

15 salary >50K, <=50K.

Table of CONTENTS
localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 3/64
13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Understanding The Data

Exploratory Data Analysis (EDA)
Feature Engineering and Outliers
Correlation
Models
Logistic Regression Model
KNN Model
SVM Model
Compare Models Performance
Final Model and Model Deployment
Prediction
Conclusion

Import Libraries and Data Review

In [58]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

%matplotlib inline

from sklearn.impute import SimpleImputer

from scipy import stats

from sklearn.model_selection import train_test_split, GridSearchCV, cross_valid
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.compose import make_column_transformer

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_scor

from sklearn.metrics import PrecisionRecallDisplay, roc_curve, average_precisio
from sklearn.metrics import RocCurveDisplay, roc_auc_score, auc
from sklearn.metrics import confusion_matrix, classification_report, ConfusionM

from yellowbrick.regressor import ResidualsPlot, PredictionError

import warnings
warnings.filterwarnings("ignore")

In [59]: df0 = pd.read_csv('adult.csv')

df = df0.copy()

In [3]: df.shape

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 4/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[3]: (32561, 15)

In [4]: df.head()

Out[4]: age workclass fnlwgt education education.num marital.status occupation rela

0 90 ? 77053 HS-grad 9 Widowed ?

Exec-
1 82 Private 132870 HS-grad 9 Widowed
managerial

Some-
2 66 ? 186061 10 Widowed ? U
college

Machine-
3 54 Private 140359 7th-8th 4 Divorced U
op-inspct

Some- Prof-
4 41 Private 264663 10 Separated O
college specialty

In [6]: df.tail()

Out[6]: age workclass fnlwgt education education.num marital.status occupation

Some- Never- Protective-

32556 22 Private 310152 10
college married serv

Assoc- Married-civ- Tech-

32557 27 Private 257302 12
acdm spouse support

Married-civ- Machine-
32558 40 Private 154374 HS-grad 9
spouse op-inspct

Adm-
32559 58 Private 151910 HS-grad 9 Widowed
clerical

Never- Adm-
32560 22 Private 201490 HS-grad 9
married clerical

Exploratory Data Analysis (EDA)

In [7]: df.info()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 5/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

In [60]: df[df == '?'] = np.nan

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 30725 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education.num 32561 non-null int64
5 marital.status 32561 non-null object
6 occupation 30718 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital.gain 32561 non-null int64
11 capital.loss 32561 non-null int64
12 hours.per.week 32561 non-null int64
13 native.country 31978 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

In [9]: df.describe().T

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 6/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[9]: count mean std min 25% 50%

age 32561.0 38.581647 13.640433 17.0 28.0 37.0

fnlwgt 32561.0 189778.366512 105549.977697 12285.0 117827.0 178356.0 23

education.num 32561.0 10.080679 2.572720 1.0 9.0 10.0

capital.gain 32561.0 1077.648844 7385.292085 0.0 0.0 0.0

capital.loss 32561.0 87.303830 402.960219 0.0 0.0 0.0

hours.per.week 32561.0 40.437456 12.347429 1.0 40.0 40.0

In [10]: df.describe(include="object").T

Out[10]: count unique top freq

workclass 30725 8 Private 22696

education 32561 16 HS-grad 10501

marital.status 32561 7 Married-civ-spouse 14976

occupation 30718 14 Prof-specialty 4140

relationship 32561 6 Husband 13193

race 32561 5 White 27816

sex 32561 2 Male 21790

native.country 31978 41 United-States 29170

income 32561 2 <=50K 24720

In [6]: df.duplicated().sum()

Out[6]: 24

In [61]: def duplicate_values(df):

print("Duplicate check...")
num_duplicates = df.duplicated(subset=None, keep='first').sum()
if num_duplicates > 0:
print("There are", num_duplicates, "duplicated observations in the data
df.drop_duplicates(keep='first', inplace=True)
print(num_duplicates, "duplicates were dropped!")
print("No more duplicate rows!")
else:
print("There are no duplicated observations in the dataset.")

In [62]: duplicate_values(df)

Duplicate check...
There are 24 duplicated observations in the dataset.
24 duplicates were dropped!
No more duplicate rows!

In [9]: df.isnull().sum().sum()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 7/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[9]: 4261

In [15]: ax = sns.countplot(x="income", data=df)

ax.bar_label(ax.containers[0]);

Our data is a unbalance data.

Features Summary
In [15]: # !pip install ipywidgets ydata-profiling
#from ydata_profiling import ProfileReport
#profile = ProfileReport(df, title="Profiling Report")
#profile.to_file("profiling_report.html")

In [16]: #!pip install summarytools

from summarytools import dfSummary
dfSummary(df)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 8/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[16]: Data Frame Summary

df
Dimensions: 32,537 x 15
Duplicates: 0

Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)

Mean (sd) : 38.6

(13.6)
age min < med < max: 0
1 73 distinct values
[int64] 17.0 < 37.0 < 90.0 (0.0%)
IQR (CV) : 20.0
(2.8)

1. Private
22,673 (69.7%)
2. Self-emp-not-
2,540 (7.8%)
inc
2,093 (6.4%)
3. Local-gov
1,836 (5.6%)
workclass 4. nan 1,836
2 1,298 (4.0%)
[object] 5. State-gov (5.6%)
1,116 (3.4%)
6. Self-emp-inc
960 (3.0%)
7. Federal-gov
14 (0.0%)
8. Without-pay
7 (0.0%)
9. Never-worked

Mean (sd) :
189780.8
(105556.5)
min < med < max:
fnlwgt 21,648 distinct 0
3 12285.0 <
[int64] values (0.0%)
178356.0 <
1484705.0
IQR (CV) :
119166.0 (1.8)

1. HS-grad 10,494 (32.3%)

2. Some-college 7,282 (22.4%)
3. Bachelors 5,353 (16.5%)
4. Masters 1,722 (5.3%)
5. Assoc-voc 1,382 (4.2%)
education 0
4 6. 11th 1,175 (3.6%)
[object] (0.0%)
7. Assoc-acdm 1,067 (3.3%)
8. 10th 933 (2.9%)
9. 7th-8th 645 (2.0%)
10. Prof-school 576 (1.8%)
11. other 1,908 (5.9%)

Mean (sd) : 10.1

(2.6)
education.num 0
5 min < med < max: 16 distinct values
[int64] (0.0%)
1.0 < 10.0 < 16.0
IQR (CV) : 3.0 (3.9)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26 … 9/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)

1. Married-civ-
spouse
14,970 (46.0%)
2. Never-married
10,667 (32.8%)
3. Divorced
4,441 (13.6%)
marital.status 4. Separated 0
6 1,025 (3.2%)
[object] 5. Widowed (0.0%)
993 (3.1%)
6. Married-
418 (1.3%)
spouse-absent
23 (0.1%)
7. Married-AF-
spouse

1. Prof-specialty
2. Craft-repair
4,136 (12.7%)
3. Exec-managerial
4,094 (12.6%)
4. Adm-clerical
4,065 (12.5%)
5. Sales
3,768 (11.6%)
6. Other-service
3,650 (11.2%)
occupation 7. Machine-op- 1,843
7 3,291 (10.1%)
[object] inspct (5.7%)
2,000 (6.1%)
8. nan
1,843 (5.7%)
9. Transport-
1,597 (4.9%)
moving
1,369 (4.2%)
10. Handlers-
2,724 (8.4%)
cleaners
11. other

1. Husband 13,187 (40.5%)

2. Not-in-family 8,292 (25.5%)
relationship 3. Own-child 5,064 (15.6%) 0
8
[object] 4. Unmarried 3,445 (10.6%) (0.0%)
5. Wife 1,568 (4.8%)
6. Other-relative 981 (3.0%)

1. White
2. Black 27,795 (85.4%)
3. Asian-Pac- 3,122 (9.6%)
race 0
9 Islander 1,038 (3.2%)
[object] (0.0%)
4. Amer-Indian- 311 (1.0%)
Eskimo 271 (0.8%)
5. Other

sex 1. Male 21,775 (66.9%) 0

10
[object] 2. Female 10,762 (33.1%) (0.0%)

Mean (sd) : 1078.4

(7388.0)
capital.gain min < med < max: 0
11 119 distinct values
[int64] 0.0 < 0.0 < (0.0%)
99999.0
IQR (CV) : 0.0 (0.1)

Mean (sd) : 87.4

(403.1)
capital.loss 0
12 min < med < max: 92 distinct values
[int64] (0.0%)
0.0 < 0.0 < 4356.0
IQR (CV) : 0.0 (0.2)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 10/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Freqs / (% of
No Variable Stats / Values Graph Missi
Valid)

Mean (sd) : 40.4

(12.3)
hours.per.week 0
13 min < med < max: 94 distinct values
[int64] (0.0%)
1.0 < 40.0 < 99.0
IQR (CV) : 5.0 (3.3)

1. United-States 29,153 (89.6%)

2. Mexico 639 (2.0%)
3. nan 582 (1.8%)
4. Philippines 198 (0.6%)
5. Germany 137 (0.4%)
native.country 582
14 6. Canada 121 (0.4%)
[object] (1.8%)
7. Puerto-Rico 114 (0.4%)
8. El-Salvador 106 (0.3%)
9. India 100 (0.3%)
10. Cuba 95 (0.3%)
11. other 1,292 (4.0%)

income 1. <=50K 24,698 (75.9%) 0

15
[object] 2. >50K 7,839 (24.1%) (0.0%)

In [17]: import math

num_cols = df.iloc[:, :-1].shape[1]
num_rows = math.ceil(num_cols / 3)

plt.figure(figsize=(15, 5 * num_rows))
for i, col in enumerate(df.iloc[:, :-1].columns, 1):
plt.subplot(num_rows, 3, i)
plt.title(f"Distribution of {col} Data")
sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 11/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [10]: num_cols= df.select_dtypes('number').columns

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 12/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

skew_limit = 0.75 # define a limit above which we will log transf

skew_vals = df[num_cols].skew()

# Showing the skewed columns

skew_cols = (skew_vals
.sort_values(ascending=False)
.to_frame()
.rename(columns={0:'Skew'})
.query('abs(Skew) > {}'.format(skew_limit)))
skew_cols

Out[10]: Skew

capital.gain 11.949403

capital.loss 4.592702

fnlwgt 1.447703

In [20]: sns.pairplot(df, hue= "income", corner=True);

In [63]: cat_features = df.select_dtypes(include=['object']).columns.tolist()

cat_features = [col for col in cat_features if col != 'income']
cat_features

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 13/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[63]: ['workclass',
'education',
'marital.status',
'occupation',
'relationship',
'race',
'sex',
'native.country']

In [64]: num_features = df.select_dtypes(include=['number']).columns.tolist()

In [65]: df['income'] = df['income'].apply(lambda x: 0 if x == '<=50K' else 1)

Handling Missing Values

In [24]: df.isnull().sum().sum()

Out[24]: 4261

In [25]: missing_count = df.isnull().sum()

value_count = df.isnull().count()
missing_percentage = round(missing_count / value_count * 100, 2)
missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percen
missing_df

Out[25]: count percentage

age 0 0.00

workclass 1836 5.64

fnlwgt 0 0.00

education 0 0.00

education.num 0 0.00

marital.status 0 0.00

occupation 1843 5.66

relationship 0 0.00

race 0 0.00

sex 0 0.00

capital.gain 0 0.00

capital.loss 0 0.00

hours.per.week 0 0.00

native.country 582 1.79

income 0 0.00

In [30]: # !pip install missingno

import missingno as msno

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 14/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

msno.matrix(df);

In [66]: num_imputer = SimpleImputer(strategy='median')

cat_imputer = SimpleImputer(strategy='most_frequent')

# Impute numerical columns

df[num_features] = num_imputer.fit_transform(df[num_features])

# Impute categorical columns

df[cat_features] = cat_imputer.fit_transform(df[cat_features])

In [27]: # Let's observe our data in a table

def get_unique_values(df):

output_data = []

for col in df.columns:

# If the number of unique values in the column is less than or equal to

if df.loc[:, col].nunique() <= 10:
# Get the unique values in the column
unique_values = df.loc[:, col].unique()
# Append the column name, number of unique values, unique values, a
output_data.append([col, df.loc[:, col].nunique(), unique_values, d
else:
# Otherwise, append only the column name, number of unique values,
output_data.append([col, df.loc[:, col].nunique(),"-", df.loc[:, co

output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Un

return output_df

In [28]: get_unique_values(df)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 15/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[28]: Column Number of Unique Data

Unique Values
Name Values Type

0 age 73 - float64

[Private, State-gov, Federal-gov,

1 workclass 8 object
Self-emp-not...

2 fnlwgt 21648 - float64

3 education 16 - object

4 education.num 16 - float64

[Widowed, Divorced, Separated,

5 marital.status 7 object
Never-married, ...

6 occupation 14 - object

[Not-in-family, Unmarried, Own-

7 relationship 6 object
child, Other-re...

[White, Black, Asian-Pac-Islander,

8 race 5 object
Other, Amer...

9 sex 2 [Female, Male] object

10 capital.gain 119 - float64

11 capital.loss 92 - float64

12 hours.per.week 94 - float64

13 native.country 41 - object

14 income 2 [0, 1] int64

In [67]: import plotly.graph_objects as go

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2,

subplot_titles=("Unique values per Categorical feature", "U

for col_type, col, color in [("exclude", 1, '#016CC9'), ("include", 2, '#DEB078

temp_data = df.select_dtypes(**{col_type: "number"}).nunique().sort_values(
fig.add_trace(go.Bar(x=temp_data.index, y=temp_data.values, marker=dict(col

fig.show()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 16/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Feature Engineering and Outliers

Categorical Features
In [11]: df[cat_features].columns

Out[11]: Index(['workclass', 'education', 'marital.status', 'occupation',

'relationship', 'race', 'sex', 'native.country'],
dtype='object')

In [31]: sorted_workclass = ['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov', 'S

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

counts = df['workclass'].value_counts().reindex(sorted_workclass[::-1])
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('Workclass', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=12)
ax1.tick_params(axis='y', labelsize=16)

# 2.grafig: Workclass Distribution by Income

sns.countplot(y=df["workclass"], hue=df['income'].astype(str), ax=ax2, palette=
ax2.set_title('Workclass Distribution by Income', fontsize=16)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 17/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

ax2.legend(title='Income', loc='lower right', fontsize=12, title_fontsize='18')

ax2.tick_params(axis='y', labelsize=16);

General Insights

The private sector is the most dominant category among the work classes and
creates a significant disparity in income distribution.

Among self-employed individuals, those who are incorporated earn higher

incomes compared to those who are not incorporated.

For local, state, and federal government jobs, the low-income category is
dominant; however, a significant portion also falls into the high-income category.

Individuals who work without pay and those who have never worked are generally
found in the low-income category.

In [32]: sorted_education = df['education'].value_counts().index[::-1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Birinci grafik: Top Education Levels

counts = df['education'].value_counts().reindex(sorted_education)
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('Education Levels', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=16)
ax1.tick_params(axis='y', labelsize=16)

# İkinci grafik: Education Distribution by Income

sns.countplot(y=df["education"], hue=df['income'].astype(str), ax=ax2, palette=
ax2.set_title('Education Distribution by Income', fontsize=16)
ax2.legend(title='Income', loc='lower right', fontsize=16, title_fontsize='18')
ax2.tick_params(axis='y', labelsize=16)

plt.tight_layout()
plt.show()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 18/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [68]: df['education'].replace(['1st-4th', '5th-6th'], 'elementary_school', inplace=Tr

df['education'].replace(['7th-8th', '9th', '10th', '11th', '12th'], 'secondary_
df['education'].replace(['Assoc-acdm', 'Assoc-voc'], 'Assoc', inplace=True)

Category Merging: Dividing education levels into too many categories can complicate
data analysis and modeling processes. Therefore, similar levels have been combined to
form larger and more meaningful categories.

In [34]: fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# Birinci grafik: Top Education Levels

counts = df['marital.status'].value_counts()
counts.plot(kind="barh", ax=ax1, color="teal")
ax1.set_title('marital.status', fontsize=16)
ax1.bar_label(ax1.containers[0], labels=counts.values, fontsize=16)
ax1.tick_params(axis='y', labelsize=16)

# İkinci grafik: Education Distribution by Income

sns.countplot(y=df["marital.status"], hue=df['income'].astype(str), ax=ax2, pal
ax2.set_title('marital.status Distribution by Income', fontsize=16)
ax2.legend(title='Income', loc='lower right', fontsize=16, title_fontsize='18')
ax2.tick_params(axis='y', labelsize=16)

plt.tight_layout()
plt.show()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 19/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [69]: df['marital.status'].replace(
['Never-married'], 'NotMarried', inplace=True
)
df['marital.status'].replace(
['Married-AF-spouse', 'Married-civ-spouse'], 'Married', inplace=True
)
df['marital.status'].replace(
['Married-spouse-absent', 'Separated'], 'Separated', inplace=True
)
df['marital.status'].replace(
['Divorced', 'Widowed'], 'Widowed', inplace=True
)

Marital Status Categories Merging In order to simplify the analysis and improve
model performance, we combined similar marital status categories. This helps in
reducing the number of distinct categories, making the data more manageable and the
results more interpretable.

In [41]: sns.histplot(data=df, x='age', hue='marital.status', multiple='stack', palette=

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 20/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

General Insights

Marriage and Age: Marriage rates are low among young adults, peak in middle age,
and decline again in older age. This indicates that focusing on education and career is
common in early life, marriage and family building are more prevalent in middle age,
and loss of a spouse increases in older age.

Tendency Not to Marry: The non-marriage rates are higher among younger age
groups, suggesting that education and career-oriented lifestyles are more common
in modern societies.

Loss of Spouse and Separations: Widowhood is more common in older age,

while separations are more concentrated in middle age. This suggests that both
increased rates of spouse loss due to health reasons and midlife crises or marital
problems are more frequent in these age groups.

In [42]: sns.countplot(y='occupation', hue='income', data=df, order=df['occupation'].val

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 21/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [43]: sns.boxplot(y='occupation', x='age', data=df, order=df['occupation'].value_coun

In [44]: sns.countplot(y='occupation', hue='sex', data=df, order=df['occupation'].value_

plt.title('Occupation Distribution by Sex', fontsize=16);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 22/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [46]: pivot_table = df.pivot_table(index='education', columns='occupation', aggfunc=

sns.heatmap(pivot_table, annot=True, fmt='d', cmap='viridis');

General Insights

General InsightsIncome and Occupation: Professional and managerial roles yield

higher incomes, while service and manual labor roles are lower-income.

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 23/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Age and Occupation: Older individuals are more prevalent in high-responsibility

roles, whereas younger individuals occupy more entry-level or physically
demanding jobs.
Gender and Occupation: There are significant gender disparities, with males
dominating technical and managerial fields and females more present in clerical
and service roles.
Education and Occupation: Higher education levels correlate with higher-level
occupations, whereas lower education levels are sufficient for service and manual
jobs.

In [47]: sns.countplot(y='relationship', hue='income', data=df, order=df['relationship']

In [48]: sns.countplot(y='race', hue='income', data=df, order=df['race'].value_counts()

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 24/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [71]: df['race'].replace(['Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other'],' Othe

In [15]: fig = plt.figure(figsize = (10,6))

ax = fig.add_axes([0,0,1,1])
counts = df["native.country"].value_counts().sort_values(ascending=False).head(
counts.plot(kind = "bar")
plt.title('Top 20 Brand')
plt.xlabel('native.country')
plt.xticks(rotation = 90)
ax.bar_label(ax.containers[0], labels=counts.values, fontsize=12);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 25/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [70]: df['native.country'] = df['native.country'].replace({

"USA": "United-States"
}).apply(lambda x: "United-States" if x == "United-States" else ("Mexico" if x

General Insights:

Data Imbalance:

The data is heavily skewed towards individuals from the United States, which could
impact the generalizability of any models or analyses performed.

The dataset is predominantly composed of individuals from the United States, with a
minor but noticeable representation from Mexico and a variety of other countries. This
heavy imbalance towards the US population suggests the need for careful handling of
data to avoid biases.

Given the significant representation from Mexico, segmented analyses (e.g., comparing
outcomes between US natives and Mexican immigrants) might be feasible and
insightful.

For other countries with smaller representations, aggregated analyses might be more
appropriate.

Numerical Features
In [54]: df['age_bin'] = pd.cut(df['age'], bins=20)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

sns.countplot(y='age_bin', data=df, palette='viridis', ax=ax1)
sns.histplot(data=df, x='age', hue='income', kde=True, palette='viridis', ax=ax

In [72]: px.histogram(df, x='capital.gain', color="income", barmode='group', title='Inco

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 26/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [73]: px.histogram(df, x='capital.loss', color="income", barmode='group', title='Inco

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 27/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [74]: df['capital_diff'] = df['capital.gain'] - df['capital.loss']

df['capital_diff'] = pd.cut(df['capital_diff'], bins = [-5000, 5000, 100000], l
df['capital_diff'] = df['capital_diff'].astype('object')
df.drop(['capital.gain'], axis = 1, inplace = True)
df.drop(['capital.loss'], axis = 1, inplace = True)

Purpose: To combine the capital.gain (capital gain) and capital.loss (capital loss)
columns into a single column to calculate the net capital gain.

Result: A new column named capital_diff is created.

In [75]: px.histogram(df, x='capital_diff', color="income", barmode='group', title='Inco

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 28/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [76]: px.histogram(df, x='hours.per.week', color="income", barmode='group', title='In

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 29/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [43]: sns.boxplot(data=df,y="hours.per.week",x='income', whis=3);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 30/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [44]: outliers = df[df['hours.per.week'] > 80]

outliers_income_counts = outliers['income'].value_counts()
outliers_income_counts

Out[44]: income
0 145
1 63
Name: count, dtype: int64

In [45]: outliers = df[df['hours.per.week'] < 15]

outliers_income_counts = outliers['income'].value_counts()
outliers_income_counts

Out[45]: income
0 892
1 81
Name: count, dtype: int64

In [77]: df = df[~((df["hours.per.week"] > 80) | (df["hours.per.week"] < 15))]

The code segments are used to analyze the income status of individuals with extremely
high or low weekly working hours and to remove these outliers from the dataset.

In [78]: df.drop(['fnlwgt'], axis = 1, inplace = True)

fnlwgt: As a result of the analysis, the effect of fnlwgt on the model is almost
negligible. Therefore, it was excluded from the data.

In [48]: df.shape

Out[48]: (31356, 13)

In [49]: df.columns

Out[49]: Index(['age', 'workclass', 'education', 'education.num', 'marital.status',

'occupation', 'relationship', 'race', 'sex', 'hours.per.week',
'native.country', 'income', 'capital_diff'],
dtype='object')

Correlation
In [79]: numeric_df = df.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()

In [86]: sns.heatmap(corr_matrix .corr(), annot=True, cmap="Blues");

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 31/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [87]: def plot_target_correlation_heatmap(df, target_variable):

df_numeric = df.select_dtypes(include=[np.number])
df_corr_target = df_numeric.corr()

plt.figure(figsize=(2, 7))
sns.heatmap(df_corr_target[[target_variable]], annot=True, vmin=-1, vmax=1,
plt.title(f'Correlation with {target_variable}')
plt.show()
plot_target_correlation_heatmap(df, 'income')

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 32/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Multicollinearity
In [52]: def color_correlation1(val):
"""
Takes a scalar and returns a string with
the css property in a variety of color scales
for different correlations.
"""
if val >= 0.6 and val < 0.99999 or val <= -0.6 and val > -0.99999:
color = 'red'
elif val < 0.6 and val >= 0.3 or val > -0.6 and val <= -0.3:
color = 'blue'
elif val == 1:
color = 'green'
else:
color = 'black'
return 'color: %s' % color

numeric_df = df.select_dtypes(include=[np.number])

numeric_df.corr().style.applymap(color_correlation1)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 33/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[52]: age education.num hours.per.week income

age 1.000000 0.035414 0.110759 0.244210

education.num 0.035414 1.000000 0.163208 0.336660

hours.per.week 0.110759 0.163208 1.000000 0.241994

income 0.244210 0.336660 0.241994 1.000000

In [80]: X = df.drop("income", axis=1)

y = df['income']

Models

Train | Test Split

In [81]: X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2,
stratify=y,
random_state=42)

make_column_transformer
In [24]: df.columns

Out[24]: Index(['age', 'workclass', 'education', 'education.num', 'marital.status',

'occupation', 'relationship', 'race', 'sex', 'hours.per.week',
'native.country', 'income', 'capital_diff'],
dtype='object')

In [82]: cat_onehot = [
'workclass', 'occupation', 'relationship', 'race', 'sex', 'native.country',
'marital.status'
]
cat_ordinal = ['education', 'capital_diff']

cat_for_edu = [
'Preschool', 'elementary_school', 'secondary_school', 'HS-grad',
'Some-college', 'Assoc', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate
]
cat_for_capdiff = ['Low', 'High']

In [83]: column_trans = make_column_transformer(

(OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_onehot),
(OrdinalEncoder(categories=[cat_for_edu, cat_for_capdiff]), cat_ordinal),
remainder=StandardScaler())

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 34/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Logistic Regression Model

In [84]: operations = [("transformer", column_trans), ("logistic", LogisticRegression(ma

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[84]: ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ LogisticRegression ?

In [85]: ConfusionMatrixDisplay.from_estimator(pipe_model,
X_test,
y_test,
normalize='true');

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 35/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [86]: from yellowbrick.classifier import ClassPredictionError

visualizer = ClassPredictionError(pipe_model)
# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)
# Evaluate the model on the test data
visualizer.score(X_test, y_test)
# Draw visualization
visualizer.poof();

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 36/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [87]: def eval_metric(model, X_train, y_train, X_test, y_test,i):

y_train_pred = model.predict(X_train)
y_pred = model.predict(X_test)
print(f"{i} Test_Set")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print(f"{i} Train_Set")
print(confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))

In [88]: eval_metric(pipe_model, X_train, y_train, X_test, y_test, "logistic")

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 37/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

logistic Test_Set
[[4443 290]
[ 636 903]]
precision recall f1-score support

0 0.87 0.94 0.91 4733

1 0.76 0.59 0.66 1539

accuracy 0.85 6272

macro avg 0.82 0.76 0.78 6272
weighted avg 0.85 0.85 0.85 6272

logistic Train_Set
[[17632 1296]
[ 2528 3628]]
precision recall f1-score support

0 0.87 0.93 0.90 18928

1 0.74 0.59 0.65 6156

accuracy 0.85 25084

macro avg 0.81 0.76 0.78 25084
weighted avg 0.84 0.85 0.84 25084

Cross Validate
In [113… operations = [("transformer", column_trans), ("logistic", LogisticRegression(ma

pipecv_model = Pipeline(steps=operations)

cv = StratifiedKFold(n_splits=10)

scores = cross_validate(pipecv_model,
X_train,
y_train,
scoring=["accuracy", "precision", "recall", "f1"],
cv=cv,
return_train_score = True)
df_scores = pd.DataFrame(scores, index=range(1,11))
df_scores.mean()[2:]

Out[113… test_accuracy 0.846596

train_accuracy 0.847543
test_precision 0.734494
train_precision 0.736169
test_recall 0.588371
train_recall 0.590355
test_f1 0.653002
train_f1 0.655246
dtype: float64

Precision Recall Curve and Roc Curve Display

In [30]: RocCurveDisplay.from_estimator(pipe_model, X_test, y_test);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 38/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [31]: PrecisionRecallDisplay.from_estimator(pipe_model, X_test, y_test);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 39/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

GridSearchCV
param_grid = [ { "logistic__penalty" : ['l1', 'l2'], "logistic__C" : [0.01, 0.05,0.03, 0.1, 1],
"logistic__class_weight": ["balanced", None] , "logistic__solver": ['liblinear', 'saga', 'lbfgs'],
"logistic__max_iter": [1000, 2000] } ]

Many grids of money have been tried. Finally, the following features were identified.

In [89]: operations = [("transformer", column_trans), ("logistic", LogisticRegression(ra

log_model = Pipeline(steps=operations)

param_grid = [
{
"logistic__penalty" : ['l1'],
"logistic__C" : [0.03],
"logistic__class_weight": ["balanced"] ,
"logistic__solver": ['saga'],
"logistic__max_iter": [1000]
}
]
cv = StratifiedKFold(n_splits = 10)

grid_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 40/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

n_jobs = -1,
return_train_score=True).fit(X_train, y_train)

In [90]: grid_model.best_estimator_

Out[90]: ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ LogisticRegression ?

In [91]: grid_model.best_score_

Out[91]: 0.683264633932356

In [92]: grid_model.best_index_

Out[92]: 0

In [93]: pd.DataFrame(grid_model.cv_results_).loc[0, ["mean_test_score", "mean_train_sco

Out[93]: mean_test_score 0.683265

mean_train_score 0.682596
Name: 0, dtype: object

In [94]: y_pred = grid_model.predict(X_test)

y_pred_proba = grid_model.predict_proba(X_test)

log_f1 = f1_score(y_test, y_pred)

log_recall = recall_score(y_test, y_pred)

log_auc = roc_auc_score(y_test, y_pred)

precision, recall, _ = precision_recall_curve(y_test, grid_model.predict_proba(

log_prc = auc(recall, precision)

log_grid_model = eval_metric(grid_model, X_train, y_train, X_test, y_test,"logi

log_grid_model

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 41/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

logisticgrid Test_Set
[[3783 950]
[ 253 1286]]
precision recall f1-score support

0 0.94 0.80 0.86 4733

1 0.58 0.84 0.68 1539

accuracy 0.81 6272

macro avg 0.76 0.82 0.77 6272
weighted avg 0.85 0.81 0.82 6272

logisticgrid Train_Set
[[15047 3881]
[ 952 5204]]
precision recall f1-score support

0 0.94 0.79 0.86 18928

1 0.57 0.85 0.68 6156

accuracy 0.81 25084

macro avg 0.76 0.82 0.77 25084
weighted avg 0.85 0.81 0.82 25084

In [95]: log_grid_matrix = ConfusionMatrixDisplay.from_estimator(grid_model,

X_test,
y_test,
normalize='true');

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 42/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [96]: RocCurveDisplay.from_estimator(grid_model, X_test, y_test);

In [97]: PrecisionRecallDisplay.from_estimator(grid_model, X_test, y_test);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 43/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

KNN Model
In [39]: operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[39]: ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ KNeighborsClassifier ?

In [164… eval_metric(pipe_model, X_train, y_train, X_test, y_test, "knn")

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 44/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

knn Test_Set
[[4286 447]
[ 655 884]]
precision recall f1-score support

0 0.87 0.91 0.89 4733

1 0.66 0.57 0.62 1539

accuracy 0.82 6272

macro avg 0.77 0.74 0.75 6272
weighted avg 0.82 0.82 0.82 6272

knn Train_Set
[[17746 1182]
[ 1881 4275]]
precision recall f1-score support

0 0.90 0.94 0.92 18928

1 0.78 0.69 0.74 6156

accuracy 0.88 25084

macro avg 0.84 0.82 0.83 25084
weighted avg 0.87 0.88 0.88 25084

In [168… RocCurveDisplay.from_estimator(pipe_model, X_test, y_test);

In [169… PrecisionRecallDisplay.from_estimator(pipe_model, X_test, y_test);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 45/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Elbow Method for Choosing Reasonable K Values

In [98]: operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]

pipe_model = Pipeline(steps=operations)

pipe_model.fit(X_train, y_train)

Out[98]: ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ KNeighborsClassifier ?

In [172… test_error_rates = []

for k in range(1, 10):

operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 46/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

knn_pipe_model = Pipeline(steps=operations)

scores = cross_validate(knn_pipe_model, X_train, y_train, scoring = ['f1'],

f1_mean = scores["test_f1"].mean()

test_error = 1 - f1_mean

test_error_rates.append(test_error)

In [174… plt.figure(figsize=(15, 8))

plt.plot(range(1, 10),
test_error_rates,
color='red',
marker='o',
markerfacecolor='yellow',
markersize=10)

plt.title('Error Rate vs. K Value')

plt.xlabel('K_values')
plt.ylabel('Error Rate')
plt.hlines(y=0.25, xmin=0, xmax=20, colors='b', linestyles="--")
plt.hlines(y=0.65, xmin=0, xmax=20, colors='b', linestyles="--")

Out[174… <matplotlib.collections.LineCollection at 0x2604749d090>

Overfiting and underfiting control for k values

In [175… test_error_rates = []
train_error_rates = []

for k in range(1, 10):

operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n

knn_pipe_model = Pipeline(steps=operations)

knn_pipe_model.fit(X_train, y_train)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 47/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

scores = cross_validate(knn_pipe_model, X_train, y_train, scoring = ['f1'],

f1_test_mean = scores["test_f1"].mean()
f1_train_mean = scores["train_f1"].mean()

test_error = 1 - f1_test_mean
train_error = 1 -f1_train_mean
test_error_rates.append(test_error)
train_error_rates.append(train_error)

In [176… plt.figure(figsize=(15, 8))

plt.plot(range(1, 10),
test_error_rates,
color='red',
marker='o',
markerfacecolor='yellow',
markersize=10)

plt.plot(range(1, 10),
train_error_rates,
color='red',
marker='o',
markerfacecolor='green',
markersize=10)

plt.title('Error Rate vs. K Value')

plt.xlabel('K_values')
plt.ylabel('Error Rate')
plt.hlines(y=0.25, xmin=0, xmax=20, colors='b', linestyles="--")
plt.hlines(y=0.65, xmin=0, xmax=20, colors='b', linestyles="--")

Out[176… <matplotlib.collections.LineCollection at 0x260475d7150>

In [177… k_list = [3, 5, 7]

for i in k_list:
operations = [("transformer", column_trans), ("knn", KNeighborsClassifier(n
knn = Pipeline(steps=operations)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 48/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

knn.fit(X_train, y_train)
print(f'WITH K={i}\n')
eval_metric(knn, X_train, y_train, X_test, y_test, "knn_elbow")

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 49/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

WITH K=3

knn_elbow Test_Set
[[4236 497]
[ 684 855]]
precision recall f1-score support

0 0.86 0.89 0.88 4733

1 0.63 0.56 0.59 1539

accuracy 0.81 6272

macro avg 0.75 0.73 0.73 6272
weighted avg 0.80 0.81 0.81 6272

knn_elbow Train_Set
[[17848 1080]
[ 1568 4588]]
precision recall f1-score support

0 0.92 0.94 0.93 18928

1 0.81 0.75 0.78 6156

accuracy 0.89 25084

macro avg 0.86 0.84 0.85 25084
weighted avg 0.89 0.89 0.89 25084

WITH K=5

knn_elbow Test_Set
[[4286 447]
[ 655 884]]
precision recall f1-score support

0 0.87 0.91 0.89 4733

1 0.66 0.57 0.62 1539

accuracy 0.82 6272

macro avg 0.77 0.74 0.75 6272
weighted avg 0.82 0.82 0.82 6272

knn_elbow Train_Set
[[17746 1182]
[ 1881 4275]]
precision recall f1-score support

0 0.90 0.94 0.92 18928

1 0.78 0.69 0.74 6156

accuracy 0.88 25084

macro avg 0.84 0.82 0.83 25084
weighted avg 0.87 0.88 0.88 25084

WITH K=7

knn_elbow Test_Set
[[4315 418]
[ 647 892]]
precision recall f1-score support

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 50/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

0 0.87 0.91 0.89 4733

1 0.68 0.58 0.63 1539

accuracy 0.83 6272

macro avg 0.78 0.75 0.76 6272
weighted avg 0.82 0.83 0.83 6272

knn_elbow Train_Set
[[17663 1265]
[ 2017 4139]]
precision recall f1-score support

0 0.90 0.93 0.91 18928

1 0.77 0.67 0.72 6156

accuracy 0.87 25084

macro avg 0.83 0.80 0.82 25084
weighted avg 0.87 0.87 0.87 25084

Cross Validate For Optimal K Value

In [178… operations = operations = [("transformer", column_trans), ("knn", KNeighborsC

model = Pipeline(steps=operations)

scores = cross_validate(model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=10,
return_train_score=True)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]

Out[178… test_accuracy 0.834077

train_accuracy 0.868296
test_precision 0.684224
train_precision 0.763656
test_recall 0.602503
train_recall 0.671017
test_f1 0.640533
train_f1 0.714341
dtype: float64

Gridsearch Method for Choosing Reasonable K

Values
In [100… operations = [("transformer", column_trans), ("knn", KNeighborsClassifier())]
knn_model = Pipeline(steps=operations)

Many grids of money have been tried. Finally, the following features were
identified.Tried values up to k_values = 30.

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 51/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [101… param_grid = [
{
"knn__n_neighbors": [19],
"knn__metric": ['euclidean'],
"knn__weights": ['uniform']
}
]

knn_grid_model = GridSearchCV(knn_model,
param_grid,
scoring='f1',
cv=5,
return_train_score=True,
n_jobs=-1).fit(X_train, y_train)

In [102… knn_grid_model.best_estimator_

Out[102… ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ KNeighborsClassifier ?

In [103… knn_grid_model.best_index_

Out[103… 0

In [104… pd.DataFrame(
knn_grid_model.cv_results_).loc[0,["mean_test_score", "mean_train_score"]]

Out[104… mean_test_score 0.6397

mean_train_score 0.675013
Name: 0, dtype: object

In [105… knn_grid_model.best_score_

Out[105… 0.6396999259520801

In [106… y_pred = knn_grid_model.predict(X_test)

y_pred_proba = knn_grid_model.predict_proba(X_test)

knn_f1 = f1_score(y_test, y_pred)

knn_recall = recall_score(y_test, y_pred)

knn_auc = roc_auc_score(y_test, y_pred)

precision, recall, _ = precision_recall_curve(y_test, knn_grid_model.predict_pr

knn_prc = auc(recall, precision)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 52/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

eval_metric(knn_grid_model, X_train, y_train, X_test, y_test, "knn_grid") #k=19

knn_grid Test_Set
[[4367 366]
[ 628 911]]
precision recall f1-score support

0 0.87 0.92 0.90 4733

1 0.71 0.59 0.65 1539

accuracy 0.84 6272

macro avg 0.79 0.76 0.77 6272
weighted avg 0.83 0.84 0.84 6272

knn_grid Train_Set
[[17492 1436]
[ 2269 3887]]
precision recall f1-score support

0 0.89 0.92 0.90 18928

1 0.73 0.63 0.68 6156

accuracy 0.85 25084

macro avg 0.81 0.78 0.79 25084
weighted avg 0.85 0.85 0.85 25084

As a result of the values we gave to K, the tests did not improve, but we
prevented overfitting and found more reliable results

Precision Recall Curve and Roc Curve Display

In [190… RocCurveDisplay.from_estimator(knn_grid_model, X_test, y_test);

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 53/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [191… y_pred_proba = knn.predict_proba(X_test)

roc_auc_score(y_test, y_pred_proba[:,1])

Out[191… 0.859903581601922

In [192… PrecisionRecallDisplay.from_estimator(pipe_model, X_test, y_test)

Out[192… <sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x2604

727de50>

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 54/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

SVM Model
In [107… operations = [("transformer", column_trans),("SVC", SVC(random_state=42))]
pipe_model = Pipeline(steps=operations)
pipe_model.fit(X_train, y_train)

Out[107… ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ SVC ?

Model Performance
In [194… eval_metric(pipe_model, X_train, y_train, X_test, y_test, "svm")

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 55/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

svm Test_Set
[[4498 235]
[ 692 847]]
precision recall f1-score support

0 0.87 0.95 0.91 4733

1 0.78 0.55 0.65 1539

accuracy 0.85 6272

macro avg 0.82 0.75 0.78 6272
weighted avg 0.85 0.85 0.84 6272

svm Train_Set
[[17901 1027]
[ 2756 3400]]
precision recall f1-score support

0 0.87 0.95 0.90 18928

1 0.77 0.55 0.64 6156

accuracy 0.85 25084

macro avg 0.82 0.75 0.77 25084
weighted avg 0.84 0.85 0.84 25084

In [198… operations = [("transformer", column_trans), ("SVC", SVC(random_state=42))]

pipe_model = Pipeline(steps=operations)

cv = StratifiedKFold(n_splits=5)

scores = cross_validate(pipe_model,
X_train,
y_train,
scoring=['accuracy', 'precision', 'recall', 'f1'],
cv=cv,
return_train_score=True,
n_jobs=-1)

df_scores = pd.DataFrame(scores, index=range(1, 6))

df_scores.mean()[2:]

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.

[Parallel(n_jobs=-1)]: Done 3 out of 5 | elapsed: 2.5min remaining: 1.7mi
n
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 6.7min finished
Out[198… test_accuracy 0.847353
train_accuracy 0.849097
test_precision 0.764130
train_precision 0.769085
test_recall 0.546947
train_recall 0.550357
test_f1 0.637474
train_f1 0.641591
dtype: float64

GridsearchCV

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 56/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

param_grid = {'SVC__C': [0.01, 0.1, 1, 10, 100], 'SVC__gamma': ["scale", "auto", 0.001,
0.01, 0.1, 0.5], 'SVC__kernel': ['rbf', 'linear'],}

Many grids of money have been tried. Finally, the following features were identified.

In [108… param_grid = {"SVC__C":[1],

"SVC__gamma":[0.3],
"SVC__kernel":["rbf"]}

operations = [("transformer", column_trans), ("SVC", SVC(class_weight="balanced

svm_model_grid = GridSearchCV(pipe_model,
param_grid,
scoring="recall_macro",
cv=5,
return_train_score=True,
n_jobs=2,
verbose=2).fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits

In [109… svm_model_grid.best_estimator_

Out[109… ▸ Pipeline i ?

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ SVC ?

In [110… svm_model_grid.best_index_

Out[110… 0

In [111… pd.DataFrame(
svm_model_grid.cv_results_).loc[0,
["mean_test_score", "mean_train_score"]]

Out[111… mean_test_score 0.773176

mean_train_score 0.807849
Name: 0, dtype: object

In [112… svm_model_grid.best_score_

Out[112… 0.7731760615298443

In [113… y_pred = svm_model_grid.predict(X_test)

y_pred_proba = svm_model_grid.decision_function(X_test)

svm_f1 = f1_score(y_test, y_pred)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 57/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

svm_recall = recall_score(y_test, y_pred)

svm_auc = roc_auc_score(y_test, y_pred)

precision, recall, _ = precision_recall_curve(y_test, svm_model_grid.decision_f

svm_prc = auc(recall, precision)

eval_metric(svm_model_grid, X_train, y_train, X_test, y_test, "svm_grid")

svm_grid Test_Set
[[4409 324]
[ 607 932]]
precision recall f1-score support

0 0.88 0.93 0.90 4733

1 0.74 0.61 0.67 1539

accuracy 0.85 6272

macro avg 0.81 0.77 0.79 6272
weighted avg 0.85 0.85 0.85 6272

svm_grid Train_Set
[[17835 1093]
[ 2040 4116]]
precision recall f1-score support

0 0.90 0.94 0.92 18928

1 0.79 0.67 0.72 6156

accuracy 0.88 25084

macro avg 0.84 0.81 0.82 25084
weighted avg 0.87 0.88 0.87 25084

In [114… decision_function = svm_model_grid.decision_function(X_test)

average_precision_score(y_test, decision_function)

Out[114… 0.731964885295869

Precision Recall Curve and Roc Curve Display

In [52]: RocCurveDisplay.from_estimator(svm_model_grid, X_test, y_test);

Out[52]: <sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x2b5463ec710>

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 58/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

In [53]: PrecisionRecallDisplay.from_estimator(svm_model_grid, X_test, y_test);

Out[53]: <sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x2b54

625b950>

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 59/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Compare Models Performance

In [115… compare = pd.DataFrame({"Model": ["Logistic Regression", "KNN", "SVM"],
"F1": [log_f1, knn_f1, svm_f1 ],
"Recall": [log_recall, knn_recall, svm_recall ],
"ROC_AUC": [log_auc, knn_auc, svm_auc],
"PRC" : [log_prc, knn_prc, svm_prc]})
def labels(ax):
for p in ax.patches:
width = p.get_width() # get bar length
ax.text(width, # set the text at 1 unit r
p.get_y() + p.get_height() / 2, # get Y coordinate + X coo
'{:1.3f}'.format(width), # set variable to display,
ha = 'left', # horizontal alignment
va = 'center') # vertical alignment
plt.figure(figsize=(14,12))

plt.subplot(411)
compare = compare.sort_values(by="F1", ascending=False)
ax=sns.barplot(x="F1", y="Model", data=compare, palette="magma")
labels(ax)

plt.subplot(412)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="magma")
labels(ax)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 60/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

plt.subplot(413)
compare = compare.sort_values(by="ROC_AUC", ascending=False)
ax=sns.barplot(x="ROC_AUC", y="Model", data=compare, palette="magma")
labels(ax)

plt.subplot(414)
compare = compare.sort_values(by="PRC", ascending=False)
ax=sns.barplot(x="PRC", y="Model", data=compare, palette="magma")
labels(ax)

plt.show()

Final Model and Model Deployment

In [118… operations = [("transformer", column_trans), ("logistic", LogisticRegression(ra

log_model = Pipeline(steps=operations)

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 61/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

final_pipe_model = GridSearchCV(estimator=log_model,
param_grid=param_grid,
cv=cv,
scoring = "f1",
n_jobs = -1,
return_train_score=True).fit(X, y)

In [119… import pickle

pickle.dump(final_model, open("final_pipe_model", "wb"))

In [120… new_model = pickle.load(open("final_pipe_model", "rb"))

new_model

Out[120… ▸ GridSearchCV i ?

▸ best_estimator_: Pipeline

▸ transformer: ColumnTransformer ?

▸ onehotencoder ▸ ordinalencoder ▸ remainder

▸ ▸ ▸
OneHotEncoder OrdinalEncoder StandardScaler
? ? ?

▸ LogisticRegression ?

Prediction
In [126… my_dict= {
'age': [44.0, 32.0, 30.0],
'workclass': ['Federal-gov', 'Private', 'Self-emp-not-inc'],
'education': ['Bachelors', 'Bachelors', 'Some-college'],
'education.num': [13.0, 13.0, 10.0],
'marital.status': ['Widowed', 'Married', 'NotMarried'],
'occupation': ['Tech-support', 'Sales', 'Sales'],
'relationship': ['Not-in-family', 'Husband', 'Other-relative'],
'race': ['White', 'White', 'Others'],
'sex': ['Male', 'Male', 'Male'],
'hours.per.week': [40.0, 40.0, 40.0],
'native.country': ['United-States', 'United-States', 'Other'],
'capital_diff': ['Low', 'Low', 'Low']
}

In [128… sample = pd.DataFrame(my_dict)

sample

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 62/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

Out[128… age workclass education education.num marital.status occupation relationship

Federal- Tech- Not-in-

0 44.0 Bachelors 13.0 Widowed
gov support family

1 32.0 Private Bachelors 13.0 Married Sales Husband

Self-emp- Some- Other-

2 30.0 10.0 NotMarried Sales
not-inc college relative

In [129… new_model.predict(sample)

Out[129… array([1, 1, 0], dtype=int64)

In [130… new_model.decision_function(sample)

Out[130… array([ 0.27369739, 1.16079392, -2.56091314])

📂 Conclusion
In [ ]: # Logistic grid recall: 83, f1 : 0.68 prc=0.75

In an unbalanced dataset, F1-Score and Recall metrics are indeed very important.
These metrics play a critical role in evaluating model performance in unbalanced
datasets, as they measure the model's ability to correctly predict the minority class.

When prioritizing F1-Score and Recall:

The Logistic Regression model stands out with a Recall of 0.83 and an F1-Score of
0.68. This model demonstrates balanced performance across the classes in the
unbalanced dataset, effectively capturing the minority class while also performing
well in overall classification.

The KNN Model, although it performs well in terms of accuracy, lags behind
Logistic Regression with a Recall of 0.59 and an F1-Score of 0.64. This indicates that
the model is less effective at capturing the minority class in the unbalanced
dataset.

The SVM Model, despite excelling in accuracy, also falls behind Logistic Regression
in these two metrics with a Recall of 0.60 and an F1-Score of 0.66. It is evident that
SVM is not sufficiently successful in capturing the minority class.

Based on these results, I can say that the Logistic Regression model offers the best
performance in terms of Recall and F1-Score for unbalanced datasets and should
therefore be preferred. Especially in unbalanced datasets, it is critical that the
model correctly identifies the minority class, making Logistic Regression the most
suitable choice.

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 63/64

13.08.2024 12:22 Fatma Nur Azman Logistic-KNN-SVM (Adult_Income_Prediction)

THANK YOU

If you want to be the first to be informed about new projects, please do not
forget to follow us - by Fatma Nur AZMAN
Fatmanurazman.com | Linkedin | Github | Kaggle | Tableau

localhost:8888/nbconvert/html/Desktop/FNUR-DATA-SCIENCE-06-04-2024/14-ML/00_Capstone Projects/4-Logistic Regression %26 KNN %26… 64/64

Adult Income Prediction
No ratings yet
Adult Income Prediction
9 pages
AI Pipeline for Income Prediction
No ratings yet
AI Pipeline for Income Prediction
16 pages
Report 1 AI17C DBM302m KhaiHoan BaoChau VanThu
No ratings yet
Report 1 AI17C DBM302m KhaiHoan BaoChau VanThu
6 pages
Understanding Married Civil Spouse Data
No ratings yet
Understanding Married Civil Spouse Data
2 pages
Dataiku - Predicting Income Levels Using US Census Data
No ratings yet
Dataiku - Predicting Income Levels Using US Census Data
14 pages
Adult Census Income Prediction
100% (1)
Adult Census Income Prediction
31 pages
Income Prediction Analysis
No ratings yet
Income Prediction Analysis
16 pages
Adult Income Prediction Using Machine Learning Algorithms: Submitted by
No ratings yet
Adult Income Prediction Using Machine Learning Algorithms: Submitted by
9 pages
Adult Income Prediction
0% (1)
Adult Income Prediction
9 pages
3 Essay in Labor and Education Economics
No ratings yet
3 Essay in Labor and Education Economics
165 pages
ML Project
No ratings yet
ML Project
112 pages
Final Shokhrukhsora Toshmukhamedova
No ratings yet
Final Shokhrukhsora Toshmukhamedova
10 pages
Final Juan Guillermo Lopez - Vargas
No ratings yet
Final Juan Guillermo Lopez - Vargas
10 pages
Machine Learning Klasifikasi 1752912401
No ratings yet
Machine Learning Klasifikasi 1752912401
3 pages
NYC Snapshot Feb'2012
No ratings yet
NYC Snapshot Feb'2012
21 pages
DhML4HRSEei8FQ6TnIKJEA C1 Assessment Workbook
No ratings yet
DhML4HRSEei8FQ6TnIKJEA C1 Assessment Workbook
198 pages
Mobility of States Summary
No ratings yet
Mobility of States Summary
3 pages
MA SecA Group7
No ratings yet
MA SecA Group7
20 pages
Engineering Grad Salary Insights
No ratings yet
Engineering Grad Salary Insights
1 page
Pursuing American Dream PDF
No ratings yet
Pursuing American Dream PDF
38 pages
Creative Commons Attribution-Noncommercial-Sharealike License
No ratings yet
Creative Commons Attribution-Noncommercial-Sharealike License
54 pages
Final Shokhrukhsora Toshmukhamedova
No ratings yet
Final Shokhrukhsora Toshmukhamedova
11 pages
dhML4HRSEei8FQ6TnIKJEA C1-Assessment-Workbook
No ratings yet
dhML4HRSEei8FQ6TnIKJEA C1-Assessment-Workbook
207 pages
Activity with-SOFA
No ratings yet
Activity with-SOFA
7 pages
Module 4 Assignment
No ratings yet
Module 4 Assignment
4 pages
Economist CV Variant A4 by Slidesgo 2
No ratings yet
Economist CV Variant A4 by Slidesgo 2
32 pages
(SpringerBriefs in Economics) Rajarshi Majumder (Auth.) - Intergenerational Mobility - A Study of Social Classes in India-Springer India (2013)
No ratings yet
(SpringerBriefs in Economics) Rajarshi Majumder (Auth.) - Intergenerational Mobility - A Study of Social Classes in India-Springer India (2013)
88 pages
India's Demographic Dividend
No ratings yet
India's Demographic Dividend
34 pages
Economic Growth-WPS Office
No ratings yet
Economic Growth-WPS Office
16 pages
Test Metrics
No ratings yet
Test Metrics
10 pages
Unit 1-4 BA 222 Special Topics
No ratings yet
Unit 1-4 BA 222 Special Topics
18 pages
Time-Wasters On Social Media
No ratings yet
Time-Wasters On Social Media
95 pages
Married Civil Spouse Demographics
No ratings yet
Married Civil Spouse Demographics
2 pages
Career Guidance - Data Analytics
No ratings yet
Career Guidance - Data Analytics
29 pages
Ahmedabad Women's Income Data
No ratings yet
Ahmedabad Women's Income Data
2,790 pages
US Median Income 2023
No ratings yet
US Median Income 2023
59 pages
FDI Employment Growth Presentation
No ratings yet
FDI Employment Growth Presentation
28 pages
Economic Inequality Causes-And-Solutions
No ratings yet
Economic Inequality Causes-And-Solutions
12 pages
البطالة، الدخل، الوظائف مفيد
No ratings yet
البطالة، الدخل، الوظائف مفيد
2 pages
RGDRGFD
No ratings yet
RGDRGFD
197 pages
AMP 2030 - The German Labour Market in The Year 2030 - 978-3-7639-5282-3-1
No ratings yet
AMP 2030 - The German Labour Market in The Year 2030 - 978-3-7639-5282-3-1
199 pages
Nguyen Final Project Report
No ratings yet
Nguyen Final Project Report
10 pages
Wealth Management Insights by Ratnam Chitturi
No ratings yet
Wealth Management Insights by Ratnam Chitturi
36 pages
McConnell 22e PPT Macro CH 04
No ratings yet
McConnell 22e PPT Macro CH 04
31 pages
Employment Trends and Skill Development in India
No ratings yet
Employment Trends and Skill Development in India
50 pages
Big Data's Role in Economic Mobility
No ratings yet
Big Data's Role in Economic Mobility
31 pages
File Word
No ratings yet
File Word
9 pages
Bakhshi Et Al. - 2017 - The Future of Skills Employment in 2030
No ratings yet
Bakhshi Et Al. - 2017 - The Future of Skills Employment in 2030
173 pages
Econopath: Navigating Dynamic Destinations: Worrawat Saijai Venue: Faculty of Economics 19 August 2023
No ratings yet
Econopath: Navigating Dynamic Destinations: Worrawat Saijai Venue: Faculty of Economics 19 August 2023
33 pages
(FREE PDF Sample) Immigration and Labor Market Mobility in Israel 1990 To 2009 1st Edition Zvi Eckstein Ebooks
No ratings yet
(FREE PDF Sample) Immigration and Labor Market Mobility in Israel 1990 To 2009 1st Edition Zvi Eckstein Ebooks
77 pages
Python Report
No ratings yet
Python Report
6 pages
Alchemy 4
No ratings yet
Alchemy 4
1 page
5 Buildings of Critical Regionalism
No ratings yet
5 Buildings of Critical Regionalism
6 pages
Midea PTAC Installation Manual 7,000-15,000 BTU
No ratings yet
Midea PTAC Installation Manual 7,000-15,000 BTU
23 pages
Thermodynamics for Engineers
No ratings yet
Thermodynamics for Engineers
14 pages
Construction Management Degree Guide
No ratings yet
Construction Management Degree Guide
27 pages
Mind Management Guide by Swami Mukundananda
No ratings yet
Mind Management Guide by Swami Mukundananda
2 pages
Police Code: General Manual of The Criminal Law
100% (1)
Police Code: General Manual of The Criminal Law
244 pages
Laws of Radicals: Product & Quotient Rules
No ratings yet
Laws of Radicals: Product & Quotient Rules
11 pages
Getting Started With Arduino Gives You Lots of
No ratings yet
Getting Started With Arduino Gives You Lots of
1 page
Whispers of
No ratings yet
Whispers of
47 pages
Press Release 7
No ratings yet
Press Release 7
2 pages
Locum Agreement 2
100% (1)
Locum Agreement 2
14 pages
Chapter 4 The Theory of Individual Behavior PDF
50% (2)
Chapter 4 The Theory of Individual Behavior PDF
14 pages
Radio Drama Rubric
100% (1)
Radio Drama Rubric
2 pages
Robotics LR 5
100% (1)
Robotics LR 5
2 pages
Material 111079 12210325 Textbook
No ratings yet
Material 111079 12210325 Textbook
170 pages
The Carnivore Diet For Beginners Recipes and Meal Plans For Weight Loss, Health, and Healing (Chris Irvin) (Z-Library)
100% (6)
The Carnivore Diet For Beginners Recipes and Meal Plans For Weight Loss, Health, and Healing (Chris Irvin) (Z-Library)
150 pages
(Herbal Reference Library) Kapoor, L. D - Handbook of Ayurvedic Medicinal Plants-CRC Press (2001)
No ratings yet
(Herbal Reference Library) Kapoor, L. D - Handbook of Ayurvedic Medicinal Plants-CRC Press (2001)
425 pages
Firm Objectives in Economics
No ratings yet
Firm Objectives in Economics
43 pages
Bayfront Park Solar License Agreement
No ratings yet
Bayfront Park Solar License Agreement
31 pages
Grade 7 English Curriculum Map
100% (8)
Grade 7 English Curriculum Map
19 pages
Mary Fae's Media Training Resume
No ratings yet
Mary Fae's Media Training Resume
4 pages
Garga Samhita
100% (1)
Garga Samhita
1,446 pages
Costing Accounting Problems
No ratings yet
Costing Accounting Problems
3 pages
Understanding ICT: Definitions & Distinctions
No ratings yet
Understanding ICT: Definitions & Distinctions
3 pages
The Girl I Fall in Love With
No ratings yet
The Girl I Fall in Love With
9 pages
Case # 89
No ratings yet
Case # 89
2 pages
NedapBrochure EAS ID - UK
No ratings yet
NedapBrochure EAS ID - UK
16 pages
Pesticides Use, Pesticides Trade
No ratings yet
Pesticides Use, Pesticides Trade
13 pages
Financial DD
No ratings yet
Financial DD
6 pages