28/04/2022 12:37 Modulo 4 - EDA.ipynb - Colaboratory
Data Science with Python
Exploratory Data Analysis
Prof.: Lucas Roberto Correa
REMINDER: Upload the datasets used here to the Colab environment before running the commands.
Package imports
!pip install sweetviz
Collecting sweetviz
Downloading [Link] (15.1 MB)
|████████████████████████████████| 15.1 MB 2.9 MB/s
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-package
Requirement already satisfied: matplotlib>=3.1.3 in /usr/local/lib/python3.7/dist-pac
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in /usr/local/l
Requirement already satisfied: tqdm>=4.43.0 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: jinja2>=2.11.1 in /usr/local/lib/python3.7/dist-packag
Requirement already satisfied: importlib-resources>=1.2.0 in /usr/local/lib/python3.7
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-pack
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-pac
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-pac
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (fr
Installing collected packages: sweetviz
Successfully installed sweetviz-2.1.3
import sweetviz as sv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython import display
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 10000)
Jardeilsom do Nascimento Oliveira - blakjd2@[Link] - IP: [Link]
Loading the dataset
Data source: [Link]
select=new_train.csv
metadata = pd.read_excel('[Link]')
metadata
    Feature    Feature_Type
0   age        numeric
1   job        Categorical,nominal  type of job ('admin.','blue-collar','entrepreneur','h
                                    employed','services','stude
2   marital    categorical,nominal  marital status ('divorced','married','single','unknown'; note:
3   education  categorical,nominal  ('basic.4y','basic.6y','basic.9y','[Link]','illiterate','professiona
4   default    categorical,nominal  has
5   housing    categorical,nominal  h
6   loan       categorical,nominal  h
7   contact    categorical,nominal  contact co
8   month      categorical,ordinal  last contact month
9   dayofweek  categorical,ordinal  last contact da
10  duration   numeric              last contact duration, in seconds. Important note: this attribute
11  campaign   numeric              number of contacts performed during this campaign a
12  pdays      numeric              number of days that passed by after the client was last co
                                    mea
13  previous   numeric              number of contacts performed
df = pd.read_csv('new_train.csv', sep=',')
df.head()
   age  job           marital   education  default  housing  loan  contact    month
0  49   blue-collar   married   basic.9y   unknown  no       no    cellular   nov
1  37   entrepreneur  married   [Link]     no       no       no    telephone  nov
2  78   retired       married   basic.4y   no       no       no    cellular   jul
3  36   admin.        married   [Link]     no       yes      no    telephone  may
4  59   retired       divorced  [Link]     no       no       no    cellular   jun
# Explore the sweetviz library's output in a separate window, with descriptive analysis and
report = sv.analyze(df)
report.show_html('[Link]')
Done! Use 'show' commands to display/save.
Report [Link] was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop
Basic statistics
# The 'info' method returns various information about the DataFrame, including the number
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32950 entries, 0 to 32949
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32950 non-null int64
1 job 32950 non-null object
2 marital 32950 non-null object
3 education 32950 non-null object
4 default 32950 non-null object
5 housing 32950 non-null object
6 loan 32950 non-null object
7 contact 32950 non-null object
8 month 32950 non-null object
9 day_of_week 32950 non-null object
10 duration 32950 non-null int64
11 campaign 32950 non-null int64
12 pdays 32950 non-null int64
13 previous 32950 non-null int64
14 poutcome 32950 non-null object
15 y 32950 non-null object
dtypes: int64(5), object(11)
memory usage: 4.0+ MB
# Number of rows and columns in the DataFrame
df.shape
(32950, 16)
# The len (length) function applied to a DataFrame returns the number of rows
len(df)
32950
# The nunique method returns the number of unique values for each variable (analogous to "remove duplica
df.nunique()
age 75
job 12
marital 4
education 8
default 3
housing 3
loan 3
contact 2
month 10
day_of_week 5
duration 1467
campaign 40
pdays 27
previous 8
poutcome 3
y 2
dtype: int64
Univariate analysis
# Return the first 5 rows of the column (5 is the default; this number can be changed
df['age'].head()
0 49
1 37
2 78
3 36
4 59
Name: age, dtype: int64
# Return the last 5 rows of the column (same default as 'head')
df['age'].tail()
32945 28
32946 52
32947 54
32948 29
32949 35
Name: age, dtype: int64
# Sum of all values in a column (here, the "age" column)
df['age'].sum()
1318465
# Minimum value observed in a given column
df['age'].min()
17
# Mean value
df['age'].mean()
40.01411229135053
# Maximum value
df['age'].max()
98
# Boxplot of the data in the "age" column. It is possible to see where the
sns.boxplot(x=df["age"])
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a334c6050>
# The histogram also makes the distribution of the data easier to see, which is fundamental in
plt.hist(df['age'], 50, facecolor='b')
plt.show()
Basic descriptive measures
df.describe(include='int64')
age duration campaign pdays previous
count 32950.000000 32950.000000 32950.000000 32950.000000 32950.000000
mean 40.014112 258.127466 2.560607 962.052413 0.174719
std 10.403636 258.975917 2.752326 187.951096 0.499025
min 17.000000 0.000000 1.000000 0.000000 0.000000
25% 32.000000 103.000000 1.000000 999.000000 0.000000
50% 38.000000 180.000000 2.000000 999.000000 0.000000
75% 47.000000 319.000000 3.000000 999.000000 0.000000
max 98.000000 4918.000000 56.000000 999.000000 7.000000
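The quartiles in this table are the same values a boxplot draws. As a minimal sketch, assuming the common 1.5 x IQR whisker convention (the default in matplotlib and seaborn), the limits for `age` can be computed from its 25% and 75% rows. `iqr_fences` is a hypothetical helper name used only for illustration, not a pandas function.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of `age` copied from the describe() output above.
lower, upper = iqr_fences(32.0, 47.0)
print(lower, upper)  # 9.5 69.5
```

Since the minimum age is 17, under this convention only ages above 69.5 would be drawn as individual outlier points.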
df.describe(include='object')
        job     marital  education  default  housing  loan   contact   month  day_
count   32950   32950    32950      32950    32950    32950  32950     32950
unique  12      4        8          3        3        3      2         10
top     admin.  married  [Link]     no       yes      no     cellular  may
freq    8314    19953    9736       26007    17254    27131  20908     11011
Missing-value analysis
df.isnull().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
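Every count is zero because this check only detects real NaN/None values. In this dataset, missing information appears to be stored as the string 'unknown' instead (see `default` in the first row of the data and the `unknown` job category), so a complementary check is to count that label per column. A minimal sketch on a small hypothetical frame; `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are illustrative only.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "retired", "unknown"],
    "default": ["no", "unknown", "no", "no"],
    "age": [49, 37, 78, 36],
})

# Compare every cell with the placeholder label and count matches per column.
# Numeric columns simply never match a string, so their count is 0.
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```

The same expression applied to `df` would show how many rows of each column carry the placeholder.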
Frequency table
df['poutcome'].value_counts()
nonexistent 28416
failure 3429
success 1105
Name: poutcome, dtype: int64
df['contact'].value_counts()
cellular 20908
telephone 12042
Name: contact, dtype: int64
df['age'].value_counts().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31f76490>
prev_y = pd.crosstab(index=df["previous"], columns=df["y"], margins=True)
prev_y
y no yes All
previous
0 25915 2501 28416
1 2889 784 3673
2 324 282 606
3 74 101 175
4 29 31 60
5 4 10 14
6 2 3 5
7 1 0 1
All 29238 3712 32950
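Raw counts make the rows hard to compare because the groups differ greatly in size. A minimal sketch, assuming pandas' `pd.crosstab` with its `normalize='index'` option, turns each row into proportions. The small frame below is hypothetical, not the course data.

```python
import pandas as pd

# Hypothetical data: number of previous contacts and campaign outcome.
toy = pd.DataFrame({
    "previous": [0, 0, 0, 0, 1, 1, 2, 2],
    "y": ["no", "no", "no", "yes", "no", "yes", "yes", "yes"],
})

# normalize="index" divides each row by its own total, so rows sum to 1.
rates = pd.crosstab(index=toy["previous"], columns=toy["y"], normalize="index")
print(rates)
```

Applied to the table above, the share of 'yes' rises from 2501/28416 (about 8.8%) with no previous contact to 784/3673 (about 21.3%) with one.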
job_y = pd.crosstab(index=df["job"], columns=df["y"], margins=True)
job_y
y no yes All
job
admin. 7244 1070 8314
blue-collar 6926 515 7441
entrepreneur 1060 100 1160
housemaid 769 86 855
management 2076 269 2345
retired 1018 348 1366
self-employed 980 119 1099
services 2942 254 3196
student 494 217 711
technician 4815 585 5400
unemployed 682 116 798
unknown 232 33 265
All 29238 3712 32950
Histogram
df.dtypes
age int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
sns.histplot(data=df, x="pdays")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31ea48d0>
sns.histplot(data=df, x="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a31e146d0>
df['duration'].describe()
count 32950.000000
mean 258.127466
std 258.975917
min 0.000000
25% 103.000000
50% 180.000000
75% 319.000000
max 4918.000000
Name: duration, dtype: float64
df['duration'].median()
180.0
df['duration'].mode()
0 90
dtype: int64
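For `duration`, the mean (258.1) sits well above the median (180), which in turn sits above the mode (90). That ordering is typical of a right-skewed distribution with a long tail of large values. A minimal sketch of how to quantify this with pandas' `Series.skew`; the series below is a hypothetical sample, not the course data.

```python
import pandas as pd

# Hypothetical call durations in seconds, with one very long call.
durations = pd.Series([90, 90, 120, 180, 200, 319, 1148])

mean_value = durations.mean()
median_value = durations.median()
skewness = durations.skew()  # positive values indicate a longer right tail

print(mean_value, median_value, skewness)
```

A positive skewness together with mean > median suggests that the median is the more representative "typical" value for this kind of column.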
sns.histplot(data=df, x="campaign")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a33900790>
Boxplot
sns.boxplot(x=df["campaign"])
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a338915d0>
df['campaign'].value_counts()
1 14121
2 8469
3 4300
4 2116
5 1255
6 773
7 493
8 329
9 220
10 187
11 142
12 92
13 74
14 52
17 51
15 45
16 42
18 27
20 22
21 20
19 16
22 13
24 12
23 12
27 9
25 8
26 7
31 7
29 7
28 6
30 6
35 4
33 3
43 2
32 2
42 2
34 1
37 1
40 1
56 1
Name: campaign, dtype: int64
[Link]("[Link]")
Scatter plot
df.dtypes
age int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
sns.scatterplot(data=df, x="campaign", y="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2ddfa950>
sns.scatterplot(data=df, x="pdays", y="duration")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2dd7c110>
Correlations
df.corr()
age duration campaign pdays previous
age 1.000000 -0.001841 0.003302 -0.032011 0.020670
duration -0.001841 1.000000 -0.075663 -0.047127 0.022538
campaign 0.003302 -0.075663 1.000000 0.053795 -0.079051
pdays -0.032011 -0.047127 0.053795 1.000000 -0.589601
previous 0.020670 0.022538 -0.079051 -0.589601 1.000000
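The matrix contains only the five integer columns because the text columns cannot be correlated numerically. Recent pandas releases (2.0 and later) raise an error instead of silently dropping non-numeric columns, so a version-independent sketch is to select the numeric columns explicitly before computing the correlation. The frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical miniature frame mixing numeric and text columns.
toy = pd.DataFrame({
    "age": [49, 37, 78, 36, 59],
    "duration": [227, 202, 1148, 120, 368],
    "job": ["blue-collar", "entrepreneur", "retired", "admin.", "retired"],
})

# Keep only numeric columns, then compute the Pearson correlation matrix.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```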
sns.heatmap(df.corr(), annot=True, fmt="f")
<matplotlib.axes._subplots.AxesSubplot at 0x7f9a2dd624d0>
Plotting categorical variables
sns.catplot(x="duration", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2dd82750>
sns.catplot(x="campaign", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2dc0b650>
sns.catplot(x="age", y="y", data=df)
<seaborn.axisgrid.FacetGrid at 0x7f9a2db7ec50>
Multivariate analysis
sns.scatterplot(x="age", y="duration", hue="y", data=df);
Principal Component Analysis (PCA) in the context of multivariate analysis
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
metadata
    Feature    Feature_Type
0   age        numeric
1   job        Categorical,nominal  type of job ('admin.','blue-collar','entrepreneur','h
                                    employed','services','stude
2   marital    categorical,nominal  marital status ('divorced','married','single','unknown'; note:
3   education  categorical,nominal  ('basic.4y','basic.6y','basic.9y','[Link]','illiterate','professiona
4   default    categorical,nominal  has
5   housing    categorical,nominal  h
6   loan       categorical,nominal  h
7   contact    categorical,nominal  contact co
8   month      categorical,ordinal  last contact month
9   dayofweek  categorical,ordinal  last contact da
10  duration   numeric              last contact duration, in seconds. Important note: this attribute
11  campaign   numeric              number of contacts performed during this campaign a
12  pdays      numeric              number of days that passed by after the client was last co
                                    mea
13  previous   numeric              number of contacts performed
14  poutcome   categorical,nominal  outcome of the previous marketing ca
df_pca = df[['age', 'duration','campaign','pdays','previous']]
df_pca.head()
   age  duration  campaign  pdays  previous
0   49       227         4    999         0
1   37       202         2    999         1
2   78      1148         1    999         0
3   36       120         2    999         0
4   59       368         2    999         0
pca = PCA(n_components=2, random_state=42)
df_expl_pca = StandardScaler().fit_transform(df_pca)
df_expl_pca
array([[ 0.86373877, -0.12019627, 0.52298128, 0.19658384, -0.35012691],
[-0.28972159, -0.2167318 , -0.20368791, 0.19658384, 1.65381294],
[ 3.65126795, 3.43617293, -0.56702251, 0.19658384, -0.35012691],
...,
[ 1.34434725, -0.49089273, 0.52298128, 0.19658384, -0.35012691],
[-1.05869515, -0.3596044 , -0.56702251, 0.19658384, -0.35012691],
[-0.48196498, 1.10387435, 0.15964669, 0.19658384, -0.35012691]])
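Each column of this array has been centered and scaled so that it has mean 0 and standard deviation 1, which keeps large-valued columns such as `pdays` from dominating the PCA. A minimal NumPy sketch of the same transformation, assuming the population standard deviation (ddof=0) that `StandardScaler` uses; `standardize` is a hypothetical helper name and the input rows are only a small illustrative sample.

```python
import numpy as np


def standardize(x):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Illustrative sample: (age, duration) pairs.
toy = np.array([[49, 227], [37, 202], [78, 1148], [36, 120], [59, 368]])
z = standardize(toy)

print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

Because `describe()` reports the sample standard deviation (ddof=1), dividing by the `std` row of that table reproduces the scaled values only approximately.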
result_pca = pca.fit_transform(df_expl_pca)
result_pca_df = pd.DataFrame(result_pca,
                             columns=['component1', 'component2'])
result_pca_df
component1 component2
0 -0.425175 -0.509855
1 1.005371 -0.146158
2 0.265589 2.274575
3 -0.421084 -0.115342
4 -0.197363 0.194940
... ... ...
32945 -0.379635 0.451884
32946 1.095991 -0.530097
32947 -0.433674 -0.855301
32948 -0.384307 0.361312
32949 -0.324058 0.829408
32950 rows × 2 columns
How much of the variability in the data am I able to explain?
pca.explained_variance_ratio_
array([0.32246681, 0.2116934 ])
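To answer the question, the ratios can simply be accumulated. A minimal sketch using the two values printed by `pca.explained_variance_ratio_` (copied here by hand, so no model needs to be fitted):

```python
import numpy as np

# Explained variance ratios of the two components, copied from the output above.
ratios = np.array([0.32246681, 0.2116934])

# Running total: variance retained by the first k components.
cumulative = np.cumsum(ratios)
print(cumulative)
```

Together the two components retain roughly 53% of the variance of the five standardized columns, so the 2D projection leaves nearly half of the variability out.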
df_resp_pca = pd.concat([df['y'], result_pca_df], axis=1)
df_resp_pca
y component1 component2
0 no -0.425175 -0.509855
1 no 1.005371 -0.146158
2 yes 0.265589 2.274575
3 no -0.421084 -0.115342
4 no -0.197363 0.194940
... ... ... ...
32945 no -0.379635 0.451884
32946 no 1.095991 -0.530097
32947 no -0.433674 -0.855301
32948 no -0.384307 0.361312
32949 no -0.324058 0.829408
32950 rows × 3 columns
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Component_1', fontsize=15)
ax.set_ylabel('Component_2', fontsize=15)
ax.set_title('PCA 2 componentes', fontsize=20)
targets = ['yes', 'no']
colors = ['r', 'b']
# Plot each class of y in its own color
for target, color in zip(targets, colors):
    indicesToKeep = df_resp_pca['y'] == target
    ax.scatter(df_resp_pca.loc[indicesToKeep, 'component1'],
               df_resp_pca.loc[indicesToKeep, 'component2'],
               c=color,
               s=50)
ax.legend(targets)
ax.grid()