10/1/2021 laboratorio 9 - Jupyter Notebook
Lab 9
This is a binary classification project whose goal is to predict whether an NBA player's
career will last at least 5 years. The data contain the players' career performance
metrics, from which the predictions are to be made.
Attribute information
Name: Player name
GP: Games played
MIN: Minutes played
PTS: Points per game
FGM: Field goals made
FGA: Field goals attempted
FG%: Field goal percentage
3P Made: 3-pointers made
3PA: 3-point attempts
3P%: 3-point percentage
FTM: Free throws made
FTA: Free throws attempted
FT%: Free throw percentage
OREB: Offensive rebounds
DREB: Defensive rebounds
REB: Total rebounds
AST: Assists
STL: Steals
BLK: Blocks
TOV: Turnovers
TARGET_5Yrs: Target variable (1 if career length >= 5 years, 0 if < 5)
In [3]: import numpy as np
import pandas as pd

df_arrest = pd.read_csv('C:/Users/Juan Carlos/Desktop/Python/9NA CLASE/nba_l

df_arrest["TARGET_5Yrs"] = df_arrest["TARGET_5Yrs"].astype('int64')
df_arrest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1329 entries, 0 to 1328
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 1329 non-null object
1 GP 1329 non-null int64
2 MIN 1329 non-null float64
3 PTS 1329 non-null float64
4 FGM 1329 non-null float64
5 FGA 1329 non-null float64
6 FG% 1329 non-null float64
7 3P Made 1329 non-null float64
8 3PA 1329 non-null float64
9 3P% 1329 non-null float64
10 FTM 1329 non-null float64
11 FTA 1329 non-null float64
12 FT% 1329 non-null float64
13 OREB 1329 non-null float64
14 DREB 1329 non-null float64
15 REB 1329 non-null float64
16 AST 1329 non-null float64
17 STL 1329 non-null float64
18 BLK 1329 non-null float64
19 TOV 1329 non-null float64
20 TARGET_5Yrs 1329 non-null int64
dtypes: float64(18), int64(2), object(1)
memory usage: 218.2+ KB
Activity
1. Standardize only the continuous variables
2. PCA plot
3. Kaiser criterion with the explained variance of the first 3 components
In [4]: from sklearn.preprocessing import StandardScaler
# Separate the continuous variables
continuas = ['GP','MIN','PTS','FGM','FGA','FG%','3P Made','3PA','3P%','FTM',
             'FTA','FT%','OREB','DREB','REB','AST','STL','BLK','TOV']
x = df_arrest.loc[:, continuas].values
# Separate the target variable
y = df_arrest['TARGET_5Yrs'].tolist()
Standardize only the continuous variables
In [11]: from sklearn.model_selection import train_test_split
In [12]: X_train, X_test, y_train, y_test = \
    train_test_split(x,
                     y,
                     test_size=0.3,
                     stratify=y,
                     random_state=0)
In [9]: from sklearn.preprocessing import StandardScaler
In [10]: sc = StandardScaler()
In [13]: X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
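The scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into the transform. A minimal sketch of that pattern with made-up toy data (the shapes and values here are hypothetical, not the NBA dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the continuous features
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
X_te = rng.normal(loc=10.0, scale=3.0, size=(30, 4))

sc = StandardScaler()
X_tr_std = sc.fit_transform(X_tr)   # learn mean/std from the training data only
X_te_std = sc.transform(X_te)       # reuse those statistics on the test data

# Training columns are exactly zero-mean, unit-variance;
# test columns are only approximately so.
print(X_tr_std.mean(axis=0).round(6))
```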
In [14]: df_cont = pd.DataFrame(X_train_std,
                                columns=['GP','MIN','PTS','FGM','FGA','FG%','3P Made','3PA','3P%','FTM',
                                         'FTA','FT%','OREB','DREB','REB','AST','STL','BLK','TOV'])
df_cont.head()
Out[14]:
GP MIN PTS FGM FGA FG% 3P Made 3PA 3P%
0 -0.719662 -0.995681 -1.118599 -1.139285 -0.943118 -2.419002 -0.652122 -0.739133 -1.194836
1 -2.848945 -0.842179 -1.005949 -0.906612 -1.134714 2.561225 -0.652122 -0.739133 -1.194836
2 0.143561 -0.440712 -0.487761 -0.441266 -0.477813 0.394503 -0.652122 -0.739133 -1.194836
3 -0.489469 -0.606022 -0.915829 -0.906612 -1.079972 1.494033 -0.652122 -0.739133 -1.194836
4 -0.662114 -1.007489 -1.118599 -1.139285 -1.079972 -1.853067 -0.390259 -0.176618 0.133145
PCA plot
In [16]: import numpy as np
import math
import scipy.stats as stats

df_corr = df_cont.corr(method="pearson")
df_corr
Out[16]: (19×19 Pearson correlation matrix; the leftmost columns were cut off in the export, so only the FGA–OREB block is legible)
In [17]: import numpy as np
import matplotlib.pyplot as plt

cov_mat = np.cov(X_train_std.T)
In [18]: eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
In [19]: print('\nEigenvalues \n%s' % eigen_vals)
Eigenvalues
[9.62886703e+00 3.88752049e+00 1.14785067e+00 8.82468378e-01
7.41793351e-01 5.71211935e-01 5.04693504e-01 4.58072858e-01
4.23206166e-01 2.50445431e-01 2.36815437e-01 1.01482688e-01
1.05137644e-01 5.33749414e-02 1.43320136e-02 7.94217812e-03
4.71724574e-03 3.49739946e-04 1.70394948e-04]
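Per the Kaiser criterion (activity 3), one keeps the components whose eigenvalue exceeds 1; with the eigenvalues printed above, that leaves three components. A quick check using the printed values:

```python
import numpy as np

# Eigenvalues as printed in In [19]; the Kaiser criterion keeps
# every component whose eigenvalue is greater than 1.
eigen_vals = np.array([9.62886703, 3.88752049, 1.14785067, 0.88246838,
                       0.74179335, 0.57121194, 0.50469350, 0.45807286,
                       0.42320617, 0.25044543, 0.23681544, 0.10148269,
                       0.10513764, 0.05337494, 0.01433201, 0.00794218,
                       0.00471725, 0.00034974, 0.00017039])
n_keep = int(np.sum(eigen_vals > 1))
print(n_keep)  # → 3
```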
In [20]: from sklearn.decomposition import PCA
In [21]: pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
Out[21]: array([5.06237548e-01, 2.04386335e-01, 6.03482329e-02, 4.63957625e-02,
3.89997749e-02, 3.00314594e-02, 2.65342538e-02, 2.40831740e-02,
2.22500582e-02, 1.31671650e-02, 1.24505682e-02, 5.52761017e-03,
5.33545087e-03, 2.80618679e-03, 7.53505411e-04, 4.17559902e-04,
2.48009128e-04, 1.83875727e-05, 8.95851200e-06])
In [23]: np.cumsum(pca.explained_variance_ratio_)[5]
Out[23]: 0.886399112348117
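The explained variance of the first three components, asked for in activity 3, can be read off Out[21] directly; summing the first three printed ratios gives roughly 77% of the total variance:

```python
import numpy as np

# First three explained-variance ratios as printed in Out[21]
ratios = np.array([0.50623755, 0.20438633, 0.06034823])
print(ratios.sum())  # ≈ 0.771, i.e. about 77% of the variance
```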
In [25]: import matplotlib.pyplot as plt
plt.bar(range(1, 20), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.step(range(1, 20), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')

plt.show()
In [26]: pca = PCA(n_components=3)
X_std = pca.fit_transform(X_train_std)
In [28]: df_x = pd.DataFrame(X_std)
df_x.columns = ['PC1', 'PC2', 'PC3']
df_x.head()
Out[28]:
PC1 PC2 PC3
0 -3.391530 -0.864992 -0.320351
1 -2.910358 -3.291920 -1.126430
2 -1.007929 -1.970897 0.059549
3 -1.918537 -2.614837 -0.269469
4 -3.354599 0.922713 0.362723
In [30]: df_y = pd.DataFrame(y_train)
df_y.columns = ['TARGET_5Yrs']
df_y.head()
Out[30]:
TARGET_5Yrs
0 0
1 0
2 1
3 0
4 1
In [31]: df_rd = pd.concat([df_x, df_y], axis=1)
df_rd.head(10)
Out[31]:
PC1 PC2 PC3 TARGET_5Yrs
0 -3.391530 -0.864992 -0.320351 0
1 -2.910358 -3.291920 -1.126430 0
2 -1.007929 -1.970897 0.059549 1
3 -1.918537 -2.614837 -0.269469 0
4 -3.354599 0.922713 0.362723 1
5 -1.726465 2.121716 -1.083186 0
6 0.030791 -2.792813 -0.359631 1
7 -4.168896 -0.926062 -0.169454 1
8 -0.659899 -2.469046 0.034734 1
9 0.788681 -2.029191 -0.268184 1
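The whole reduction above amounts to a scaler and a PCA fitted on the training split and then applied to the held-out split. A self-contained sketch of that pipeline, using synthetic random data in place of the 19 NBA features (shapes and seed are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the 19 continuous features
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 19))
X_test = rng.normal(size=(40, 19))

sc = StandardScaler()
pca = PCA(n_components=3)

# Fit scaler and PCA on the training data, then project the test set
X_train_pca = pca.fit_transform(sc.fit_transform(X_train))
X_test_pca = pca.transform(sc.transform(X_test))

print(X_train_pca.shape, X_test_pca.shape)  # (100, 3) (40, 3)
```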