Data Frames-2
Data Frames-2
1 Introduction
Dans cette partie du cours nous allons découvrir ensemble les fonctionnalités de pandas qui per-
mettent l’analyse des données.
Les données que nous allons utiliser dans notre cours sont les données concernant les bilionnaires
du monde.
Notre base de donnée contient des informations sur ces bilionnaires tels que leurs industries, leurs
informations personnelles et leur fortune.
[8]: data.head()
1
4 United States … 7.5 $21,427,700,000,000
gross_tertiary_education_enrollment \
0 65.6
1 88.2
2 88.2
3 88.2
4 88.2
gross_primary_education_enrollment_country life_expectancy_country \
0 102.5 82.5
1 101.8 78.5
2 101.8 78.5
3 101.8 78.5
4 101.8 78.5
latitude_country longitude_country
0 46.227638 2.213749
1 37.090240 -95.712891
2 37.090240 -95.712891
3 37.090240 -95.712891
4 37.090240 -95.712891
[5 rows x 35 columns]
[10]: data.tail()
2
industries countryOfCitizenship … cpi_change_country \
2635 Healthcare China … 2.9
2636 Food & Beverage United States … 7.5
2637 Manufacturing China … 2.9
2638 Real Estate China … 2.9
2639 Diversified Philippines … 2.5
gdp_country gross_tertiary_education_enrollment \
2635 $19,910,000,000,000 50.6
2636 $21,427,700,000,000 88.2
2637 $19,910,000,000,000 50.6
2638 $19,910,000,000,000 50.6
2639 $376,795,508,680 35.5
gross_primary_education_enrollment_country life_expectancy_country \
2635 100.2 77.0
2636 101.8 78.5
2637 100.2 77.0
2638 100.2 77.0
2639 107.5 71.1
latitude_country longitude_country
2635 35.861660 104.195397
2636 37.090240 -95.712891
2637 35.861660 104.195397
2638 35.861660 104.195397
2639 12.879721 121.774017
[5 rows x 35 columns]
[11]: data.shape
[12]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2640 entries, 0 to 2639
Data columns (total 35 columns):
# Column Non-Null Count Dtype
3
--- ------ -------------- -----
0 rank 2640 non-null int64
1 finalWorth 2640 non-null int64
2 category 2640 non-null object
3 personName 2640 non-null object
4 age 2575 non-null float64
5 country 2602 non-null object
6 city 2568 non-null object
7 source 2640 non-null object
8 industries 2640 non-null object
9 countryOfCitizenship 2640 non-null object
10 organization 325 non-null object
11 selfMade 2640 non-null bool
12 status 2640 non-null object
13 gender 2640 non-null object
14 birthDate 2564 non-null object
15 lastName 2640 non-null object
16 firstName 2637 non-null object
17 title 339 non-null object
18 date 2640 non-null object
19 state 753 non-null object
20 residenceStateRegion 747 non-null object
21 birthYear 2564 non-null float64
22 birthMonth 2564 non-null float64
23 birthDay 2564 non-null float64
24 cpi_country 2456 non-null float64
25 cpi_change_country 2456 non-null float64
26 gdp_country 2476 non-null object
27 gross_tertiary_education_enrollment 2458 non-null float64
28 gross_primary_education_enrollment_country 2459 non-null float64
29 life_expectancy_country 2458 non-null float64
30 tax_revenue_country_country 2457 non-null float64
31 total_tax_rate_country 2458 non-null float64
32 population_country 2476 non-null float64
33 latitude_country 2476 non-null float64
34 longitude_country 2476 non-null float64
dtypes: bool(1), float64(14), int64(2), object(18)
memory usage: 704.0+ KB
[13]: data.describe()
4
mean 1289.159091 4623.787879 65.140194 1957.183307 5.740250
std 739.693726 9834.240939 13.258098 13.282516 3.710085
min 1.000000 1000.000000 18.000000 1921.000000 1.000000
25% 659.000000 1500.000000 56.000000 1948.000000 2.000000
50% 1312.000000 2300.000000 65.000000 1957.000000 6.000000
75% 1905.000000 4200.000000 75.000000 1966.000000 9.000000
max 2540.000000 211000.000000 101.000000 2004.000000 12.000000
gross_tertiary_education_enrollment \
count 2458.000000
mean 67.225671
std 21.343426
min 4.000000
25% 50.600000
50% 65.600000
75% 88.200000
max 136.600000
gross_primary_education_enrollment_country life_expectancy_country \
count 2459.000000 2458.000000
mean 102.858520 78.122823
std 4.710977 3.730099
min 84.700000 54.300000
25% 100.200000 77.000000
50% 101.800000 78.500000
75% 102.600000 80.900000
max 142.100000 84.200000
tax_revenue_country_country total_tax_rate_country \
count 2457.000000 2458.000000
mean 12.546235 43.963344
std 5.368625 12.145296
min 0.100000 9.900000
25% 9.600000 36.600000
50% 9.600000 41.200000
75% 12.800000 59.100000
max 37.200000 106.300000
5
population_country latitude_country longitude_country
count 2.476000e+03 2476.000000 2476.000000
mean 5.102053e+08 34.903592 12.583156
std 5.542447e+08 17.003497 86.762989
min 3.801900e+04 -40.900557 -106.346771
25% 6.683440e+07 35.861660 -95.712891
50% 3.282395e+08 37.090240 10.451526
75% 1.366418e+09 40.463667 104.195397
max 1.397715e+09 61.924110 174.885971
Dans ce cas la méthode describe n’affiche que des statistiques sur les données de type numérique.
Les données de type object ont été exclus automatiquement. Pour ajouter tous les champs, il suffit
d’ajouter la commande include = “all” dans la commande describe
6
mean NaN NaN … 4.364169
std NaN NaN … 3.623763
min NaN NaN … -1.900000
25% NaN NaN … 1.700000
50% NaN NaN … 2.900000
75% NaN NaN … 7.500000
max NaN NaN … 53.500000
gdp_country gross_tertiary_education_enrollment \
count 2476 2458.000000
unique 68 NaN
top $21,427,700,000,000 NaN
freq 754 NaN
mean NaN 67.225671
std NaN 21.343426
min NaN 4.000000
25% NaN 50.600000
50% NaN 65.600000
75% NaN 88.200000
max NaN 136.600000
gross_primary_education_enrollment_country life_expectancy_country \
count 2459.000000 2458.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 102.858520 78.122823
std 4.710977 3.730099
min 84.700000 54.300000
25% 100.200000 77.000000
50% 101.800000 78.500000
75% 102.600000 80.900000
max 142.100000 84.200000
7
latitude_country longitude_country
count 2476.000000 2476.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 34.903592 12.583156
std 17.003497 86.762989
min -40.900557 -106.346771
25% 35.861660 -95.712891
50% 37.090240 10.451526
75% 40.463667 104.195397
max 61.924110 174.885971
Nous pouvons aussi exclure certain type de describe en ajoutant la commande exclude
8
top Finance & Investments United States … NaN
freq 372 735 … NaN
mean NaN NaN … 4.364169
std NaN NaN … 3.623763
min NaN NaN … -1.900000
25% NaN NaN … 1.700000
50% NaN NaN … 2.900000
75% NaN NaN … 7.500000
max NaN NaN … 53.500000
gdp_country gross_tertiary_education_enrollment \
count 2476 2458.000000
unique 68 NaN
top $21,427,700,000,000 NaN
freq 754 NaN
mean NaN 67.225671
std NaN 21.343426
min NaN 4.000000
25% NaN 50.600000
50% NaN 65.600000
75% NaN 88.200000
max NaN 136.600000
gross_primary_education_enrollment_country life_expectancy_country \
count 2459.000000 2458.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 102.858520 78.122823
std 4.710977 3.730099
min 84.700000 54.300000
25% 100.200000 77.000000
50% 101.800000 78.500000
75% 102.600000 80.900000
max 142.100000 84.200000
9
max 37.200000 106.300000 1.397715e+09
latitude_country longitude_country
count 2476.000000 2476.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 34.903592 12.583156
std 17.003497 86.762989
min -40.900557 -106.346771
25% 35.861660 -95.712891
50% 37.090240 10.451526
75% 40.463667 104.195397
max 61.924110 174.885971
Nous pouvons aussi personnaliser les percentiles que nous avons déjà dans la commande describe
[16]: data.describe(percentiles=[.01,.10,.90,.99])
gross_tertiary_education_enrollment \
count 2458.000000
mean 67.225671
10
std 21.343426
min 4.000000
1% 23.900000
10% 36.300000
50% 65.600000
90% 88.200000
99% 113.100000
max 136.600000
gross_primary_education_enrollment_country life_expectancy_country \
count 2459.000000 2458.000000
mean 102.858520 78.122823
std 4.710977 3.730099
min 84.700000 54.300000
1% 93.200000 69.400000
10% 100.200000 72.700000
50% 101.800000 78.500000
90% 108.400000 82.800000
99% 126.600000 84.200000
max 142.100000 84.200000
tax_revenue_country_country total_tax_rate_country \
count 2457.000000 2458.000000
mean 12.546235 43.963344
std 5.368625 12.145296
min 0.100000 9.900000
1% 9.400000 20.742000
10% 9.400000 29.500000
50% 9.600000 41.200000
90% 23.100000 59.200000
99% 27.900000 65.100000
max 37.200000 106.300000
11
1.2 Gérer les valeurs manquantes
Pandas utilise la valeur NaN (Not a Number) pour réprésenter les valeurs manquantes dans un jeu
de données.
Une fonction Pandas que l’on peut utiliser pour savoir si certaines données sont manquantes dans
le Dataframe est la fonction isnull() : cette fonction retourne True lorsque la valeur testée est
manquante (NaN) et False sinon.
[17]: data.isnull()
gross_tertiary_education_enrollment \
0 False
1 False
2 False
3 False
4 False
… …
2635 False
2636 False
2637 False
2638 False
12
2639 False
gross_primary_education_enrollment_country life_expectancy_country \
0 False False
1 False False
2 False False
3 False False
4 False False
… … …
2635 False False
2636 False False
2637 False False
2638 False False
2639 False False
latitude_country longitude_country
0 False False
1 False False
2 False False
3 False False
4 False False
… … …
2635 False False
2636 False False
2637 False False
2638 False False
2639 False False
rank 0
finalWorth 0
13
category 0
personName 0
age 65
country 38
city 72
source 0
industries 0
countryOfCitizenship 0
organization 2315
selfMade 0
status 0
gender 0
birthDate 76
lastName 0
firstName 3
title 2301
date 0
state 1887
residenceStateRegion 1893
birthYear 76
birthMonth 76
birthDay 76
cpi_country 184
cpi_change_country 184
gdp_country 164
gross_tertiary_education_enrollment 182
gross_primary_education_enrollment_country 181
life_expectancy_country 182
tax_revenue_country_country 183
total_tax_rate_country 182
population_country 164
latitude_country 164
longitude_country 164
dtype: int64
Dans un premier temps, il peut être souhaitable de filtrer/retirer les valeurs manquantes dans le
dataframe.
Pour cela on utilise la fonction dropna(). Cette fonction possède plusieurs arguments, permettant
de moduler le niveau de filtrage des valeurs manquantes :
axis= (0 ou 1) : filtrage des valeurs manquantes selon les lignes (axis = 0) ou les colonnes (axis=1).
Valeur par défaut : axis=0. how=all : argument permettant de ne retirer que les lignes (pour
axis=0) ou toutes les colonnes (pour axis=1) du dataframe pour lequelles toutes les valeurs sont
manquantes. → Attention : si cet argument n’est pas spécifié, chaque ligne (resp. chaque colonne)
contenant au moins une donnée manquante sera supprimée !
inplace=True : applique directement la modification dans le dataframe (sans créer un autre
dataframe). thresh= (nombre entier n) : permet de ne conserver que les lignes (resp. les colonnes)
contenant au moins n valeurs non manquantes.
14
[22]: data.dropna(inplace=True)
data
gdp_country gross_tertiary_education_enrollment \
1 $21,427,700,000,000 88.2
2 $21,427,700,000,000 88.2
3 $21,427,700,000,000 88.2
4 $21,427,700,000,000 88.2
15
5 $21,427,700,000,000 88.2
… … …
2519 $21,427,700,000,000 88.2
2566 $21,427,700,000,000 88.2
2581 $21,427,700,000,000 88.2
2602 $21,427,700,000,000 88.2
2608 $21,427,700,000,000 88.2
gross_primary_education_enrollment_country life_expectancy_country \
1 101.8 78.5
2 101.8 78.5
3 101.8 78.5
4 101.8 78.5
5 101.8 78.5
… … …
2519 101.8 78.5
2566 101.8 78.5
2581 101.8 78.5
2602 101.8 78.5
2608 101.8 78.5
latitude_country longitude_country
1 37.09024 -95.712891
2 37.09024 -95.712891
3 37.09024 -95.712891
4 37.09024 -95.712891
5 37.09024 -95.712891
… … …
2519 37.09024 -95.712891
2566 37.09024 -95.712891
2581 37.09024 -95.712891
2602 37.09024 -95.712891
2608 37.09024 -95.712891
16
[238 rows x 35 columns]
Pour remplacer les valeurs manquantes au lieu de les supprimer nous allons utiliser la commande
fillna
gdp_country gross_tertiary_education_enrollment \
17
1 $21,427,700,000,000 88.2
2 $21,427,700,000,000 88.2
3 $21,427,700,000,000 88.2
4 $21,427,700,000,000 88.2
5 $21,427,700,000,000 88.2
… … …
2519 $21,427,700,000,000 88.2
2566 $21,427,700,000,000 88.2
2581 $21,427,700,000,000 88.2
2602 $21,427,700,000,000 88.2
2608 $21,427,700,000,000 88.2
gross_primary_education_enrollment_country life_expectancy_country \
1 101.8 78.5
2 101.8 78.5
3 101.8 78.5
4 101.8 78.5
5 101.8 78.5
… … …
2519 101.8 78.5
2566 101.8 78.5
2581 101.8 78.5
2602 101.8 78.5
2608 101.8 78.5
latitude_country longitude_country
1 37.09024 -95.712891
2 37.09024 -95.712891
3 37.09024 -95.712891
4 37.09024 -95.712891
5 37.09024 -95.712891
… … …
2519 37.09024 -95.712891
2566 37.09024 -95.712891
18
2581 37.09024 -95.712891
2602 37.09024 -95.712891
2608 37.09024 -95.712891
[ ]:
19