Data Visualization
Key skill today
“The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it - that’s going to be a hugely important skill
in the next decades.”
Hal Varian (Google’s Chief Economist) ([Link])
Data Visualization for a Data Scientist
1. Data Quality: Explore data quality including identifying outliers
2. Data Exploration: Understand the data by visualizing it
3. Data Presentation: Present results
The power of Data Visualization
Consider the following data.
What is the connection?
Do you see any patterns?
In [2]: import pandas as pd
In [8]: sample = pd.read_csv('files/sample_corr.csv')
In [9]: sample
Out[9]: x y
0 1.105722 1.320945
1 1.158193 1.480131
2 1.068022 1.173479
3 1.131291 1.294706
4 1.125997 1.293024
5 1.037332 0.977393
6 1.051670 1.040798
7 0.971699 0.977604
8 1.102914 1.127956
9 1.164161 1.431070
10 1.161464 1.344481
11 1.080161 1.191159
12 0.996044 0.997308
13 1.143305 1.412850
14 1.062949 1.139761
15 1.149252 1.455886
16 1.190105 1.489407
17 1.026498 1.153031
18 1.110015 1.329586
19 1.077741 1.277995
Visualizing the same data
Let's try to visualize the data
Matplotlib ([Link]) is an easy-to-use visualization library for Python.
In notebooks you get started with:
import matplotlib.pyplot as plt
%matplotlib inline
In [12]: import matplotlib.pyplot as plt
%matplotlib inline
In [13]: sample.plot.scatter(x='x', y='y')
Out[13]: <AxesSubplot:xlabel='x', ylabel='y'>
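A minimal sketch of the same idea that runs without the course files. The data here is synthetic (an assumption, not the contents of files/sample_corr.csv): y is built as a noisy linear function of x, so the scatter plot shows the connection that is hard to see in a table of numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for files/sample_corr.csv (assumption: not the original data)
rng = np.random.default_rng(0)
x = rng.uniform(0.95, 1.20, size=20)
y = 2.0 * x - 0.9 + rng.normal(0, 0.03, size=20)  # noisy linear relation
sample = pd.DataFrame({'x': x, 'y': y})

# 20 rows of numbers are hard to read; one scatter plot shows the trend at once
ax = sample.plot.scatter(x='x', y='y')
```

In a notebook with %matplotlib inline the figure is shown below the cell.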
What Data Visualization gives
Absorb information quickly
Improve insights
Make faster decisions
Data Quality
Is the data quality usable?
Consider the dataset: files/sample_height.csv
Check for missing values
isna() ([Link]) combined with .any() ([Link]): check for any missing
values - returns True for each column that has missing values
data.isna().any()
Visualize data
Notice: you need to know something about the data
We know that it is heights of humans in centimeters
This could be checked with a histogram
In [14]: data = pd.read_csv('files/sample_height.csv')
In [15]: data.head()
Out[15]: height
0 129.150282
1 163.277930
2 173.965641
3 168.933825
4 171.075462
In [17]: data.isna().any()
Out[17]: height False
dtype: bool
In [18]: data.plot.hist()
Out[18]: <AxesSubplot:ylabel='Frequency'>
In [19]: data[data['height'] < 50]
Out[19]: height
17 1.913196
22 1.629159
23 1.753424
27 1.854795
50 1.914587
60 1.642295
73 1.804588
82 1.573621
91 1.550227
94 1.660700
97 1.675962
98 1.712382
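The two checks above can be sketched on a tiny made-up dataset. The values are hypothetical (most heights in centimeters, two entered in meters); the 50 cm threshold is the same plausibility bound used in the cell above.

```python
import pandas as pd

# Hypothetical heights: most in centimeters, two entered in meters by mistake
data = pd.DataFrame({'height': [171.2, 1.68, 165.0, 180.4, 1.75, 158.9]})

# No value is missing, so this check alone does not reveal a problem
print(data.isna().any())

# Knowing the data is human height in cm, values below 50 are suspect
suspect = data[data['height'] < 50]
print(suspect)
```

The missing-value check passes, yet the filter still finds bad rows, which is why you need to know something about the data.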
Identifying outliers
Consider the dataset: files/sample_age.csv
Visualize with a histogram
This gives fast insights
Describe the data
describe() ([Link])
Computes simple summary statistics of the DataFrame
data.describe()
In [20]: data = pd.read_csv('files/sample_age.csv')
In [21]: data.head()
Out[21]: age
0 30.175921
1 32.002551
2 44.518393
3 56.247751
4 33.111986
In [22]: data.describe()
Out[22]: age
count 100.000000
mean 42.305997
std 29.229478
min 18.273781
25% 31.871113
50% 39.376896
75% 47.779303
max 314.000000
In [23]: data.plot.hist()
Out[23]: <AxesSubplot:ylabel='Frequency'>
In [24]: data[data['age'] > 150]
Out[24]: age
31 314.0
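A small sketch of how describe() and a filter expose a single outlier. The ages are synthetic (an assumption, not files/sample_age.csv), with one impossible value appended on purpose; the 150 threshold mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic ages plus one impossible value (assumption: not the course data)
rng = np.random.default_rng(1)
ages = rng.normal(40, 10, size=99).clip(18, 90)
data = pd.DataFrame({'age': np.append(ages, 314.0)})

# The max row of describe() already looks wrong for a human age
stats = data.describe()
print(stats.loc[['mean', '50%', 'max']])

# Filter on a plausibility threshold to locate the offending row
outliers = data[data['age'] > 150]
print(outliers)
```

A single extreme value also pulls the mean and standard deviation upward, so it is worth finding before computing further statistics.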
Data Exploration
Data Visualization
Absorb information quickly
Improve insights
Make faster decisions
World Bank
The World Bank ([Link]) is a great source of datasets
CO2 per capita
Let's explore this dataset: [Link]
Already available here: files/WorldBank-ATM.CO2E.PC_DS2.csv
Explore typical Data Visualizations
Simple plot
Set title
Set labels
Adjust axis
Read the data
In [26]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data.head()
Out[26]: ABW AFE AFG AFW AGO ALB AND ARB ARE
Year
1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037
1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136
1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542
1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833
1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815
5 rows × 266 columns
Simple plot
.plot() Creates a simple plot of data
This gives you an idea of the data
In [27]: data['USA'].plot()
Out[27]: <AxesSubplot:xlabel='Year'>
Adding title and labels
Arguments
title='Title' adds the title
xlabel='X label' adds or changes the X-label
ylabel='Y label' adds or changes the Y-label
In [20]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')
Out[20]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>
Adding axis range
xlim=(min, max) or xlim=min Sets the x-axis range
ylim=(min, max) or ylim=min Sets the y-axis range
In [21]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)
Out[21]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>
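The title, label, and axis-range arguments can be tried on a small series without the World Bank file. The numbers below are made up for illustration, and the name series is hypothetical.

```python
import pandas as pd

# Made-up values for illustration (assumption: not World Bank data)
series = pd.Series([16.0, 18.5, 20.1, 19.3, 16.2],
                   index=pd.Index([1960, 1975, 1990, 2005, 2017], name='Year'))

# title and ylabel annotate the plot; ylim=0 fixes only the lower bound
ax = series.plot(title='CO2 per capita (made-up values)',
                 ylabel='CO2 per capita', ylim=0)

print(ax.get_ylim())  # lower bound is 0, upper bound is chosen automatically
```

Starting the y-axis at 0 avoids exaggerating small changes, which matters when the plot is used to compare levels.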
Comparing data
Explore USA and WLD
In [25]: data[['USA', 'WLD']].plot(ylim=0)
Out[25]: <AxesSubplot:xlabel='Year'>
Set the figure size
figsize=(width, height) in inches
In [27]: data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))
Out[27]: <AxesSubplot:xlabel='Year'>
Bar plot
.plot.bar() Create a bar plot
In [28]: data['USA'].plot.bar(figsize=(20,6))
Out[28]: <AxesSubplot:xlabel='Year'>
In [29]: data[['USA', 'WLD']].plot.bar(figsize=(20,6))
Out[29]: <AxesSubplot:xlabel='Year'>
Plot a range of data
.loc[from:to] Apply this to the DataFrame to select a range of rows by index label (both ends inclusive)
In [30]: data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))
Out[30]: <AxesSubplot:xlabel='Year'>
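A short sketch of label-based slicing with .loc, using a small made-up table (the values are not World Bank data). It shows that both ends of the range are included and that an open end runs to the last row.

```python
import pandas as pd

# Made-up values indexed by year (assumption: not World Bank data)
data = pd.DataFrame({'USA': [20.2, 19.6, 17.4, 16.3],
                     'WLD': [4.0, 4.4, 4.6, 4.5]},
                    index=pd.Index([1995, 2000, 2005, 2010], name='Year'))

recent = data.loc[2000:]       # from label 2000 to the end
window = data.loc[2000:2005]   # both 2000 and 2005 are included

print(list(recent.index))
print(list(window.index))
```

This differs from ordinary Python slicing, where the end position is excluded.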
Histogram
.plot.hist() Create a histogram
bins=<number of bins> Specify the number of bins in the histogram.
In [34]: data['USA'].plot.hist(figsize=(20,6), bins=7)
Out[34]: <AxesSubplot:ylabel='Frequency'>
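To see what bins does, here is a sketch on synthetic values (an assumption, not the USA column). Each bar of the histogram is a rectangle on the returned axes, so the bars can be counted and their heights summed.

```python
import numpy as np
import pandas as pd

# Synthetic values for illustration (assumption: not World Bank data)
rng = np.random.default_rng(2)
values = pd.Series(rng.normal(18, 2, size=60))

# bins=7 splits the value range into 7 equal-width intervals
ax = values.plot.hist(bins=7)

# One rectangle per bin; the heights are the counts per bin
counts = [patch.get_height() for patch in ax.patches]
print(len(counts), sum(counts))
```

Fewer bins give a smoother picture, more bins show more detail but also more noise.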
Pie chart
.plot.pie() Creates a pie chart
In [35]: df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
Out[35]: <AxesSubplot:ylabel='None'>
Value counts and pie charts
A simple chart of values above/below a threshold
.value_counts() Counts occurrences of values in a Series (or DataFrame column)
A few arguments to .plot.pie()
colors=<list of colors>
labels=<list of labels>
title='<title>'
ylabel='<label>'
autopct='%1.1f%%' sets percentages on chart
In [43]: (data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5',
Out[43]: <AxesSubplot:title={'center':'CO2 per capita'}, ylabel='USA'>
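One detail worth checking when labeling such a chart: value_counts() sorts by frequency, so the order of True and False depends on the data. The sketch below, on made-up values, derives the labels from the index instead of hard-coding their order. The series usa and the label texts are illustrative assumptions.

```python
import pandas as pd

# Made-up values for illustration (assumption: not World Bank data)
usa = pd.Series([19.8, 20.1, 18.9, 17.0, 16.2, 19.5], name='USA')

# Boolean Series -> counts of True/False, most frequent first
below = (usa < 17.5).value_counts()

# Build labels from the index so they always match the wedges
labels = ['< 17.5' if flag else '>= 17.5' for flag in below.index]

ax = below.plot.pie(labels=labels, colors=['r', 'g'], autopct='%1.1f%%',
                    title='CO2 per capita (made-up values)')
```

Here four of the six values are at or above the threshold, so the '>= 17.5' wedge comes first.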
Scatter plot
Assume we want to investigate if GDP per capita and CO2 per capita are correlated
Data available in 'files/co2_gdp_per_capita.csv'
.plot.scatter(x=<label>, y=<label>) Create a scatter plot
.corr() Compute pairwise correlation of columns (docs: [Link])
In [44]: data = pd.read_csv('files/co2_gdp_per_capita.csv', index_col=0)
data.head()
Out[44]: CO2 per capita GDP per capita
AFE 0.933541 1507.861055
AFG 0.200151 568.827927
AFW 0.515544 1834.366604
AGO 0.887380 3595.106667
ALB 1.939732 4433.741739
In [46]: data.plot.scatter(x='CO2 per capita', y='GDP per capita')
Out[46]: <AxesSubplot:xlabel='CO2 per capita', ylabel='GDP per capita'>
In [47]: data.corr()
Out[47]: CO2 per capita GDP per capita
CO2 per capita 1.000000 0.633178
GDP per capita 0.633178 1.000000
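A sketch of what .corr() returns, on synthetic data (an assumption, not files/co2_gdp_per_capita.csv). By default it computes the Pearson correlation coefficient, which is checked here against NumPy's np.corrcoef.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in linear relation plus noise (assumption)
rng = np.random.default_rng(3)
gdp = rng.uniform(500, 50000, size=50)
co2 = 0.0003 * gdp + rng.normal(0, 2, size=50)
data = pd.DataFrame({'CO2 per capita': co2, 'GDP per capita': gdp})

# Symmetric matrix with 1.0 on the diagonal
corr = data.corr()
r = corr.loc['CO2 per capita', 'GDP per capita']

# Same number computed directly with NumPy
manual = np.corrcoef(co2, gdp)[0, 1]
print(r, manual)
```

A value near 1 indicates a strong positive linear relation, near 0 no linear relation. Correlation alone does not say which variable drives the other.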
Data Presentation
This is about making data easy to digest
The message
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world
Preparation
Let's take 2017 (as more recent data is incomplete)
What are the mean, max, and min CO2 per capita in the world?
In [54]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
In [55]: data.head()
Out[55]: ABW AFE AFG AFW AGO ALB AND ARB ARE
Year
1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037
1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136
1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542
1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833
1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815
5 rows × 266 columns
In [56]: year = 2017
data.loc[year].describe()
Out[56]: count 239.000000
mean 4.154185
std 4.575980
min 0.028010
25% 0.851900
50% 2.667119
75% 6.158644
max 32.179371
Name: 2017, dtype: float64
And in the US?
In [57]: data.loc[year]['USA']
Out[57]: 14.8058824221278
How can we tell a story?
US is above the mean
US is not the max
US is above the 75th percentile (more than 75% of the entries are lower)
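These three statements can be checked in code before building the chart. The sketch below uses made-up per-country values for a single year; the codes AAA to EEE and all numbers are hypothetical.

```python
import pandas as pd

# Made-up per-country values for one year (assumption: not World Bank data)
year_data = pd.Series({'AAA': 0.5, 'BBB': 2.0, 'CCC': 4.0,
                       'DDD': 7.0, 'USA': 15.0, 'EEE': 30.0})

stats = year_data.describe()
usa = year_data['USA']

above_mean = usa > stats['mean']   # US is above the mean
below_max = usa < stats['max']     # US is not the max
above_q3 = usa > stats['75%']      # US is above the 75th percentile

# Share of entries with a lower value than the US
share_below = (year_data < usa).mean()
print(above_mean, below_max, above_q3, share_below)
```

Turning each claim into a comparison keeps the story honest: the chart should only say what the numbers support.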
Some more advanced matplotlib
In [58]: ax = data.loc[year].plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')
ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"))
Out[58]: Text(15, 30, 'USA')
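A self-contained sketch of the same annotation technique on synthetic data (an assumption, not the 2017 World Bank row). Instead of hard-coded coordinates, the arrow position is derived from a hypothetical usa_value and the current y-axis range, so the label stays inside the plot if the data changes.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values standing in for per-country emissions (assumption)
rng = np.random.default_rng(4)
values = pd.Series(rng.exponential(4.0, size=200))
usa_value = 15.0  # hypothetical value to highlight

ax = values.plot.hist(bins=15, facecolor='green')
ax.set_xlabel('CO2 per capita')
ax.set_ylabel('Number of countries')

# xy is the arrow tip, xytext is where the label is placed
top = ax.get_ylim()[1]
note = ax.annotate('USA', xy=(usa_value, 0), xytext=(usa_value, 0.5 * top),
                   arrowprops=dict(arrowstyle='->', connectionstyle='arc3'))
```

The arrow points from the text down to the x-axis at the highlighted value, which makes the position of one country in the distribution easy to read.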
Creative storytelling with data visualization
Check out this video: [Link]
In [60]: from IPython.display import YouTubeVideo
YouTubeVideo('jbkSRLYSojo')