01 - Lesson - Visualization - Jupyter Notebook

Learning Data Science: Visualization

Uploaded by

almamalik

Data Visualization

Key skill today

“The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it - that’s going to be a hugely important skill
in the next decades.”

Hal Varian (Google’s Chief Economist)

Data Visualization for a Data Scientist

1. Data Quality: Explore data quality including identifying outliers
2. Data Exploration: Understand the data by visualizing it
3. Data Presentation: Present results

The power of Data Visualization

Consider the following data.

What is the connection between x and y?
Do you see any patterns?

In [2]: import pandas as pd

In [8]: sample = pd.read_csv('files/sample_corr.csv')

In [9]: sample

Out[9]:
           x         y
0   1.105722  1.320945
1   1.158193  1.480131
2   1.068022  1.173479
3   1.131291  1.294706
4   1.125997  1.293024
5   1.037332  0.977393
6   1.051670  1.040798
7   0.971699  0.977604
8   1.102914  1.127956
9   1.164161  1.431070
10  1.161464  1.344481
11  1.080161  1.191159
12  0.996044  0.997308
13  1.143305  1.412850
14  1.062949  1.139761
15  1.149252  1.455886
16  1.190105  1.489407
17  1.026498  1.153031
18  1.110015  1.329586
19  1.077741  1.277995

In [ ]:

Visualizing the same data

Let's try to visualize the data.

Matplotlib is an easy-to-use visualization library for Python.

In notebooks you get started with:

import matplotlib.pyplot as plt
%matplotlib inline

In [12]: import matplotlib.pyplot as plt
         %matplotlib inline

In [13]: sample.plot.scatter(x='x', y='y')

Out[13]: <AxesSubplot:xlabel='x', ylabel='y'>

In [ ]:

What Data Visualization gives

Absorb information quickly
Improve insights
Make faster decisions

Data Quality

Is the data quality good enough to use?

Consider the dataset: files/sample_height.csv

Check for missing values

.isna().any() checks for missing values: it returns True for every column that contains at least one missing value

data.isna().any()

Visualize data
Notice: you need to know something about the data
We know that these are human heights in centimeters
This could be checked with a histogram
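
As a minimal sketch of the missing-value check described above, the following uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from files/sample_height.csv). It shows what `.isna().any()` returns when one column has a gap, and how `.isna().sum()` counts the gaps per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one height value is missing (NaN).
toy = pd.DataFrame({"height": [171.2, np.nan, 165.0], "age": [31, 45, 28]})

# One boolean per column: True if the column contains any missing value.
has_missing = toy.isna().any()
print(has_missing)

# Number of missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

A column that reports False here can still contain wrong values, which is why the histogram check is a separate step.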

In [14]: data = pd.read_csv('files/sample_height.csv')

In [15]: data.head()

Out[15]:
       height
0  129.150282
1  163.277930
2  173.965641
3  168.933825
4  171.075462

In [17]: data.isna().any()

Out[17]: height    False
         dtype: bool

In [18]: data.plot.hist()

Out[18]: <AxesSubplot:ylabel='Frequency'>

In [19]: data[data['height'] < 50]

Out[19]:
      height
17  1.913196
22  1.629159
23  1.753424
27  1.854795
50  1.914587
60  1.642295
73  1.804588
82  1.573621
91  1.550227
94  1.660700
97  1.675962
98  1.712382
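
The rows returned above all lie between roughly 1.5 and 1.9. One plausible reading, which is an assumption and not something the notebook states, is that these heights were entered in meters instead of centimeters. Under that assumption, a repair could be sketched as follows; `toy`, its values, and the cutoff of 3 are hypothetical.

```python
import pandas as pd

# Hypothetical toy data mixing centimeters and (assumed) meters.
toy = pd.DataFrame({"height": [163.3, 1.91, 174.0, 1.63]})

# Assumption: any value below 3 was recorded in meters.
in_meters = toy["height"] < 3

# Work on a copy so the raw data stays available for inspection.
fixed = toy.copy()
fixed.loc[in_meters, "height"] = fixed.loc[in_meters, "height"] * 100

print(fixed)
```

Whether to convert such rows or drop them depends on how confident you are about their origin.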

In [ ]:

Identifying outliers
Consider the dataset: files/sample_age.csv

Visualize with a histogram

This gives fast insights

Describe the data

.describe() computes summary statistics of the DataFrame

data.describe()

In [20]: data = pd.read_csv('files/sample_age.csv')

In [21]: data.head()

Out[21]:
         age
0  30.175921
1  32.002551
2  44.518393
3  56.247751
4  33.111986

In [22]: data.describe()

Out[22]:
              age
count  100.000000
mean    42.305997
std     29.229478
min     18.273781
25%     31.871113
50%     39.376896
75%     47.779303
max    314.000000

In [23]: data.plot.hist()

Out[23]: <AxesSubplot:ylabel='Frequency'>

In [24]: data[data['age'] > 150]

Out[24]: age

31 314.0
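
To illustrate why a single outlier like this matters, here is a small sketch comparing summary statistics with and without it. The Series below is a toy sample that borrows a few of the values shown above; it is not the full dataset, and the threshold of 150 is simply the one used in the cell above.

```python
import pandas as pd

# Toy sample: five plausible ages plus the implausible value 314.
ages = pd.Series([30.2, 32.0, 44.5, 56.2, 33.1, 314.0], name="age")

# Statistics including the outlier.
stats_all = ages.describe()

# Keep only values at or below the threshold used above.
plausible = ages[ages <= 150]
stats_clean = plausible.describe()

print(stats_all[["mean", "std", "max"]])
print(stats_clean[["mean", "std", "max"]])
```

The mean and standard deviation shift noticeably once the outlier is excluded, while the median (the 50% row) moves far less.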

Data Exploration

Data Visualization
Absorb information quickly
Improve insights
Make faster decisions

World Bank
The World Bank is a great source of datasets

CO2 per capita

Let's explore this dataset.
It is already available here: files/WorldBank-ATM.CO2E.PC_DS2.csv

Explore typical Data Visualizations

Simple plot
Set title
Set labels
Adjust axis

Read the data

In [26]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
         data.head()

Out[26]:
             ABW       AFE       AFG       AFW       AGO       ALB  AND       ARB       ARE
Year
1960  204.631696  0.906060  0.046057  0.090880  0.100835  1.258195  NaN  0.609268  0.119037
1961  208.837879  0.922474  0.053589  0.095283  0.082204  1.374186  NaN  0.662618  0.109136
1962  226.081890  0.930816  0.073721  0.096612  0.210533  1.439956  NaN  0.727117  0.163542
1963  214.785217  0.940570  0.074161  0.112376  0.202739  1.181681  NaN  0.853116  0.175833
1964  207.626699  0.996033  0.086174  0.133258  0.213562  1.111742  NaN  0.972381  0.132815

5 rows × 266 columns

In [ ]:

Simple plot

.plot() Creates a simple plot of data

This gives you an idea of the data

In [27]: data['USA'].plot()

Out[27]: <AxesSubplot:xlabel='Year'>

In [ ]:

Adding title and labels

Arguments

title='Title' adds the title

xlabel='X label' adds or changes the X-label
ylabel='Y label' adds or changes the Y-label

In [20]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita')

Out[20]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>

In [ ]:

Adding axis range

xlim=(min, max) or xlim=min sets the x-axis range (a single value sets only the lower bound)

ylim=(min, max) or ylim=min sets the y-axis range (a single value sets only the lower bound)

In [21]: data['USA'].plot(title='CO2 per capita in USA', ylabel='CO2 per capita', ylim=0)

Out[21]: <AxesSubplot:title={'center':'CO2 per capita in USA'}, xlabel='Year', ylabel='CO2 per capita'>
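
The cell above only uses the single-value form `ylim=0`. As a small sketch of the tuple form, the following sets both ends of both axes on a made-up Series (`toy` and its values are hypothetical). The `matplotlib.use("Agg")` line is only there so the sketch also runs outside a notebook; with `%matplotlib inline` it is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs as a plain script
import pandas as pd

# Hypothetical toy series indexed by year.
toy = pd.Series([19.3, 20.1, 19.0, 17.4, 15.2],
                index=[1990, 1995, 2000, 2005, 2010], name="toy")

# Tuple form: (min, max) for each axis.
ax = toy.plot(xlim=(1995, 2005), ylim=(0, 25))

# The returned Axes object reports the limits that were applied.
print(ax.get_xlim(), ax.get_ylim())
```

Starting the y-axis at 0, as the notebook does, avoids exaggerating small changes in the plotted values.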

In [ ]:

Comparing data
Explore USA and WLD (the world aggregate)

In [25]: data[['USA', 'WLD']].plot(ylim=0)

Out[25]: <AxesSubplot:xlabel='Year'>

In [ ]:

Set the figure size

figsize=(width, height) in inches

In [27]: data[['USA', 'DNK', 'WLD']].plot(ylim=0, figsize=(20,6))

Out[27]: <AxesSubplot:xlabel='Year'>

In [ ]:

Bar plot
.plot.bar() Creates a bar plot

In [28]: data['USA'].plot.bar(figsize=(20,6))

Out[28]: <AxesSubplot:xlabel='Year'>

In [29]: data[['USA', 'WLD']].plot.bar(figsize=(20,6))

Out[29]: <AxesSubplot:xlabel='Year'>

Plot a range of data

.loc[from:to] selects a range of rows from the DataFrame by index label (both ends inclusive)

In [30]: data[['USA', 'WLD']].loc[2000:].plot.bar(figsize=(20,6))

Out[30]: <AxesSubplot:xlabel='Year'>
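
Because `.loc` slices by label and includes both ends, it behaves differently from ordinary Python slicing. A short sketch on a made-up DataFrame (the values are hypothetical) makes the difference visible, with `.iloc` shown for contrast:

```python
import pandas as pd

# Hypothetical toy data with years as the index.
toy = pd.DataFrame(
    {"USA": [20.2, 19.6, 19.7, 19.5], "WLD": [4.0, 4.0, 4.1, 4.3]},
    index=pd.Index([2000, 2001, 2002, 2003], name="Year"),
)

# Label-based slice: both 2001 and 2002 are included.
both_ends = toy.loc[2001:2002]

# Open-ended slice: everything from 2002 onward.
open_ended = toy.loc[2002:]

# Position-based slice for contrast: the stop position is excluded.
positional = toy.iloc[1:2]

print(both_ends)
print(open_ended)
print(positional)
```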

In [ ]:

Histogram
.plot.hist() Creates a histogram
bins=<number of bins> Specifies the number of bins in the histogram.

In [34]: data['USA'].plot.hist(figsize=(20,6), bins=7)

Out[34]: <AxesSubplot:ylabel='Frequency'>

In [ ]:

Pie chart
.plot.pie() Creates a pie chart

In [35]: df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
         df.plot.pie()

Out[35]: <AxesSubplot:ylabel='None'>

In [ ]:

Value counts and pie charts

A simple chart of values above/below a threshold
.value_counts() counts occurrences of values in a Series (or DataFrame column)
A few arguments to .plot.pie()
colors=<list of colors>
labels=<list of labels>
title='<title>'
ylabel='<label>'
autopct='%1.1f%%' sets percentages on chart

In [43]: (data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>=17.5',

Out[43]: <AxesSubplot:title={'center':'CO2 per capita'}, ylabel='USA'>
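
One thing to watch with this pattern is that `.value_counts()` orders its result by count, so the order of True and False (and therefore which label lands on which slice) depends on the data. The sketch below, on a made-up Series, fixes the order with `.reindex()` before plotting. All names, values, and label texts here are hypothetical and are not a reconstruction of the cell above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs as a plain script
import pandas as pd

# Hypothetical toy series of yearly values.
toy = pd.Series([19.3, 20.1, 17.4, 16.2, 15.0],
                index=[1990, 2000, 2005, 2010, 2015], name="toy")

# True where the value is below the threshold.
below = toy < 17.5

# Force a fixed order (True first, then False) so labels match slices.
counts = below.value_counts().reindex([True, False], fill_value=0)

ax = counts.plot.pie(
    colors=["g", "r"],
    labels=["< 17.5", ">= 17.5"],
    title="Years below 17.5 (toy data)",
    autopct="%1.1f%%",
)
```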

In [ ]:

Scatter plot
Assume we want to investigate if GDP per capita and CO2 per capita are correlated
Data available in 'files/co2_gdp_per_capita.csv'
.plot.scatter(x=<label>, y=<label>) Creates a scatter plot
.corr() Computes pairwise correlation of columns

In [44]: data = pd.read_csv('files/co2_gdp_per_capita.csv', index_col=0)

         data.head()

Out[44]:
     CO2 per capita  GDP per capita
AFE        0.933541     1507.861055
AFG        0.200151      568.827927
AFW        0.515544     1834.366604
AGO        0.887380     3595.106667
ALB        1.939732     4433.741739

In [46]: data.plot.scatter(x='CO2 per capita', y='GDP per capita')

Out[46]: <AxesSubplot:xlabel='CO2 per capita', ylabel='GDP per capita'>

In [47]: data.corr()

Out[47]:
                CO2 per capita  GDP per capita
CO2 per capita        1.000000        0.633178
GDP per capita        0.633178        1.000000
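
To help read a value such as 0.633, it can be useful to see the two extremes. The following sketch uses made-up columns (`x`, `up`, and `down` are hypothetical) that are perfectly linearly related, so the Pearson correlation that `.corr()` computes by default comes out at +1 and -1.

```python
import pandas as pd

# Hypothetical toy data: "up" rises with x, "down" falls with x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "up": [2.0, 4.0, 6.0, 8.0],
    "down": [8.0, 6.0, 4.0, 2.0],
})

# Pairwise correlation matrix (Pearson by default).
corr = toy.corr()
print(corr)

# Correlation between two individual columns.
single = toy["x"].corr(toy["up"])
print(single)
```

A value between these extremes, like the 0.633 above, indicates a positive but far from perfect linear relationship.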

In [ ]:

Data Presentation
This is about making data easy to digest

The message
Assume we want to give a picture of how US CO2 per capita compares to the rest of the world

Preparation

Let's take 2017 (as more recent data is incomplete)

What is the mean, max, and min CO2 per capita in the world?

In [54]: data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)

In [55]: data.head()

Out[55]: ABW AFE AFG AFW AGO ALB AND ARB ARE

Year

1960 204.631696 0.906060 0.046057 0.090880 0.100835 1.258195 NaN 0.609268 0.119037 2.38

1961 208.837879 0.922474 0.053589 0.095283 0.082204 1.374186 NaN 0.662618 0.109136 2.45

1962 226.081890 0.930816 0.073721 0.096612 0.210533 1.439956 NaN 0.727117 0.163542 2.53

1963 214.785217 0.940570 0.074161 0.112376 0.202739 1.181681 NaN 0.853116 0.175833 2.33

1964 207.626699 0.996033 0.086174 0.133258 0.213562 1.111742 NaN 0.972381 0.132815 2.55

5 rows × 266 columns

In [ ]:

In [56]: year = 2017  # the year chosen above
         data.loc[year].describe()

Out[56]: count    239.000000
         mean       4.154185
         std        4.575980
         min        0.028010
         25%        0.851900
         50%        2.667119
         75%        6.158644
         max       32.179371
         Name: 2017, dtype: float64

In [ ]:

And in the US?

In [57]: data.loc[year]['USA']

Out[57]: 14.8058824221278

In [ ]:

How can we tell a story?

US is above the mean

US is not the max
It is above the 75th percentile
Some more advanced matplotlib

In [58]: ax = data.loc[year].plot.hist(bins=15, facecolor='green')
         ax.set_xlabel('CO2 per capita')
         ax.set_ylabel('Number of countries')
         ax.annotate("USA", xy=(15, 5), xytext=(15, 30),
                     arrowprops=dict(arrowstyle="->",
                                     connectionstyle="arc3"))

Out[58]: Text(15, 30, 'USA')
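
The three statements in the story (above the mean, not the max, above the 75th percentile) can also be checked in code rather than read off the `describe()` output. A minimal sketch on made-up data follows; the three-letter codes and values are hypothetical placeholders, not World Bank figures.

```python
import pandas as pd

# Hypothetical toy values for six made-up entities.
toy = pd.Series({"AAA": 0.5, "BBB": 1.2, "CCC": 2.7,
                 "DDD": 6.0, "EEE": 14.8, "FFF": 32.0})

value = toy["EEE"]  # the entity we want to place in the distribution

above_mean = value > toy.mean()
above_q3 = value > toy.quantile(0.75)  # 75th percentile
is_max = value == toy.max()

# Share of entities with a strictly lower value.
share_below = (toy < value).mean()

print(above_mean, above_q3, is_max, share_below)
```

A figure like `share_below` can be a more direct way to phrase the message than quoting quartiles.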

Creative story telling with data visualization

Check out this video:

In [60]: from IPython.display import YouTubeVideo
         YouTubeVideo('jbkSRLYSojo')

Out[60]:

In [ ]:
