Data Exploration Lab
The document provides a detailed guide on installing data analysis and visualization tools such as Python and R, including step-by-step instructions for setting up Anaconda and R on Windows. It also covers practical exercises for data exploration and visualization using libraries like Pandas, Numpy, and Matplotlib, with examples of creating arrays, data frames, and various plots. The document emphasizes the importance of these tools for analyzing datasets, including email data, and visualizing insights.
PRACTICAL EXERCISES
Example No. 1: Install the data analysis and visualization tool: R / Python / Tableau Public / Power BI.
Installing Python using Anaconda
Python is a popular language for scientific computing, and great for general-purpose programming as well. Installing all of the scientific packages we use in the lesson individually can be a bit cumbersome, and we therefore recommend the all-in-one installer Anaconda.
Windows
• Open https://www.anaconda.com/products/individual in your web browser.
• Download the Anaconda Python 3 installer for Windows.
• Double-click the executable and install Python 3 using the recommended settings. Make sure that the "Register Anaconda as my default Python 3.x" option is checked; it should be in the latest version of Anaconda.
• Verify the installation: click Start, then search for and select Anaconda Prompt from the menu. A window should pop up where you can now type commands, such as checking your Conda installation with:
conda --help
2. Required Python Packages
The following are packages needed for this workshop:
Pandas
Jupyter notebook
Numpy
Matplotlib
Plotnine
All packages apart from plotnine will have automatically been installed with
Anaconda and we can use Anaconda as a package manager to install the missing plotnine.
Command to install
conda install -y -c conda-forge plotnine
This will then install the latest version of plotnine into your conda environment.
Data Exploration and Visualization
To install the packages with Miniconda:
conda install -y numpy pandas matplotlib jupyter
conda install -c conda-forge plotnine
Activate the new environment with:
conda activate python-ecology-lesson
You can deactivate the environment with:
conda deactivate
Launch a Jupyter notebook
After installing either Anaconda or Miniconda and the workshop packages, launch
a Jupyter notebook by typing this command from the terminal:
jupyter notebook
The notebook should open automatically in your browser. If it does not or you wish
to use a different browser, open this link: http://localhost:8888.
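Before starting the exercises, it can help to confirm from Python that the packages listed above are visible in the active environment. The following is a minimal sketch using only the standard library; the helper name check_packages is hypothetical, and "notebook" is assumed to be the import name of the Jupyter notebook package.

```python
# Report which of the workshop packages can be imported in the current environment.
from importlib.util import find_spec

# Import names of the packages listed above ("notebook" is the import name
# assumed here for Jupyter notebook).
REQUIRED = ["pandas", "notebook", "numpy", "matplotlib", "plotnine"]

def check_packages(names):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

Any package reported as MISSING can be installed with the conda commands shown earlier.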
Installing R on Windows OS
To install R on Windows OS:
Go to the CRAN website.
Click on “Download R for Windows”.
Click on “install R for the first time” link to download the R executable (exe)
file.
Run the R executable file to start installation, and allow the app to make changes
to your device.
Select the installation language.
[Screenshot: "Select Setup Language" dialog for choosing the language to use during the installation]
Follow the installation instructions.
[Screenshot: "Setup - R for Windows 4.1.2" information page showing the GNU General Public License, Version 2]
[Screenshot: "Completing the R for Windows 4.1.2 Setup Wizard" page]
Click Finish to exit Setup.
R has now been successfully installed on your Windows OS. Open the R GUI to start writing R code.
Example No. 2: Perform exploratory data analysis (EDA) with datasets like an email data set. Export all your emails as a dataset, import them inside a Pandas data frame, visualize them and get different insights from the data.
The CData Python Connector for Email enables you to use pandas and other modules to analyze and visualize live Email data in Python.
Step 1:
Download the Email dataset from
https://www.kaggle.com/code/jaykrishna/topic-modeling-enron-email-dataset/data
Step 2:
Import the needed packages
import os, sys, email, re
import numpy as np
import pandas as pd
Step 3:
# Read the data into a DataFrame
emails_df = pd.read_csv('../input/emails.csv')
print(emails_df.shape)
emails_df.head()
Output
   file                       message
0  allen-p/_sent_mail/1.      Message-ID: <18782981.1075855378110.JavaMail.e...
1  allen-p/_sent_mail/10.     Message-ID: <15464986.1075855378456.JavaMail.e...
2  allen-p/_sent_mail/100.    Message-ID: <24216240.1075855687451.JavaMail...
3  allen-p/_sent_mail/1000.   Message-ID: <13505866.1075863688222.JavaMail...
4  allen-p/_sent_mail/1001.   Message-ID: <30922949.1075863688243.JavaMail.e...
Step 4
## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs

# Parse the emails into a list of email objects
messages = list(map(email.message_from_string, emails_df['message']))
emails_df.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails_df[key] = [doc[key] for doc in messages]
# Parse content from emails
emails_df['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails_df['From'] = emails_df['From'].map(split_email_addresses)
emails_df['To'] = emails_df['To'].map(split_email_addresses)
# Extract the root of 'file' as 'user'
emails_df['user'] = emails_df['file'].map(lambda x: x.split('/')[0])
del messages
emails_df.head()
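To see what the helper functions in Step 4 do to a single record, the sketch below parses one message with the standard-library email module. The message text is made up for illustration and is not taken from the dataset.

```python
import email

# A made-up message used only for illustration (not taken from the dataset).
raw = (
    "Message-ID: <1.example@local>\n"
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hello there.\n"
)

msg = email.message_from_string(raw)

# Header fields are available by name, like dictionary keys.
print(msg.keys())
print(msg["From"])

# Split the comma-separated recipients the same way split_email_addresses() does.
recipients = frozenset(addr.strip() for addr in msg["To"].split(","))
print(sorted(recipients))

# walk() visits every part of the message; a plain message has a single part.
body = "".join(
    part.get_payload() for part in msg.walk()
    if part.get_content_type() == "text/plain"
)
print(body)
```

Each header name returned by keys() becomes one column of emails_df in the loop of Step 4.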
Example No. 3: Working with Numpy arrays, Pandas data frames, basic plots using Matplotlib.
Creating a Numpy Array
import numpy as np
# Creating a single-dimensional array
a = np.array([1,2,3]) # Calling the array function
print(a)
[1 2 3]
# Creating a multi-dimensional array
# Each set of elements within a square bracket indicates a row
# Array of two rows and two columns
b = np.array([[1,2], [3,4]])
print(b)
[[1 2]
 [3 4]]
# Creating an ndarray by wrapping a list
list1 = [1,2,3,4,5] # Creating a list
arr = np.array(list1) # Wrapping the list
print(arr)
[1 2 3 4 5]
# Creating an array of numbers of a specified range
arr1 = np.arange(10, 100) # Array of numbers from 10 up to and excluding 100
print(arr1)
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
# Creating a 5x5 array of zeroes
arr2 = np.zeros((5,5))
print(arr2)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
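Once an array exists, its layout can be inspected and changed. The short sketch below is an addition to the exercise: it shows the shape, ndim and size attributes, reshape(), and [row, column] indexing on small arrays like the ones created above. The variable name grid is chosen only for illustration.

```python
import numpy as np

b = np.array([[1, 2], [3, 4]])
print(b.shape)   # (rows, columns)
print(b.ndim)    # number of dimensions
print(b.size)    # total number of elements

# reshape() returns the same values arranged in a new shape.
grid = np.arange(10, 22).reshape(3, 4)
print(grid)

# Indexing is [row, column]; slices select ranges of rows and columns.
print(grid[0, 1])     # element in the first row, second column
print(grid[:, 0])     # first column
```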
Pandas is an open-source Python library providing efficient, easy-to-use data structures and data analysis tools. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data. Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns.
• Ordered and unordered time series data.
• Arbitrary matrix data with row and column labels.
• Any other form of observational/statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.
To import the library
import pandas as pd
series = pd.Series() # The Series() function creates a new Series
print(series)
Series([], dtype: float64)
# Creating a series from an ndarray
# Note that indexes are assigned automatically if not specified
arr = np.array([10,20,30,40,50])
series1 = pd.Series(arr)
print(series1)
0    10
1    20
2    30
3    40
4    50
dtype: int64
# Creating a series from a Python dict
# Note that the keys of the dictionary are used to assign indexes during conversion
data = {'a':10, 'b':20, 'c':30}
series2 = pd.Series(data)
print(series2)
a    10
b    20
c    30
dtype: int64
# Retrieving a part of the series using slicing
print(series1[1:4])
1    20
2    30
3    40
dtype: int64
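Besides slicing, a Series can be read by label, by position, or with a condition. The sketch below is an added illustration on a small made-up Series; loc, iloc, boolean filtering and mean() are standard pandas features.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.loc["b"])      # label-based access
print(s.iloc[0])       # position-based access
print(s[s > 15])       # boolean filtering keeps the matching labels
print(s.mean())        # summary statistics work directly on a Series
```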
DataFrames
1. A DataFrame is a 2D data structure in which data is aligned in a tabular fashion consisting of rows & columns
2. A DataFrame can be created using the following constructor -
pandas.DataFrame(data, index, columns, dtype, copy)
3. data: Data can be of multiple data types such as ndarray, list, constants, series, dict etc.
4. index, columns: Row and column labels of the dataframe; default to np.arange(n) if no labels are passed
5. dtype: Data type of each column
6. copy: Creates a deep copy of the data, set to False as default
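The constructor parameters described above can be tried on a small array. This is a minimal sketch with made-up values and labels (r1, r2, r3, x, y are illustrative names only).

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# data: a 2D ndarray; index: row labels; columns: column labels;
# dtype: force every column to float64.
df = pd.DataFrame(values, index=["r1", "r2", "r3"],
                  columns=["x", "y"], dtype="float64")
print(df)
print(df.dtypes)

# Without index, the rows are labelled 0..n-1.
print(pd.DataFrame(values).index.tolist())
```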
Creating a DataFrame
# Converting a list into a DataFrame
list1 = [10,20,30,40]
table = pd.DataFrame(list1)
print(table)
    0
0  10
1  20
2  30
3  40
# Creating a DataFrame from a list of dictionaries
data = [{'a':1,'b':2}, {'a':2,'b':4,'c':8}]
table1 = pd.DataFrame(data)
print(table1)
# NaN (not a number) is stored in areas where no data is provided
   a  b    c
0  1  2  NaN
1  2  4  8.0
# Creating a DataFrame from a list of dictionaries and accompanying row indices
table2 = pd.DataFrame(data, index=['first','second'])
# Dict keys become column labels
print(table2)
        a  b    c
first   1  2  NaN
second  2  4  8.0
# Converting a dictionary of series to a DataFrame
data1 = {'one': pd.Series([1,2,3], index=['a','b','c']),
         'two': pd.Series([1,2,3,4], index=['a','b','c','d'])}
table3 = pd.DataFrame(data1)
print(table3)
# the resultant index is the union of all the series indexes passed
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
DataFrame - Addition & Deletion of Columns
# A new column can be added to a DataFrame when the data is passed as a Series
table3['three'] = pd.Series([10,20,30], index=['a','b','c'])
print(table3)
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
# DataFrame columns can be deleted using the del statement
del table3['one']
print(table3)
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
# DataFrame columns can be deleted using the pop() function
table3.pop('two')
print(table3)
   three
a   10.0
b   20.0
c   30.0
d    NaN
DataFrame - Addition & Deletion of Rows
# DataFrame rows can be selected by passing the row label to the loc[] indexer
print(table3.loc['c'])
three    30.0
Name: c, dtype: float64
# Row selection can also be done using the integer position with iloc[]
print(table3.iloc[2])
three    30.0
Name: c, dtype: float64
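The examples above select rows; the sketch below is an added illustration of how rows might be added and deleted. It rebuilds a small table with made-up values rather than reusing table3, and uses loc assignment, pd.concat() and drop(), which are standard pandas calls.

```python
import pandas as pd

table = pd.DataFrame({"three": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])

# Add a row by assigning to a new label with loc.
table.loc["d"] = 40.0

# Append several rows at once by concatenating another DataFrame.
extra = pd.DataFrame({"three": [50.0]}, index=["e"])
table = pd.concat([table, extra])

# drop() returns a copy without the given row labels.
table = table.drop(index=["a"])
print(table)
```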
Matplotlib
1. Matplotlib is a Python library that is specially designed for the development of graphs, charts etc., in order to provide interactive data visualisation
2. Matplotlib is inspired by the MATLAB software and reproduces many of its features
# Import Matplotlib submodule for plotting
import matplotlib.pyplot as plt
Plotting in Matplotlib
plt.plot([1,2,3,4]) # List of vertical co-ordinates of the points plotted
plt.show() # Displays plot
# Implicit X-axis values from 0 to (N-1) where N is the length of the list
[Figure: line plot of the values 1 to 4 against the implicit x-axis values 0 to 3]
# We can specify the values for both axes
x = range(5) # Sequence of values for the x-axis
# X-axis values specified - [0,1,2,3,4]
plt.plot(x, [x1**2 for x1 in x]) # vertical co-ordinates of the points plotted: y = x^2
plt.show()
Multiline Plots
# Multiple functions can be drawn on the same plot
x = range(5)
plt.plot(x, [x1 for x1 in x])
plt.plot(x, [x1*x1 for x1 in x])
plt.plot(x, [x1*x1*x1 for x1 in x])
plt.show()
[Figure: linear, square and cube curves drawn on the same axes]
Adding a Legend
# Legends explain the meaning of each line in the graph
x = np.arange(5)
plt.plot(x, x, label='linear')
plt.plot(x, x*x, label='square')
plt.plot(x, x*x*x, label='cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()
[Figure: "Polynomial Graph" with X-axis and Y-axis labels, grid and legend]
Matplotlib provides many types of plot formats for visualising information
1. Scatter Plot
2. Histogram
3. Bar Graph
4. Pie Chart
Histogram
# Histograms display the distribution of a variable over a range of frequencies or values
y = np.random.randn(100,100) # 100x100 array of a Gaussian distribution
plt.hist(y) # Function to plot the histogram takes the dataset as the parameter
plt.show()
[Figure: histogram of y]
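A histogram is a set of bin counts drawn as bars. To look at those numbers directly, the sketch below uses np.histogram() on a one-dimensional sample; the seed, the sample size and the number of bins are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable
y = rng.standard_normal(1000)

# 10 equal-width bins between the smallest and largest value.
counts, edges = np.histogram(y, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

The counts add up to the number of samples, and there is always one more bin edge than there are bins.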
Bar Chart
plt.bar([1,2,3], [3,2,1]) # bar heights are illustrative; the values are illegible in the scanned source
plt.show()
Pie Chart
plt.figure(figsize=(3,3)) # Size of the plot in inches
x = [40,20,5] # Proportions of the sectors
labels = ['Bikes','Cars','Buses']
plt.pie(x, labels=labels)
plt.show()
Scatter Plot
# Scatter plots display values for two sets of data, visualised as a collection of points
# Two Gaussian distributions plotted
x = np.random.randn(1000)
y = np.random.randn(1000)
plt.scatter(x,y)
plt.show()
[Figure: scatter plot of y against x]
Example No. 4: Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and visualize.
To perform data extraction and data manipulation in R, we use the 'Census Income' dataset from the UCI Machine Learning Repository, which contains the income information of over 48,000 individuals, taken from the 1994 US census.
To import the dataset:
census <- read.csv("C:\\Users\\Intellipaat-Team\\Desktop\\census-income.csv")
> class(census)
[1] "data.frame"
> dim(census)
[1] 30162 15
> names(census)
 [1] "age" "workclass" "fnlwgt" "education" "education.num"
 [6] "marital.status" "occupation" "relationship" "race" "sex"
[11] "capital.gain" "capital.loss" "hours.per.week" "native.country" "X"
> head(census) #First six rows and all columns
  age workclass        fnlwgt education education.num
1 39  State-gov         77516 Bachelors 13
2 50  Self-emp-not-inc  83311 Bachelors 13
3 38  Private          215646 HS-grad    9
4 53  Private          234721 11th       7
5 28  Private          338409 Bachelors 13
6 37  Private          284582 Masters   14
To remove whitespaces from the above columns, we use the mutate_if and the str_trim functions from the dplyr and the stringr packages, respectively.
library("dplyr")
library(stringr)
census %>%
  mutate_if(is.character, str_trim) -> census
After performing the above operation, all the leading and trailing whitespaces will be removed.
To get back to the original structure, convert the above columns back to factors:
#Convert character columns back to factors
census$workclass <- as.factor(census$workclass)
census$occupation <- as.factor(census$occupation)
census$native.country <- as.factor(census$native.country)
census$education <- as.factor(census$education)
census$marital.status <- as.factor(census$marital.status)
census$relationship <- as.factor(census$relationship)
census$race <- as.factor(census$race)
census$sex <- as.factor(census$sex)
census$X <- as.factor(census$X)
Data Extraction in R
In this example, we will use the indexing features in R to perform data extraction on the 'census' dataset.
Data must be in clean and tidy format.
First, use the base R functions to extract rows and columns from a data frame.
For example:
#select columns age, education, sex
mycol <- c("age", "education", "sex")
> census[mycol]
  age education sex
1 39  Bachelors Male
2 50  Bachelors Male
3 38  HS-grad   Male
4 53  11th      Male
5 28  Bachelors Female
6 37  Masters   Female
# First row and second and third columns
census[1, 2:3]
  workclass fnlwgt
1 State-gov 77516
# First 6 rows and second column as a data frame
as.data.frame(census[1:6, 2])
  census[1:6, 2]
1 State-gov
2 Self-emp-not-inc
3 Private
4 Private
5 Private
6 Private
#Element at 5th row, tenth column
census[5,10]
[1] Female
Levels: Female Male
# exclude columns age, occupation, and race
mycols <- names(census) %in% c("age", "occupation", "race")
newdata <- census[!mycols]
# exclude 3rd and 5th column
newdata <- census[c(-3,-5)]
# delete columns fnlwgt and education.num
census$fnlwgt <- census$education.num <- NULL
#selecting rows based on column values
newdata <- census[ which(census$sex=='Female'
  & census$age > 65), ]
Example No. 5: Perform Time Series Analysis and apply the various visualization techniques.
There are different types of visualizations for time series data. They are:
1. Line Plots.
2. Histograms and Density Plots.
3. Box and Whisker Plots.
4. Heat Maps.
5. Lag Plots or Scatter Plots.
6. Autocorrelation Plots.
This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia. The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.
Step 1: Download the dataset "daily-minimum-temperatures.csv" from
https://github.com/jbrownlee/Datasets
Step 2: import package and read dataset
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
print(series.head())
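The steps in this example pass squeeze=True to read_csv. That argument was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent installation the call raises an error. A minimal sketch of the replacement is shown below; it reads three made-up rows from a string instead of the real file, and the column names Date and Temp are assumptions about the file layout.

```python
import io
import pandas as pd

# Three made-up rows in the same two-column layout (Date, Temp) as the dataset.
csv_text = "Date,Temp\n1981-01-01,20.7\n1981-01-02,17.9\n1981-01-03,18.8\n"

# read_csv returns a one-column DataFrame indexed by date ...
frame = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0, parse_dates=True)

# ... and squeeze("columns") turns that single column into a Series.
series = frame.squeeze("columns")

print(type(series).__name__)
print(series.head())
```

With the real file, the same pattern would be read_csv(path, header=0, index_col=0, parse_dates=True).squeeze("columns"), which also works on older pandas versions.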
Step 3: Time Series Line Plot
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
[Figure: line plot of the Minimum Daily Temperatures series]
Step 4: plot the observations as black dots instead of a connected line
series.plot(style='k.')
pyplot.show()
Step 5: group data by year
The Minimum Daily Temperatures dataset spans 10 years. We can group data by year and create a line plot for each year for direct comparison. A plot of this contrived DataFrame is created with each column visualized as a subplot with legends removed to cut back on the clutter.
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.plot(subplots=True, legend=False)
pyplot.show()
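The grouping in Step 5 can be tried without the data file. The sketch below builds a synthetic daily series for three non-leap years and arranges it with one column per year. It groups on series.index.year instead of Grouper(freq='A'), since the 'A' alias is deprecated in recent pandas releases; this swap is a choice made for the illustration, not the method used in the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic daily values for 1981-1983 (not the real temperatures).
index = pd.date_range("1981-01-01", "1983-12-31", freq="D")
series = pd.Series(np.arange(len(index), dtype=float), index=index)

# One column per year; grouping on index.year avoids frequency aliases.
years = pd.DataFrame({
    year: group.to_numpy() for year, group in series.groupby(series.index.year)
})
print(years.shape)
print(years.mean())
```

Every column must have the same length for this construction to work. The dataset has 3,650 observations over 10 years, which suggests 365 values per year.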
Step 6: Time Series Histogram and Density Plots
This step creates a histogram plot of the observations in the Minimum Daily Temperatures dataset.
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.hist()
pyplot.show()
Step 7: density plot of the Minimum Daily Temperatures dataset.
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot(kind='kde')
pyplot.show()
Step 8: Time Series Box and Whisker Plots by Interval
Box and whisker plots can be created and compared for each interval in a time series, such as years, months, or days.
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
pyplot.show()
Step 9: box and whisker plots are created for each month-column in the newly constructed DataFrame.
# Create a boxplot of monthly data
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
from pandas import concat
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
one_year = series['1990']
groups = one_year.groupby(Grouper(freq='M'))
months = concat([DataFrame(x[1].values) for x in groups], axis=1)
months.boxplot()
pyplot.show()
Step 10: Time Series Heat Maps
This step creates a heatmap of the Minimum Daily Temperatures data. The matshow() function from the matplotlib library is used as no heatmap support is provided directly in Pandas.
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years = years.T  # transpose so that each row is one year
pyplot.matshow(years, interpolation=None, aspect='auto')
pyplot.show()
Step 11: heat map comparing the months of the year in 1990. Each column represents one month, with rows representing the days of the month from 1 to 31.
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
from pandas import concat
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
one_year = series['1990']
groups = one_year.groupby(Grouper(freq='M'))
months = concat([DataFrame(x[1].values) for x in groups], axis=1)
months = DataFrame(months)
months.columns = range(1,13)
pyplot.matshow(months, interpolation=None, aspect='auto')
pyplot.show()
Step 12: Time Series Lag Scatter Plots
In a lag plot, a ball in the middle or a spread across the plot suggests a weak or no relationship.
# create a scatter plot
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import lag_plot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
lag_plot(series)
pyplot.show()
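The visual impression from a lag plot can be summarised by one number, the correlation between each observation and the previous one. The sketch below computes it on a synthetic series (a sine wave plus noise, not the temperature data) by pairing the series with a shifted copy of itself.

```python
import numpy as np
import pandas as pd

# Synthetic series: a slow sine wave plus noise (not the real temperatures).
rng = np.random.default_rng(1)
t = np.arange(365)
series = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, t.size))

# Pair each observation with the previous one, as a lag plot does.
pairs = pd.concat([series.shift(1), series], axis=1).dropna()
pairs.columns = ["t", "t+1"]

# Pearson correlation between the two columns summarises the scatter.
r = pairs["t"].corr(pairs["t+1"])
print(round(r, 3))
print(round(series.autocorr(lag=1), 3))   # pandas shortcut for the same number
```

A value near 1 corresponds to points lying close to the diagonal of the lag plot; a value near 0 corresponds to the ball-shaped cloud described above.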
The lag plot above shows the relationship of an observation with its lag-1 value. A scatter plot can also be created for the observation with each value in the last week.
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
values = DataFrame(series.values)
lags = 7
columns = [values]
for i in range(1, (lags + 1)):
    columns.append(values.shift(i))
dataframe = concat(columns, axis=1)
columns = ['t+1']
for i in range(1, (lags + 1)):
    columns.append('t-' + str(i))
dataframe.columns = columns
pyplot.figure(1)
for i in range(1, (lags + 1)):
    ax = pyplot.subplot(240 + i)
    ax.set_title('t+1 vs t-' + str(i))
    pyplot.scatter(x=dataframe['t+1'].values, y=dataframe['t-'+str(i)].values)
pyplot.show()
Step 13: Time Series Autocorrelation Plots
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot
series = read_csv('/content/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
autocorrelation_plot(series)
pyplot.show()
Example No. 6: Perform Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user interaction, etc.
Creating a Multilayer interactive map
Step 1:
Folium supports creating maps with multiple layers. Recent versions of GeoPandas have built-in support to create interactive folium maps from a GeoDataFrame. We create a multi-layer interactive map using 2 vector datasets.
Setup and Data Download
%%capture
if 'google.colab' in str(get_ipython()):
    !apt install libspatialindex-dev
    !pip install fiona shapely pyproj rtree mapclassify
    !pip install geopandas
import os
import folium
from folium import Figure
import geopandas as gpd
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
Step 2
Import the data set
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

filename = 'karnataka.gpkg'
data_url = 'https://github.com/spatialthoughts/python-dataviz-web/raw/main/data/osm/'
download(data_url + filename)
Step 3
Using GeoPandas explore()
We can use the explore() method to create an interactive folium map from the GeoDataFrame. When you call explore(), a folium object is created. You can save that object and use it to display or add more layers to the map.
data_pkg_path = 'data'
filename = 'karnataka.gpkg'
path = os.path.join(data_pkg_path, filename)
roads_gdf = gpd.read_file(path, layer='karnataka_highways')
districts_gdf = gpd.read_file(path, layer='karnataka_districts')
state_gdf = gpd.read_file(path, layer='karnataka')
m = districts_gdf.explore()
bounds = districts_gdf.total_bounds
bounds
Output
array([74.05096229, 11.58237791, 78.58829529, 18.47673602])
The explore() function takes an m argument where we can supply an existing folium map on which to render the GeoDataFrame.
fig = Figure(width=800, height=400)
m = folium.Map()
m.fit_bounds([[bounds[1],bounds[0]], [bounds[3],bounds[2]]])
districts_gdf.explore(m=m)
fig.add_child(m)
Output
[Figure: interactive map of the Karnataka districts]
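The index shuffling in the fit_bounds() call is easy to get wrong. GeoPandas reports total_bounds as (minx, miny, maxx, maxy), that is, longitude before latitude, while folium expects two [latitude, longitude] corners. The sketch below wraps the conversion in a small helper; the function name bounds_to_folium is hypothetical and the numbers are the bounds values printed in this step.

```python
def bounds_to_folium(bounds):
    """Convert (minx, miny, maxx, maxy) into [[south, west], [north, east]]."""
    minx, miny, maxx, maxy = bounds
    return [[miny, minx], [maxy, maxx]]

# Values printed for districts_gdf.total_bounds in this step.
bounds = (74.05096229, 11.58237791, 78.58829529, 18.47673602)
print(bounds_to_folium(bounds))
```

The result could then be passed directly as m.fit_bounds(bounds_to_folium(bounds)).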
Step 5
Folium supports a variety of basemaps. Let's change the basemap to use Stamen Terrain tiles. Additionally, we can change the styling using the color and style_kwds parameters.
fig = Figure(width=800, height=400)
m = folium.Map(tiles='Stamen Terrain')
m.fit_bounds([[bounds[1],bounds[0]], [bounds[3],bounds[2]]])
districts_gdf.explore(
    m=m,
    color='black',
    style_kwds={'fillOpacity': 0.3, 'weight': 0.5},
)
fig.add_child(m)
Step 6
The GeoDataFrame contains roads of different categories as given in the ref column. Let's add a category column and use it to apply different styles to each category of road.
def get_category(row):
    ref = str(row['ref'])
    if 'NH' in ref:
        return 'NH'
    elif 'SH' in ref:
        return 'SH'
    else:
        return 'NA'

roads_gdf['category'] = roads_gdf.apply(get_category, axis=1)
roads_gdf
[Output: the roads_gdf table with the new category column]
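The row-wise apply() in Step 6 works the same way on an ordinary pandas DataFrame, which makes it easy to test without the map data. The sketch below uses made-up road references; only the get_category logic is taken from the step above.

```python
import pandas as pd

def get_category(row):
    ref = str(row["ref"])
    if "NH" in ref:
        return "NH"
    elif "SH" in ref:
        return "SH"
    else:
        return "NA"

# Made-up road references; None stands for a road without a ref value.
roads = pd.DataFrame({"ref": ["NH48", "SH17", None, "NH75;SH8"]})
roads["category"] = roads.apply(get_category, axis=1)

print(roads)
print(roads["category"].value_counts())
```

Because the NH test comes first, a road whose ref mentions both kinds of highway is classified as NH.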
Step 7
Create Multi-layer Maps
When you call explore(), a folium object is created. You can save that object and add more layers to the same object.
fig = Figure(width=800, height=400)
m = folium.Map(tiles='Stamen Terrain')
m.fit_bounds([[bounds[1],bounds[0]], [bounds[3],bounds[2]]])
districts_gdf.explore(
    m=m,
    color='black',
    style_kwds={'fillOpacity': 0.3, 'weight': 0.5},
    name='districts',
    tooltip=False)
roads_gdf.explore(
    m=m,
    column='category',
    categories=['NH', 'SH'],
    cmap=['#1f78b4', '#e31a1c'],
    categorical=True,
    name='highways'
)
fig.add_child(m)
[Figure: multi-layer map with the districts and highways layers]
Example No. 7: Build cartographic visualization [...] various countries [...]
7.1 Cartographic visualization from basemap
Step 1: Robinson Projection
!pip install "basemap==1.3.0b1"
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
# lon_0 is central longitude of projection.
# resolution = 'c' means use crude resolution coastlines.
m = Basemap(projection='robin', lon_0=0, resolution='c')
m.drawcoastlines()
m.fillcontinents(color='red', lake_color='coral')
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary(fill_color='green')
plt.title("Robinson Projection")
plt.show()
[Figure: world map titled "Robinson Projection"]
Step 2: Gall-Peters Projection
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap()
map.drawcoastlines()
plt.savefig('test.png')  # save before show(), otherwise the saved figure may be empty
plt.show()
Step 3: Draw great circle between NY and London.
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
# create new figure, axes instances.
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
# setup mercator map projection.
m = Basemap(llcrnrlon=-100., llcrnrlat=20., urcrnrlon=20., urcrnrlat=60.,
            rsphere=(6378137.00, 6356752.3142),
            resolution='l', projection='merc',
            lat_0=40., lon_0=-20., lat_ts=20.)
# nylat, nylon are lat/lon of New York
nylat = 40.78; nylon = -73.98
# lonlat, lonlon are lat/lon of London.
lonlat = 51.53; lonlon = 0.08
# draw great circle route between NY and London
m.drawgreatcircle(nylon, nylat, lonlon, lonlat, linewidth=5, color='r')
m.drawcoastlines()
m.fillcontinents()
# draw parallels
m.drawparallels(np.arange(10, 90, 20), labels=[1, 1, 0, 1])
# draw meridians
m.drawmeridians(np.arange(-180, 180, 30), labels=[1, 1, 0, 1])
ax.set_title('Great Circle from New York to London')
plt.show()
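The great circle drawn in Step 3 is the shortest path between the two cities on the globe. As a rough cross-check, its length can be estimated with the haversine formula. This is a minimal sketch that assumes a spherical Earth of radius 6371 km, whereas the Basemap call uses an ellipsoid given by rsphere. The function name great_circle_km is illustrative.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two lat/lon points on a sphere, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Coordinates copied from Step 3.
nylat, nylon = 40.78, -73.98
lonlat, lonlon = 51.53, 0.08
distance = great_circle_km(nylat, nylon, lonlat, lonlon)
print(f"New York to London: about {distance:.0f} km")
```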
Step 4: Draw day-night terminator on map
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from datetime import datetime
# miller projection
map = Basemap(projection='mill', lon_0=180)
# plot coastlines, draw label meridians and parallels.
map.drawcoastlines()
map.drawparallels(np.arange(-90, 90, 30), labels=[1, 0, 0, 0])
map.drawmeridians(np.arange(map.lonmin, map.lonmax+30, 60), labels=[0, 0, 0, 1])
# fill continents 'coral' (with zorder=0), color wet areas 'aqua'
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral', lake_color='aqua')
# shade the night areas, with alpha transparency so the
# map shows through. Use current time in UTC.
date = datetime.utcnow()
CS = map.nightshade(date)
plt.title('Day/Night Map for %s (UTC)' % date.strftime("%d %b %Y %H:%M:%S"))
plt.show()
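The shape of the day-night terminator changes through the year because the latitude at which the sun is directly overhead (the solar declination) moves between about 23.44 degrees north and south. The sketch below uses a common textbook approximation of the declination; it is not Basemap's own algorithm, and the function name is illustrative.

```python
import math
from datetime import datetime


def approx_solar_declination_deg(date):
    """Approximate solar declination in degrees for a given date.

    Uses the simple cosine approximation based on the day of the year.
    """
    day_of_year = date.timetuple().tm_yday
    angle = math.radians(360.0 / 365.0 * (day_of_year + 10))
    return -23.44 * math.cos(angle)


for label, day in [("March equinox", datetime(2023, 3, 21)),
                   ("June solstice", datetime(2023, 6, 21)),
                   ("December solstice", datetime(2023, 12, 21))]:
    print(label, round(approx_solar_declination_deg(day), 2))
```

Near the equinoxes the value is close to zero and the terminator runs almost pole to pole; near the solstices one polar region stays in daylight.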
Step 5: Contour lines over filled continent background
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
# set up orthographic map projection with
# perspective of satellite looking down at 50N, 100W.
# use low resolution coastlines.
map = Basemap(projection='ortho', lat_0=45, lon_0=-100, resolution='l')
# draw coastlines, country boundaries, fill continents.
map.drawcoastlines(linewidth=0.25)
map.drawcountries(linewidth=0.25)
map.fillcontinents(color='coral', lake_color='aqua')
# draw the edge of the map projection region (the projection limb)
map.drawmapboundary(fill_color='green')
# draw lat/lon grid lines every 30 degrees.
map.drawmeridians(np.arange(0, 360, 30))
map.drawparallels(np.arange(-90, 90, 30))
# make up some data on a regular lat/lon grid.
nlats = 73; nlons = 145; delta = 2.*np.pi/(nlons-1)
lats = (0.5*np.pi-delta*np.indices((nlats,nlons))[0,:,:])
lons = (delta*np.indices((nlats,nlons))[1,:,:])
wave = 0.75*(np.sin(2.*lats)**8*np.cos(4.*lons))
mean = 0.5*np.cos(2.*lats)*((np.sin(2.*lats))**2 + 2.)
# compute native map projection coordinates of lat/lon grid.
x, y = map(lons*180./np.pi, lats*180./np.pi)
# contour data over the map.
cs = map.contour(x, y, wave+mean, 15, linewidths=1.5)
plt.title('contour lines over filled continent background')
plt.show()
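The grid construction in Step 5 is compact, so it helps to check what it produces before contouring. The sketch below rebuilds the same lat/lon grid with NumPy only (no Basemap needed) and prints its extent in degrees. The names rows, cols, lat_deg and lon_deg are introduced here for illustration.

```python
import numpy as np

nlats, nlons = 73, 145
delta = 2.0 * np.pi / (nlons - 1)          # grid spacing in radians

# np.indices returns one array of row numbers and one of column numbers.
rows, cols = np.indices((nlats, nlons))
lats = 0.5 * np.pi - delta * rows          # +pi/2 at the first row, -pi/2 at the last
lons = delta * cols                        # 0 at the first column, 2*pi at the last

lat_deg = np.degrees(lats)
lon_deg = np.degrees(lons)
print("shape:", lat_deg.shape)
print("latitude range:", lat_deg.min(), "to", lat_deg.max())
print("longitude range:", lon_deg.min(), "to", lon_deg.max())
print("spacing in degrees:", np.degrees(delta))
```

With 145 longitudes spanning 360 degrees the spacing is 2.5 degrees, and 73 latitudes at that spacing cover pole to pole.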
Step 6: Mercator Projection
from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib.pyplot as plt
# llcrnrlat,llcrnrlon,urcrnrlat,urcrnrlon
# are the lat/lon values of the lower left and upper right corners
# of the map.
# lat_ts is the latitude of true scale.
# resolution = 'c' means use crude resolution coastlines.
m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80,
            llcrnrlon=-180, urcrnrlon=180, lat_ts=20, resolution='c')
m.drawcoastlines()
m.fillcontinents(color='yellow', lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(-90., 91., 30.))
m.drawmeridians(np.arange(-180., 181., 60.))
m.drawmapboundary(fill_color='green')
plt.title("Mercator Projection")
plt.show()
Figure: Mercator Projection
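The Mercator map in Step 6 is clipped at 80 degrees north and south because the projection stretches without limit toward the poles. A minimal sketch of the spherical Mercator formula y = R ln(tan(pi/4 + lat/2)) shows the growth. It assumes a unit sphere and ignores the lat_ts scaling used in the Basemap call; mercator_y is an illustrative name.

```python
import math


def mercator_y(lat_deg, radius=1.0):
    """Vertical Mercator coordinate of a latitude on a sphere."""
    phi = math.radians(lat_deg)
    return radius * math.log(math.tan(math.pi / 4 + phi / 2))


for lat in (0, 30, 60, 80, 89):
    print(lat, round(mercator_y(lat), 3))
```

Equal steps in latitude take up more and more vertical space, which is why areas near the poles look much larger than they are.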
7.2 Visualization for multiple datasets: India
Step 1: Installing GeoPandas and Shapely
!pip install geopandas
!pip install pyshp
Step 2: Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
Step 3: Download the mapping data
https://github.com/Princenihith/Maps_with_python
Step 4: Load the data
fp = r'/content/india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.head()
   id    st_nm                        geometry
0  None  Andaman and Nicobar Islands  MULTIPOLYGON (((93.84831 7.24028, 93.92705 7.0...
1  None  Arunachal Pradesh            POLYGON ((95.23643 26.68105, 95.19594 27.03612...
2  None  Assam                        POLYGON ((95.19594 27.03612, 95.08795 26.94578...
3  None  Bihar                        POLYGON ((88.11357 26.54028, 88.28006 26.37640...
4  None  Chandigarh                   POLYGON ((76.4208 30.76124, 76.83758 30.72552...
Step 5: Plotting the Shapefiles
map_df.plot()
Step 6: Adding better data insights into the map
ls_df = pd.read_csv('/content/globallandslides.csv')
pd.set_option('display.max_columns', None)
# keep only the events recorded in India
ls_df = ls_df[ls_df.country_name == "India"].copy()
ls_df["Year"] = pd.to_datetime(ls_df["event_date"]).dt.year
ls_df = ls_df[ls_df.landslide_category == "landslide"].copy()
# normalise state names (written with macrons in the catalog) to the shapefile spelling
ls_df["admin_division_name"] = ls_df["admin_division_name"].replace({
    "Nāgāland": "Nagaland",
    "Meghālaya": "Meghalaya",
    "Tamil Nādu": "Tamil Nadu",
    "Karnātaka": "Karnataka",
    "Gujarāt": "Gujarat",
    "Arunāchal Pradesh": "Arunachal Pradesh",
})
# count the landslides per state
state_df = ls_df["admin_division_name"].value_counts()
state_df = state_df.to_frame()
state_df.reset_index(level=0, inplace=True)
state_df.columns = ['State', 'Count']
state_df.at[15, "Count"] = 69
state_df.at[0, "State"] = "Jammu and Kashmir"
state_df.at[20, "State"] = "Delhi"
state_df.drop(7)
    State
0   Jammu and Kashmir
1   Uttarakhand
2   Himachal Pradesh
3   Assam
4   Nagaland
5   Maharashtra
6   Manipur
8   Kerala
9   Arunachal Pradesh
10  Tamil Nadu
11  Karnataka
12  Sikkim
13  Meghalaya
14  Mizoram
15  West Bengal
16  Goa
17  Andhra Pradesh
18  Rajasthan
19  Odisha
20  Delhi
21  NCT
22  Tripura
23  Haryana
24  Gujarat
25  Uttar Pradesh
26  State of Odisha
27  Madhya Pradesh
28  Bihar
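Step 6 turns a column of state names into a two-column table of states and counts, then calls drop on one row. The toy example below is a sketch of the same pattern on made-up data. It uses rename_axis and reset_index(name=...) so that the column names do not depend on the pandas version, and it shows that drop returns a new frame and leaves the original unchanged unless the result is assigned back. The event records here are hypothetical.

```python
import pandas as pd

# Hypothetical event records; the state names are illustrative only.
events = pd.DataFrame({
    "admin_division_name": ["Assam", "Kerala", "Assam", "Sikkim", "Assam", "Kerala"]
})

# Count events per state and turn the result into a two-column frame.
counts = events["admin_division_name"].value_counts()
state_counts = counts.rename_axis("State").reset_index(name="Count")
print(state_counts)

# drop() returns a copy without row 2; state_counts itself still has three rows.
trimmed = state_counts.drop(2)
print(len(state_counts), len(trimmed))
```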
Step 7: Merge the data
merged = map_df.set_index('st_nm').join(state_df.set_index('State'))
merged['Count'] = merged['Count'].replace(np.nan, 0)
merged.head()
                             id    geometry                                            Count
st_nm
Andaman and Nicobar Islands  None  MULTIPOLYGON (((93.84831 7.24028, 93.92705 7.0...  0.0
Arunachal Pradesh            None  POLYGON ((95.23643 26.68105, 95.19594 27.03612...   40.0
Assam                        None  POLYGON ((95.19594 27.03612, 95.08795 26.94578...   74.0
Bihar                        None  POLYGON ((88.11357 26.54028, 88.28006 26.37640...   1.0
Chandigarh                   None  POLYGON ((76.4208 30.76124, 76.83758 30.72552...    0.0
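The merge in Step 7 is an index-based left join: every region in the map table is kept, and regions with no matching count get NaN, which is then replaced by 0. The sketch below reproduces that behaviour with plain pandas frames so it can be run without a shapefile. The region list and counts are illustrative values, not the lab's data.

```python
import pandas as pd

# Stand-in for the map table (one row per region) and the counts table.
regions = pd.DataFrame({"st_nm": ["Assam", "Bihar", "Chandigarh"],
                        "id": [None, None, None]})
counts = pd.DataFrame({"State": ["Assam", "Bihar"], "Count": [74, 1]})

# join() matches on the index and keeps every row of the left frame.
merged = regions.set_index("st_nm").join(counts.set_index("State"))

# Regions without a count come back as NaN; treat them as zero events.
merged["Count"] = merged["Count"].fillna(0)
print(merged)
```

Any state whose spelling differs between the two tables would silently end up with a count of 0, which is why the names are normalised before merging.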
Step 8: Plotting the data on the Shapefile
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.axis('off')
ax.set_title('Number of landslides in India state-wise', fontdict={'fontsize': '20', 'fontweight': '10'})
# Plot the figure
merged.plot(column='Count', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0',
            legend=True, markersize=[39.739192, -104.990337],
            legend_kwds={'label': "Number of landslides"})
Figure: Number of landslides in India state-wise
Example No.8. Perform EDA on Wine Quality Data Set.
Download the dataset: https://github.com/aniruddhachoudhury/Red-Wine-Quality/blob/master/winequality-red.csv
Step 1: Importing some essential libraries in Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Read the data set
train_df = pd.read_csv("/content/winequality-red.csv")
train_df.sample(6)
Step 3: Check for null elements in the data
train_df.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
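The wine data has no missing values, so the null check in Step 3 returns all zeros. To see what the same call reports when values are missing, here is a small sketch on a made-up frame with two gaps; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing values.
sample = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, np.nan],
    "quality": [5, 5, 6],
})

# Number of missing values in each column.
missing = sample.isnull().sum()
print(missing)

# Rows that contain at least one missing value.
rows_with_gaps = sample[sample.isnull().any(axis=1)]
print(rows_with_gaps)
```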
Step 4: Statistical summary, excluding NaN values
train_df.describe()
Step 5: Exploratory Data Analysis
plt.figure(figsize=(15,15))
sns.heatmap(train_df.corr(), color="k", annot=True)
Step 6: Quality of the wine (from the data)
The following features are positively correlated:
total sulfur dioxide with free sulfur dioxide; fixed acidity with density and citric acid; alcohol with quality.
The following features are inversely correlated:
fixed acidity with pH; citric acid with pH and volatile acidity.
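Reading the strongest relationships off a heatmap by eye is error-prone, so it can help to list them directly from the correlation matrix. The helper below is a sketch, not part of the original exercise: strongest_pairs is a hypothetical name, and the toy frame stands in for train_df.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df, top=3):
    """Return the `top` column pairs with the largest absolute correlation."""
    corr = df.corr()
    # Keep only the upper triangle so that each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Toy data: b rises with a, c falls with a, d is only loosely related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 4.1, 2.9, 2.2, 1.0],
    "d": [1.0, 3.0, 2.0, 5.0, 4.0],
})
result = strongest_pairs(toy, top=3)
print(result)
```

Calling strongest_pairs(train_df, top=5) would list the five strongest pairs in the wine data, with negative values marking inverse relationships.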
sns.countplot(x='quality', data=train_df)
sns.countplot(x='alcohol', data=train_df)
Step 7: Different Plots
Relations between fixed acidity and quality
plt.figure(figsize=(15,5))
sns.swarmplot(x="quality", y="fixed acidity", data=train_df)
plt.title('fixed acidity and quality')
Relations between alcohol & chlorides
sns.swarmplot(x="alcohol", y="chlorides", data=train_df)
plt.title('Relations between alcohol & chlorides')
Step 8: Plot
Step 8.1: Relations between fixed acidity and quality
plt.figure(figsize=(15,5))
sns.boxplot(x="quality", y="fixed acidity", data=train_df)
Step 8.2: Relationship between alcohol & chlorides
plt.figure(figsize=(15,5))
sns.boxplot(x="alcohol", y="chlorides", data=train_df)
train_df.groupby('quality')['fixed acidity'].mean().plot.line()
plt.ylabel("fixed acidity")
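Each box in the box plots above summarises one group by its quartiles, and points beyond 1.5 times the interquartile range from the box are drawn as outliers (the default whisker rule). The sketch below computes those numbers with NumPy for a made-up sample; box_stats is an illustrative helper name.

```python
import numpy as np


def box_stats(values):
    """Quartiles and 1.5*IQR fences, as used by a default box plot."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
stats = box_stats(sample)
outliers = [v for v in sample if v < stats["lower_fence"] or v > stats["upper_fence"]]
print(stats)
print("outliers:", outliers)
```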
Step 8.3: Relations between alcohol and chlorides
plt.figure(figsize=(10,4))
sns.barplot(x="alcohol", y="chlorides", data=train_df)
Step 8.4: Relations between volatile acidity and quality
plt.figure(figsize=(10,4))
sns.barplot(x="quality", y="volatile acidity", data=train_df)
Step 8.5: Relations between quality and volatile acidity
train_df.groupby('quality')['volatile acidity'].mean().plot.line()
plt.ylabel("volatile acidity")
Step 8.6: Relation between quality and sulphates
plt.figure(figsize=(10,4))
sns.barplot(x="quality", y="sulphates", data=train_df)
Step 8.7: Group by
train_df.groupby('quality')['sulphates'].mean().plot.line()
plt.ylabel("sulphates")
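The line plot in Step 8.7 shows one point per quality score: the mean sulphates value of all wines with that score. The sketch below shows the underlying groupby computation on a five-row made-up frame, so the numbers can be checked by hand; the values are illustrative and not taken from the wine data.

```python
import pandas as pd

# Made-up wines: two of quality 5, two of quality 6, one of quality 7.
wine = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "sulphates": [0.50, 0.60, 0.60, 0.70, 0.80],
})

# Mean sulphates per quality score; the index holds the quality values.
mean_by_quality = wine.groupby("quality")["sulphates"].mean()
print(mean_by_quality)
```

Calling .plot.line() on this result, as the exercise does, draws the quality scores on the x-axis and the group means on the y-axis.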
Step 8.8: Relation between quality and sulphates
sns.boxplot(x="quality", y="sulphates", data=train_df)