
DEV Lab Manual Student

Ex. No. 1 Install the Data Analysis and Visualization tool: R / Python / Tableau Public / Power BI

Aim:

To install the data analysis and visualization tool: R / Python / Tableau Public / Power BI.
Steps to install Python on the desktop:

Step 1: Download the Python installer binary

1. Open the official Python website in your web browser. Navigate to the
Downloads tab for Windows.
2. Choose the latest Python 3 release. In our example, we choose the latest Python
3.7.3 version.
3. Click on the link to download the Windows x86 executable installer if you are using a
32-bit system. If your Windows installation is a 64-bit system, download the
Windows x86-64 executable installer.

Step 2: Run the executable installer

Once the installer is downloaded, run the Python installer.


1. Check the Install launcher for all users check box. You may also check the Add
Python 3.7 to PATH check box to include the interpreter in the execution path.
2. Select Customize installation and choose the optional features by checking the
following check boxes:
● Documentation
● pip
● tcl/tk and IDLE (to install tkinter and IDLE)
● Python test suite (to install the standard library test suite of Python)
● py launcher, for all users (installs the global launcher for `.py` files, which
makes it easier to start Python)
3. Click Next. This takes you to the Advanced Options available while installing
Python. Here, select the Install for all users and Add Python to environment
variables check boxes. Optionally, you can select the Associate files with Python,
Create shortcuts for installed applications and other advanced options. Make a note
of the Python installation directory displayed in this step; you will need it for the
next step.
4. After selecting the Advanced options, click Install to start the installation.
5. Once the installation is over, you will see a Python Setup Successful window.
Step 3: Add Python to the Environment Variables

The last (optional) step in the installation process is to add Python Path to the System
Environment variables. This step is done to access Python through the command line. In
case you have added Python to environment variables while setting the Advanced options
during the installation procedure, you can avoid this step. Else, this step is done manually
as follows. In the Start menu, search for “advanced system settings”. Select “View
advanced system settings”. In the “System Properties” window, click on the “Advanced”
tab and then click on the “Environment Variables” button. Locate the Python installation
directory on your system. If you followed the steps exactly as above, Python will be
installed in one of the following locations:

● C:\Program Files (x86)\Python37-32: for 32-bit installation


● C:\Program Files\Python37: for 64-bit installation

The folder name may be different from “Python37-32” if you installed a different
version. Look for a folder whose name starts with Python. Append the installation
directory and its Scripts subfolder (for example, C:\Program Files\Python37 and
C:\Program Files\Python37\Scripts) as entries to the PATH variable.

Step 4: Verify the Python installation

You have now successfully installed Python 3.7.3 on Windows 10. You can verify that the
Python installation is successful either through the command line or through the IDLE app
that gets installed along with it. Search for the command prompt and type “python”. You
can see that Python 3.7.3 is successfully installed.
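As a quick sanity check (a minimal sketch; run it inside the Python prompt that opens), you can confirm the version and the interpreter path:

# Verify the installation from inside the Python prompt
import sys
print(sys.version)      # should report 3.7.3 (or whichever version you installed)
print(sys.executable)   # full path of the interpreter that is running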

Result:

The data analysis and visualization tool (R / Python / Tableau Public / Power BI) is
installed and verified.
Ex. No. 2 Perform exploratory data analysis (EDA) with datasets like an email data set.
Export all your emails as a dataset, import them inside a pandas data frame, visualize them
and get different insights from the data

Aim:
To perform exploratory data analysis (EDA) with an email data set
Algorithm:
• Export your Gmail messages as an mbox file (e.g., via Google Takeout) and place the
file where the notebook can read it, such as Google Drive for Colab.
• Load the mbox file with the mailbox module and inspect the keys available on each
message.
• Write the fields of interest (subject, from, date, to, label, thread) to a CSV file and
load it into a pandas DataFrame.
• Convert the date column to timezone-aware datetimes, dropping rows whose dates
cannot be parsed.
• Extract a clean email address from the "from" field and label each message as "sent"
or "inbox" by comparing it with your own address.
• Derive day-of-week, time-of-day, hour and fractional-year columns, and set the date
as the DataFrame index.
• Plot sent and received messages against time of day and year, the average number of
emails per day and per hour, and the counts per day of the week.
• Generate a word cloud from the subject lines after removing common reply/forward
tokens.

Program:

# mailbox is part of the Python standard library, so no pip install is required

from google.colab import drive
drive.mount('/content/gdrive')

import mailbox
# file name below is a placeholder; use the name of your exported mbox file
mboxfile = "gdrive/My Drive/Colab Notebooks/All mail Including Spam and Trash.mbox"
mbox = mailbox.mbox(mboxfile)
mbox
for key in mbox[0].keys():
    print(key)

import csv

with open('mbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([message['subject'], message['from'],
                         message['date'], message['to'],
                         message['X-Gmail-Labels'], message['X-GM-THRID']])

import pandas as pd
dfs = pd.read_csv('mbox.csv', names=['subject', 'from', 'date', 'to',
                                     'label', 'thread'])
dfs.shape

dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x,
                                errors='coerce', utc=True))
dfs = dfs[dfs['date'].notna()]
dfs.to_csv('mbox_clean.csv')
dfs.info()
dfs.head(10)
dfs.shape
import re
import numpy as np

def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

myemail = 'itsmeskm99@gmail.com'   # replace with your own email address
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
dfs.drop(columns='to', inplace=True)
dfs.head(10)

import datetime
import pytz

def refactor_timezone(x):
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)

dfs['date'] = dfs['date'].apply(lambda x: refactor_timezone(x))
dfs['dayofweek'] = dfs['date'].apply(lambda x: x.day_name())
dfs['dayofweek'] = pd.Categorical(dfs['dayofweek'],
    categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday'], ordered=True)
dfs['timeofday'] = dfs['date'].apply(lambda x: x.hour + x.minute/60 + x.second/3600)
dfs['hour'] = dfs['date'].apply(lambda x: x.hour)
dfs['year_int'] = dfs['date'].apply(lambda x: x.year)
dfs['year'] = dfs['date'].apply(lambda x: x.year + x.dayofyear/365.25)
dfs.index = dfs['date']
del dfs['date']
print(dfs.index.min().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs.index.max().strftime('%a, %d %b %Y %I:%M %p'))

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_todo_vs_year(df, ax, color='C0', s=0.5, title=''):
    ind = np.zeros(len(df), dtype='bool')
    est = pytz.timezone('US/Eastern')
    df[~ind].plot.scatter('year', 'timeofday', s=s, alpha=0.6, ax=ax,
                          color=color)
    ax.set_ylim(0, 24)
    ax.yaxis.set_major_locator(MaxNLocator(8))
    ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))),
                        "%H").strftime("%I %p") for ts in ax.get_yticks()])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_title(title)
    ax.grid(ls=':', color='k')
    return ax

sent = dfs[dfs['label'] == 'sent']
received = dfs[dfs['label'] == 'inbox']
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
plot_todo_vs_year(sent, ax[0], title='Sent')
plot_todo_vs_year(received, ax[1], title='Received')

def plot_number_perday_per_year(df, ax, label=None, dt=0.3, **plot_kwargs):
    year = df[df['year'].notna()]['year'].values
    T = year.max() - year.min()
    bins = int(T / dt)
    weights = 1 / (np.ones_like(year) * dt * 365.25)
    ax.hist(year, bins=bins, weights=weights, label=label, **plot_kwargs)
    ax.grid(ls=':', color='k')
from scipy import ndimage
from scipy.interpolate import interp1d

def plot_number_perdhour_per_year(df, ax, label=None, dt=1, smooth=False,
                                  weight_fun=None, **plot_kwargs):
    tod = df[df['timeofday'].notna()]['timeofday'].values
    year = df[df['year'].notna()]['year'].values
    Ty = year.max() - year.min()
    T = tod.max() - tod.min()
    bins = int(T / dt)
    if weight_fun is None:
        weights = 1 / (np.ones_like(tod) * Ty * 365.25 / dt)
    else:
        weights = weight_fun(df)
    if smooth:
        hst, xedges = np.histogram(tod, bins=bins, weights=weights)
        x = np.delete(xedges, -1) + 0.5 * (xedges[1] - xedges[0])
        hst = ndimage.gaussian_filter(hst, sigma=0.75)
        f = interp1d(x, hst, kind='cubic')
        x = np.linspace(x.min(), x.max(), 10000)
        hst = f(x)
        ax.plot(x, hst, label=label, **plot_kwargs)
    else:
        ax.hist(tod, bins=bins, weights=weights, label=label, **plot_kwargs)
    ax.grid(ls=':', color='k')
    orientation = plot_kwargs.get('orientation')
    if orientation is None or orientation == 'vertical':
        ax.set_xlim(0, 24)
        ax.xaxis.set_major_locator(MaxNLocator(8))
        ax.set_xticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))),
                            "%H").strftime("%I %p") for ts in ax.get_xticks()])
    elif orientation == 'horizontal':
        ax.set_ylim(0, 24)
        ax.yaxis.set_major_locator(MaxNLocator(8))
        ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))),
                            "%H").strftime("%I %p") for ts in ax.get_yticks()])
class TriplePlot:
    def __init__(self):
        gs = gridspec.GridSpec(6, 6)
        self.ax1 = plt.subplot(gs[2:6, :4])
        self.ax2 = plt.subplot(gs[2:6, 4:6], sharey=self.ax1)
        plt.setp(self.ax2.get_yticklabels(), visible=False)
        self.ax3 = plt.subplot(gs[:2, :4])
        plt.setp(self.ax3.get_xticklabels(), visible=False)

    def plot(self, df, color='darkblue', alpha=0.8, markersize=0.5,
             yr_bin=0.1, hr_bin=0.5):
        plot_todo_vs_year(df, self.ax1, color=color, s=markersize)
        plot_number_perdhour_per_year(df, self.ax2, dt=hr_bin, color=color,
                                      alpha=alpha, orientation='horizontal')
        self.ax2.set_xlabel('Average emails per hour')
        plot_number_perday_per_year(df, self.ax3, dt=yr_bin, color=color,
                                    alpha=alpha)
        self.ax3.set_ylabel('Average emails per day')
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

plt.figure(figsize=(12, 12))
tpl = TriplePlot()
tpl.plot(received, color='C0', alpha=0.5)
tpl.plot(sent, color='C1', alpha=0.5)
p1 = mpatches.Patch(color='C0', label='Incoming', alpha=0.5)
p2 = mpatches.Patch(color='C1', label='Outgoing', alpha=0.5)
plt.legend(handles=[p1, p2], bbox_to_anchor=[1.45, 0.7], fontsize=14,
           shadow=True)

counts = dfs.dayofweek.value_counts(sort=False)
counts.plot(kind='bar')

addrs = received['from'].value_counts()
addrs[0:4]

plt.figure(figsize=(12, 12))
tpl = TriplePlot()
labels = []
colors = ['C{}'.format(ii) for ii in range(9)]
idx = np.array([1, 2, 3, 7])
for ct, addr in enumerate(addrs.index[idx]):
    tpl.plot(dfs[dfs['from'] == addr], color=colors[ct], alpha=0.3,
             yr_bin=0.5, markersize=1.0)
    labels.append(mpatches.Patch(color=colors[ct], label=addr, alpha=0.5))
plt.legend(handles=labels, bbox_to_anchor=[1.4, 0.9], fontsize=12,
           shadow=True)

sdw = sent.groupby('dayofweek').size() / len(sent)
rdw = received.groupby('dayofweek').size() / len(received)
df_tmp = pd.DataFrame(data={'Outgoing Email': sdw, 'Incoming Email': rdw})
df_tmp.plot(kind='bar', rot=45, figsize=(8, 5), alpha=0.5)
plt.xlabel('')
plt.ylabel('Fraction of weekly emails')
plt.grid(ls=':', color='k', alpha=0.5)
import scipy.ndimage
from scipy.interpolate import interp1d

plt.figure(figsize=(8, 5))
ax = plt.subplot(111)
for ct, dow in enumerate(dfs.dayofweek.cat.categories):
    df_r = received[received['dayofweek'] == dow]
    weights = np.ones(len(df_r)) / len(received)
    wfun = lambda x: weights
    plot_number_perdhour_per_year(df_r, ax, dt=1, smooth=True, color=f'C{ct}',
                                  alpha=0.8, lw=3, label=dow, weight_fun=wfun)
    df_s = sent[sent['dayofweek'] == dow]
    weights = np.ones(len(df_s)) / len(sent)
    wfun = lambda x: weights
    plot_number_perdhour_per_year(df_s, ax, dt=1, smooth=True, color=f'C{ct}',
                                  alpha=0.8, lw=2, label=dow, ls='--',
                                  weight_fun=wfun)
ax.set_ylabel('Fraction of weekly emails per hour')
plt.legend(loc='upper left')

from wordcloud import WordCloud

# exclude automated senders before building the cloud (address is an example)
df_no_arxiv = dfs[dfs['from'] != 'no-reply@arxiv.org']

text = ' '.join(map(str, sent['subject'].values))
stopwords = ['Re', 'Fwd', '3A_']
wrd = WordCloud(width=700, height=480, margin=0, collocations=False)
for sw in stopwords:
    wrd.stopwords.add(sw)
wordcloud = wrd.generate(text)

plt.figure(figsize=(25, 15))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)

Result:
Thus, the exploratory data analysis with the email data set is executed and verified.
Ex. No. 3 Working with Numpy arrays, Pandas data frames, basic plots
using Matplotlib

Aim:

To work with Numpy arrays, Pandas data frames and basic plots using Matplotlib.
Packages:
#Numpy:

Numpy is the core library for scientific computing in Python. It provides a
high-performance multidimensional array object, and tools for working with these arrays.
Program:
→ One dimensional array
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
# creating numpy array
sample_array = np.array(list_1)
print("List in python : ", list_1)
print("Numpy Array in python :", sample_array)

→ Multi dimensional array
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
sample_array = np.array([list_1, list_2, list_3])
print("Numpy multidimensional array in python\n", sample_array)
Anatomy of an array:
→ Axis: The axes of an array describe the order of indexing into the array.
Axis 0 is the first dimension (it runs down the rows), axis 1 is the second
dimension (it runs across the columns), and axis 2 is the third dimension of a
three-dimensional array.
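For example, summing a small two-dimensional array along each axis shows how the axes run (a minimal illustration):

# Axis illustration: summing a 2-D array along each axis
import numpy as np
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.sum(axis=0))   # sums down the rows, one total per column: [5 7 9]
print(m.sum(axis=1))   # sums across the columns, one total per row: [ 6 15]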
→ Shape: The number of elements along each axis. The shape is returned as a tuple.

# importing numpy module


import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1, list_2, list_3])
print("Numpy array :")
print(sample_array)
# print shape of the array
print("Shape of the array :",sample_array.shape)

→ Rank: The rank of an array is simply the number of axes. A one dimensional
array has rank 1 and a two dimensional array has rank 2.
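The rank can be read directly from the ndim attribute, as this short sketch shows:

# Rank (number of axes) via the ndim attribute
import numpy as np
a = np.array([1, 2, 3, 4])       # one dimensional array
b = np.array([[1, 2], [3, 4]])   # two dimensional array
print("Rank of a:", a.ndim)      # 1
print("Rank of b:", b.ndim)      # 2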

→Data type objects (dtype):

Data type objects (dtype) is an instance of the numpy.dtype class. It


describes how the bytes in the fixed-size block of memory corresponding to an array
should be interpreted

# Import module
import numpy as np
# Creating the array
sample_array_1 = np.array([[0, 4, 2]])
sample_array_2 = np.array([0.2, 0.4, 2.4])
# display data type
print("Data type of the array 1 :",sample_array_1.dtype)
print("Data type of array 2 :",sample_array_2.dtype)
#Pandas DataFrame:
Pandas DataFrame is a two-dimensional size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is
a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and
columns. Pandas DataFrame consists of three principal components, the data, rows, and
columns

Basic operations which can be performed on a Pandas DataFrame:


● Creating a DataFrame
● Dealing with rows and columns
● Indexing and selecting data
● Working with missing data
● Iterating over rows and columns (the last two are sketched below)
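The last two operations are not demonstrated later in this exercise, so here is a minimal sketch (the column names are examples):

import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish'],
                   'Age': [20, np.nan, 19]})
# Working with missing data
print(df.isnull().sum())   # count of missing values per column
df_filled = df.fillna(0)   # replace NaN with a default value
# Iterating over rows
for index, row in df.iterrows():
    print(index, row['Name'], row['Age'])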

Creating a Pandas DataFrame


In the real world, a Pandas DataFrame will be created by loading the datasets from
existing storage; storage can be a SQL database, a CSV file, or an Excel file. A Pandas
DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc.
A dataframe can be created in different ways; here are some ways by which we
create a dataframe:
import pandas as pd
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
data = {'Name':['Tom', 'nick', 'krish', 'jack'],'Age':[20, 21, 19, 18]}
df = pd.DataFrame(lst)
df1 = pd.DataFrame(data)

Dealing with rows and columns:


A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular
fashion in rows and columns. We can perform basic operations on rows/columns like
selecting, deleting, adding, and renaming.

Column selection: In order to select a column in a Pandas DataFrame, we can
access the columns by calling them by their column names.
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
print(df[['Name', 'Qualification']])

Row selection: Pandas provides a unique method to retrieve rows from a DataFrame.
The DataFrame loc[ ] method is used to retrieve rows from a Pandas DataFrame.
Rows can also be selected by passing an integer location to the iloc[ ] function.
import pandas as pd
data = pd.read_csv("nba.csv", index_col="Name")   # nba.csv is the sample dataset
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)

Indexing and Selecting Data:


Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of
the rows and all of the columns, or some of each of the rows and columns. Indexing can
also be known as Subset Selection.

Indexing a Dataframe using indexing operator [] :


Indexing operator is used to refer to the square brackets following an object. The
.loc and .iloc indexers also use the indexing operator to make selections. In this
section, the indexing operator refers to df[].

# importing pandas package


import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving columns by indexing operator
first = data["Age"]
print(first)

Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns. The df.loc
indexer selects data in a different way than just the indexing operator. It can
select subsets of rows or columns, and it can also simultaneously select
subsets of rows and columns.
import pandas as pd
data = pd.read_csv("nba.csv", index_col="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)

Indexing a DataFrame using .iloc[ ] :


This function allows us to retrieve rows and columns by position.
In order to do that, we’ll need to specify the positions of the rows that we want, and the
positions of the columns that we want as well. The df.iloc indexer is very similar
to df.loc but only uses integer locations to make its selections.

import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving rows by iloc method
row2 = data.iloc[3]
print(row2)
#Matplotlib
Basic plots in matplotlib:
Matplotlib comes with a wide variety of plots. Plots help to understand trends and
patterns, and to make correlations. They’re typically instruments for reasoning about
quantitative information.
Line Plot:
from matplotlib import pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.plot(x, y)
plt.show()
Bar plot:
from matplotlib import pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.bar(x, y)
plt.show()
Histogram:
from matplotlib import pyplot as plt
y = [10, 5, 8, 4, 2]
plt.hist(y)
plt.show()
Scatter plot:

from matplotlib import pyplot as plt


x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.scatter(x, y)
plt.show()

Result:
Thus, working with Numpy arrays, Pandas data frames and basic plots using
Matplotlib is done.
Ex. No. 4 Explore various variables and row filters in R for cleaning data.
Apply various plot features in R on sample data sets and visualize

Aim:

To explore various variables and row filters in R for cleaning data, and to apply
various plot features in R on sample data sets and visualize.

Algorithm:
• Assign numeric, integer, character, logical, and complex values to variables.
• Perform basic arithmetic operations such as addition, subtraction, multiplication,
division, exponentiation, and modulus.
• Create sequences using the : operator and the seq() function.
• Create vectors of different types including numeric, integer, character, and logical.
• Create matrices using the matrix() function.
• Combine vectors into matrices using cbind() function.
• Create lists containing different types of data.
• Create data frames using the data.frame() function.
• Read datasets from CSV files using the read.csv() function.
• Perform basic operations on datasets such as finding the mean of a column.
• Define functions without arguments and with arguments.
• Demonstrate the use of the return statement in functions.
• Plot different types of charts including bar plots, dot charts, pie charts, histograms, and
box plots using functions like plot(), barplot(), dotchart(), pie(), hist(), and boxplot().
Program:

# Basic operations in R

print("hello world")

# assigning
m = 12
g <- 24

# data types
mohan = 100        # numeric
gopi = 255.36      # also numeric
lucifer = 30L      # integer
alan = 'mohan'     # character
kolo = FALSE       # logical
july = 4 + 12i     # complex

# arithmetic operations
a = 12 + 12   # addition
b = 23 - 32   # subtraction
c = 43 * 52   # multiplication
d = 20 / 2    # division
e1 = 2^4      # power of
e2 = 2**4     # power of
f = 55 %% 2   # remainder (modulus)

# logical operators

a1 = 2 > 1     # greater than
a2 = 2 < 1     # less than
a3 = 2 <= 2    # less than or equal to
a4 = 2 >= 1    # greater than or equal to
a5 = 2 != 2    # not equal to
a6 = 12 == 12  # equal to

# sequences
# possibility 1
seq1 = 1:10
# possibility 2
seq2 = seq(1, 10)
# Data structures
# vectors
num_vec = c(12, 21, 12)
int_vec = c(12L, 12L, 12L)
char_vec = c('mohan', 'gopi', 'lucifer')
bool_vec = c(T, T, F)
mix_vec = c('mohan', T, 2, 2L)
# character, numeric, integer, logical

# matrices
matric = matrix(1:10, nrow = 2, ncol = 5)

# combine two vectors into a matrix
c_bind = cbind(num_vec, char_vec)

# store different types of data
diff_data = list(c(1, 2, 3), 'modafinil', 1:10)

# data frame creation (the second column name is an example)
plain_df = data.frame(names = c('mohan', 'gopi'),
                      marks = c(2, 12))

# read a dataset such as a csv file
# dataset <- read.csv(file.choose())
dataset = read.csv("/Users/91965/Documents/supermarket_sales - Sheet1.csv")  # file name assumed

# finding the mean of a column
mean = mean(dataset[, 7])
#conditional statement (if,else,elseif)
x=12
y=25
#if condition if(condition){expression}
if(x==y){
print("2nd year AI&DS are good")}
#if else condition if(condition){expression}else{expression}
if(x>=y){
print("hello world")
}else{
print("hello hell")
}
if(x>=y){print("parley")
}else if(x<=y){
print("kolo")
}
#for loop
#for(condition){expression}
for(i in 1:5){print("hell")
print("mohan")}
vec = c(21,12,12)
for(value in c_bind){print(value)}
#while loop
#while (condition){expression}
value =1

while (value<5){print("hello")
value = value +1}
#infinity while loop

#functions
#without argument
m = function(){print("mohan")}
#with argument
g = function(arg){arg**2}
l = function(arg1,arg2){
loc = arg1+arg2
return(loc)
}
# various plots in R
# bar plot
# plot(iris$Sepal.Length, iris$Sepal.Width, type = "l")
barplot(iris$Sepal.Length, ylim = c(0, 8), xlab = "iris-sepal-length",
        col = c(2, 3, 7), main = "BAR PLOT")
# dot chart
dotchart(iris$Sepal.Width, ylim = c(0, 100), xlab = "iris-sepal-width",
         main = "DOTCHARTS", pch = 2)
# pie chart (column choice is an example)
pie(iris$Sepal.Length, col = rainbow(9), main = "PIE CHARTS")
# histogram
hist(iris$Petal.Width, xlab = "Petal-Width", main = "HISTOGRAM",
     col = rainbow(9))
# boxplot
boxplot(iris$Petal.Length, iris$Petal.Width,
        names = c("Petal-Length", "Petal-Width"),
        main = "BOX PLOT", col = rainbow(9))

Result:
Thus, the various variables and row filters in R for cleaning data were executed, and
various plot features in R were applied on sample data sets and visualized.
Ex. No. 5 Perform Time Series Analysis and apply the various visualization
techniques

Aim:

To perform Time Series Analysis and apply the various visualization techniques

Algorithm:
• Define a vector x containing weekly COVID-19 positive cases data.
• Import the lubridate library to utilize the decimal_date() function for date
conversion.
• Create a time series object mts using the ts() function with the appropriate start
date and frequency.
• Plot the time series graph using the plot() function with labels and title.
• Save the plot as a PNG file using the png() and dev.off() functions.
Program:

# Weekly data of COVID-19 positive cases from


# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700, 87820, 95314, 126214, 218843,
471497,
936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# output to be created as a png file (file name is an example)
png(file = "timeSeries.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab ="Weekly Data",
ylab ="Total Positive Cases",
main ="COVID-19 Pandemic",
[Link] ="dark green")
# saving the file
[Link]()
MULTIVARIATE TIME SERIES
A multivariate time series plot draws multiple time series in a single chart.

# Weekly data of COVID-19 positive cases and


# weekly deaths from 22 January, 2020 to
# 15 April, 2020
positiveCases <- c(580, 7813, 28266, 59287,75700, 87820, 95314,
126214,218843, 471497, 936851,1508725, 2072113)
deaths <- c(17, 270, 565, 1261, 2126, 2800, 3285, 4628, 8951, 21283, 47210,
88480, 138475)
# library required for decimal_date() function
library(lubridate)
# output to be created as a png file (file name is an example)
png(file = "multivariateTimeSeries.png")
# creating multivariate time series object
# from date 22 January, 2020
mts <- ts(cbind(positiveCases, deaths),
start = decimal_date(ymd("2020-01-22")),frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab ="Weekly Data",
main ="COVID-19 Cases",
[Link] ="dark green")

# saving the file
dev.off()
FORECASTING
Forecasting can be done on a time series using models available in R. Here the
auto.arima() model from the forecast package is used.
# Weekly data of COVID-19 cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# library required for forecasting
library(forecast)
# output to be created as a png file (file name is an example)
png(file = "forecastTimeSeries.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)  # weekly data
# forecasting model using arima model
fit <- auto.arima(mts)
# Next 5 forecasted values
forecast(fit, 5)
# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab ="Weekly Data", ylab ="Total Positive Cases",
main ="COVID-19 Pandemic", [Link] ="darkgreen")
# saving the file
dev.off()

Result:
Thus, a Time Series Analysis was performed and the various visualization
techniques were applied.
Ex. No. 6 Perform Data Analysis and representation on a Map using various
Map data sets with Mouse Rollover effect, user interaction, etc.

Aim:
To perform Data Analysis and representation on a Map using various Map datasets with
Mouse Rollover effect, user interaction, etc

Algorithm:

• Import the required libraries including folium and pandas.


• Define sample data containing information about cities including their names,
populations, latitudes, and longitudes.
• Create a DataFrame df using the sample data.
• Create a base map m using Folium, centered at the mean latitude and longitude of
the cities, with a zoom level of 4.
• Iterate over each row in the DataFrame and add a marker for each city on the map.
• Set the location of each marker to the latitude and longitude of the city.
• Set the popup content of each marker to display the city name and population.
• Set a tooltip for mouse rollover to display the city name.
• Add a marker cluster to the map using the MarkerCluster plugin from Folium.
• Pass the latitude and longitude columns of the DataFrame to the marker cluster.
• Save the interactive map as an HTML file named interactive_map.html.
Program:

import folium
from folium import plugins
import pandas as pd
# Sample data
data = {
'City': ['New York', 'San Francisco', 'Chicago', 'Los Angeles'],
'Population': [8175133, 884363, 2716000, 3792621],
'Latitude': [40.7128, 37.7749, 41.8781, 34.0522],
'Longitude': [-74.0060, -122.4194, -87.6298, -118.2437],
}
df = pd.DataFrame(data)

# Create a base map


m = folium.Map(location=[df['Latitude'].mean(), df['Longitude'].mean()], zoom_start=4)

# Add markers for each city with pop-up information


for i, row in df.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=f"City: {row['City']}\nPopulation: {row['Population']}",
        tooltip=row['City'],  # Tooltip for mouse rollover
    ).add_to(m)

# Add a mouseover effect with plugins


plugins.MarkerCluster(df[['Latitude', 'Longitude']]).add_to(m)

# Save the map


m.save('interactive_map.html')
Result:
Thus, the code representing a Map using various Map data sets with a Mouse
Rollover effect is executed and verified.
Ex. No. 7 Build cartographic visualization for multiple datasets involving
various countries, world states and districts in India etc.

Aim:

To build cartographic visualization for multiple datasets involving various


countries, world states and districts in India etc.

Algorithm:

• Import the required libraries including numpy, pandas, matplotlib.pyplot, seaborn,
and geopandas.

• Define the file path (fp) of the shapefile containing the polygon data of India.

• Use the gpd.read_file() function from GeoPandas to read the shapefile into a
GeoDataFrame (map_df).

• Create a copy of the original GeoDataFrame (map_df_copy) to preserve the original


data.

• Use the plot() function on the GeoDataFrame (map_df) to plot the map of India.

• Set the figure size to (12, 12) to adjust the size of the plot.

• Display the map plot.

Program:
Importing libraries

import numpy as np
import pandas as pd
import [Link] as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')

Reading shape file

fp = r'/kaggle/input/map-with-python/india-polygon.shp'  # shapefile name assumed
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.head()
Plotting map of India

map_df.plot(figsize=(12,12))
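The aim also calls for visualizing data for states or districts on the map. Below is a minimal, hedged sketch of a choropleth; the join column 'st_nm' and the sample values are assumptions, so check map_df.columns for the actual state-name column:

import pandas as pd
# hypothetical per-state values to join onto the map
state_data = pd.DataFrame({
    'st_nm': ['Kerala', 'Tamil Nadu', 'Karnataka'],   # assumed column name
    'value': [10, 25, 17],
})
merged = map_df.merge(state_data, on='st_nm', how='left')
merged.plot(column='value', cmap='OrRd', legend=True, figsize=(12, 12),
            missing_kwds={'color': 'lightgrey'})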

Result:
Thus, the cartographic visualization for India is executed and verified.
Ex. No. 8 Perform EDA on Wine Quality Data Set.

Aim:

To perform EDA on Wine Quality Data Set.

Algorithm:

• Import the necessary libraries including numpy, pandas, seaborn, and
matplotlib.pyplot.

• Use pd.read_csv() to read the CSV file containing the wine quality dataset into
a DataFrame (df).

• Print the first few rows of the DataFrame using df.head() to inspect the data.

• Print the shape, information, and descriptive statistics of the DataFrame using
df.shape, df.info(), and df.describe().

• Print the unique values of the 'quality' column using df['quality'].unique().

• Print the count of each unique value in the 'quality' column using
df['quality'].value_counts().

• Plot a histogram of the 'quality' column using df['quality'].hist().

• Plot a heatmap of the correlation matrix between features using
sns.heatmap(df.corr(), cmap='viridis', annot=True).

• Plot boxplots for each feature using sns.boxplot() in a loop for each column.

• Plot distribution plots (histograms with kernel density estimation) for each
feature using sns.histplot() in a loop for each column.

• Print the skewness and kurtosis of each feature using df.skew() and df.kurtosis().
Program:

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# file name below is an assumption; point it at your copy of the dataset
df = pd.read_csv('C:/Users/Dell/Downloads/archive/winequality-red.csv')
print(df.head())
print(df.shape)
print(df.info())
print(df.describe())
print(df['quality'].unique())
print(df['quality'].value_counts())
print(df['quality'].hist())
fig, ax = plt.subplots(figsize=(15, 7))
print(sns.heatmap(df.corr(), cmap='viridis', annot=True))
l = df.columns.values
number_of_columns = 12
number_of_rows = (len(l) - 1) // number_of_columns + 1
plt.figure(figsize=(number_of_columns, 5 * number_of_rows))
for i in range(0, len(l)):
    plt.subplot(number_of_rows + 1, number_of_columns, i + 1)
    sns.set_style('whitegrid')
    sns.boxplot(y=df[l[i]], color='green')
    plt.tight_layout()
plt.figure(figsize=(2 * number_of_columns, 5 * number_of_rows))
for i in range(0, len(l)):
    plt.subplot(number_of_rows + 1, number_of_columns, i + 1)
    sns.histplot(df[l[i]], kde=True)
print("Skewness \n ", df.skew())
print("\n Kurtosis \n ", df.kurtosis())

Result:

Thus, the EDA has been performed on the Wine Quality data set.


Ex. No. 9 Use case study on a data set and apply the various EDA and
visualization techniques and present analysis report

Aim:

Use a data set and apply the various EDA and visualization techniques and
present analysis report

Algorithm:

• Import required libraries for EDA including Pandas, NumPy, Seaborn, and Matplotlib
for data manipulation, analysis, and visualization.

• Load the car dataset into a DataFrame using Pandas.

• Display the first and last 5 rows of the dataset to get a quick overview.

• Check the data types of columns in the dataset.

• Drop irrelevant columns that are not needed for analysis.

• Rename columns for better readability.

• Drop duplicate rows if any.

• Drop rows with missing or null values.

• Visualize outliers using boxplots for numerical features (e.g., Price, HP, Cylinders).

• Calculate the Interquartile Range (IQR) to identify outliers.

• Remove outliers using the IQR method.

• Plot the number of cars by make using a bar plot to visualize the distribution of car
makes.

• Create a heatmap to visualize the correlation between different numerical features in


the dataset.

• Generate a scatterplot to visualize the relationship between horsepower (HP) and


price.

• Display the generated plots to analyze and interpret the data visually.
Program:
Importing the required libraries for EDA

import pandas as pd
import numpy as np
import seaborn as sns #visualization
import matplotlib.pyplot as plt #visualization
%matplotlib inline
sns.set(color_codes=True)

Loading the data into the data frame

df = pd.read_csv("../input/cardataset/data.csv")   # file name assumed
# To display the top 5 rows
df.head(5)
# To display the bottom 5 rows
df.tail(5)

Checking the types of data

df.dtypes

Dropping irrelevant column

df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity',
              'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)

Renaming the columns

df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders",
                        "Transmission Type": "Transmission",
                        "Driven_Wheels": "Drive Mode"})
df.head(5)

Dropping the duplicate rows

df.shape
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
df.count()
df = df.drop_duplicates()
df.head(5)
df.count()
Dropping the missing or null values

print(df.isnull().sum())
df = df.dropna()   # Dropping the missing values.
df.count()
print(df.isnull().sum())

Detecting Outliers

sns.boxplot(x=df['Price'])
sns.boxplot(x=df['HP'])
sns.boxplot(x=df['Cylinders'])
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape

Plot different features against one another (scatter), against frequency (histogram)

df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10, 5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
Heat Maps

plt.figure(figsize=(10, 5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
Scatterplot

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')

ax.set_ylabel('Price')
plt.show()

Result:
Thus, the various EDA and visualization techniques were applied and the
analysis report was presented.
Ex. No. 10 TIME SERIES ANALYSIS

Aim:
To demonstrate basic time series analysis and visualization techniques using synthetic time
series data. The program generates synthetic time series data, performs basic analysis to understand
the data's structure, and visualizes the data using line plots, histograms, and trend analysis.
Algorithm:
• Define the date range for the time series data, specifying the start date, end date, and frequency
of observations.
• Generate random values to represent the time series data. In this case, we'll use a random walk
pattern where each value is the cumulative sum of random Gaussian noise.
• Create a DataFrame to store the synthetic time series data, with columns for dates and
corresponding values.
• Set the date column as the index of the DataFrame to facilitate time-based indexing and
visualization.
• Print the first few rows of the synthetic dataset to provide an overview of the data structure.
• Plot the synthetic time series data using Matplotlib, with dates on the x-axis and corresponding
values on the y-axis.
• Visualize the distribution of values in the synthetic time series data using a histogram. This plot
helps understand the spread and shape of the data distribution.
• Compute the rolling mean of the time series data to visualize the trend over time.
• Plot the original time series data along with the rolling mean trend line to observe any
underlying trends.
• Display the generated plots to the user for visual analysis and interpretation.

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate synthetic time series data
np.random.seed(0)
date_range = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D')
values = np.random.randn(len(date_range)).cumsum()
df = pd.DataFrame({'Date': date_range, 'Value': values})

# Set the date column as the index
df.set_index('Date', inplace=True)

# Display the first few rows of the dataset
print("Sample Data:")
print(df.head())

# Plot the time series data
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Value'], color='blue', marker='o', linestyle='-')
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()

# Visualize the distribution of the values over time
plt.figure(figsize=(10, 6))
sns.histplot(df['Value'], bins=20, kde=True, color='green')
plt.title('Distribution of Values')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Visualize the trend using a rolling mean
plt.figure(figsize=(10, 6))
rolling_mean = df['Value'].rolling(window=30).mean()  # Adjust window size as needed
plt.plot(df.index, df['Value'], color='blue', label='Original')
plt.plot(df.index, rolling_mean, color='red', label='Rolling Mean')
plt.title('Trend Analysis')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()

RESULT:
Thus, the time series analysis and visualization have been done for the synthetic time series data.
