DEV Lab Manual Student
1 Install the Data Analysis and Visualization tools: R / Python / Tableau Public / Power BI
Aim:
To install a data analysis and visualization tool: R / Python / Tableau Public / Power BI.
Steps to install Python on a desktop:
1. Open the official Python website in your web browser and navigate to the
Downloads tab for Windows.
2. Choose the latest Python 3 release. In our example, we choose the Python
3.7.3 version.
3. Click on the link to download the Windows x86 executable installer if you are using a
32-bit system. If your Windows installation is a 64-bit system, download the
Windows x86-64 executable installer instead.
The last (optional) step in the installation process is to add the Python path to the System
Environment variables. This step is needed to access Python from the command line. If
you already added Python to the environment variables while setting the Advanced options
during installation, you can skip this step. Otherwise, do it manually as follows. In the
Start menu, search for "advanced system settings" and select "View advanced system
settings". In the "System Properties" window, click on the "Advanced" tab and then click
on the "Environment Variables" button. Locate the Python installation directory on your
system; if you followed the steps exactly as above, Python is installed by default under
C:\Users\<username>\AppData\Local\Programs\Python\Python37-32.
The folder name may differ from "Python37-32" if you installed a different version;
look for a folder whose name starts with "Python". Append that folder and its Scripts
subfolder to the PATH variable.
You have now successfully installed Python 3.7.3 on Windows 10. You can verify the
installation either through the command line or through the IDLE app that is installed
alongside it. Open the Command Prompt and type "python"; the startup banner shows
that Python 3.7.3 is successfully installed.
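The same verification can be scripted. Below is a minimal sketch for a POSIX shell (on Windows cmd, simply run `python --version`); depending on how PATH was configured, the interpreter may be available as `python` or `python3`:

```shell
# Pick whichever interpreter name is on PATH (python on Windows, often python3 elsewhere)
PY=$(command -v python || command -v python3)

# Print the installed version without opening the interactive prompt
"$PY" --version

# The same check through the interpreter itself
"$PY" -c "import sys; print(sys.version.split()[0])"
```

If either command prints a 3.x version string, the installation and PATH setup succeeded.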
Result:
The data analysis and visualization tools R / Python / Tableau Public / Power BI were
installed and verified.
Ex. 2 Perform exploratory data analysis (EDA) with datasets such as an email data set.
Export all your emails as a dataset, import them into a pandas data frame, visualize them
and derive different insights from the data
Aim:
To perform exploratory data analysis (EDA) with an email data set.
Algorithm:
• Export the mailbox and write the message headers (subject, from, date, to, label, thread)
to a CSV file.
• Load the CSV into a pandas DataFrame and convert the date column to a single timezone.
• Derive features such as day of week, time of day, hour, and fractional year from the date,
and set the date column as the index.
• Plot the distribution of emails by day of week and by time of day, smoothing the
histograms where helpful.
• Compare the sending patterns of the most frequent correspondents using overlaid plots.
• Plot the fraction of weekly emails per hour for sent and received mail, separated by day
of week.
• Generate a word cloud of the email text and display all plots for visual analysis and
interpretation.
Program:
import csv
# 'mailbox.csv' is a placeholder output filename
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
import datetime
import pytz

def refactor_timezone(x):
    # Convert every timestamp to a single timezone
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)

dfs['date'] = dfs['date'].apply(lambda x: refactor_timezone(x))
dfs['dayofweek'] = dfs['date'].apply(lambda x: x.day_name())
dfs['dayofweek'] = pd.Categorical(dfs['dayofweek'],
                                  categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                              'Friday', 'Saturday', 'Sunday'], ordered=True)
dfs['timeofday'] = dfs['date'].apply(lambda x: x.hour + x.minute/60 + x.second/3600)
dfs['hour'] = dfs['date'].apply(lambda x: x.hour)
dfs['year_int'] = dfs['date'].apply(lambda x: x.year)
dfs['year'] = dfs['date'].apply(lambda x: x.year + x.dayofyear/365.25)
dfs.index = dfs['date']
del dfs['date']
print(dfs.index.min().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs.index.max().strftime('%a, %d %b %Y %I:%M %p'))
from scipy import ndimage
from scipy.interpolate import interp1d
from matplotlib.ticker import MaxNLocator

def plot_number_perdhour_per_year(df, ax, dt=1, smooth=False, weight_fun=None,
                                  label=None, **plot_kwargs):
    # Histogram of emails per hour of the day, normalised over the year span
    tod = df[df['timeofday'].notna()]['timeofday'].values
    year = df[df['year'].notna()]['year'].values
    Ty = year.max() - year.min()
    T = tod.max() - tod.min()
    bins = int(T / dt)
    if weight_fun is None:
        weights = 1 / (np.ones_like(tod) * Ty * 365.25 / dt)
    else:
        weights = weight_fun(df)
    if smooth:
        hst, xedges = np.histogram(tod, bins=bins, weights=weights)
        x = np.delete(xedges, -1) + 0.5*(xedges[1] - xedges[0])
        hst = ndimage.gaussian_filter(hst, sigma=0.75)
        f = interp1d(x, hst, kind='cubic')
        x = np.linspace(x.min(), x.max(), 10000)
        hst = f(x)
        ax.plot(x, hst, label=label, **plot_kwargs)
    else:
        ax.hist(tod, bins=bins, weights=weights, label=label, **plot_kwargs)
    ax.grid(ls=':', color='k')
    orientation = plot_kwargs.get('orientation')
    if orientation is None or orientation == 'vertical':
        ax.set_xlim(0, 24)
        ax.xaxis.set_major_locator(MaxNLocator(8))
        ax.set_xticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))),
                            "%H").strftime("%I %p")
                            for ts in ax.get_xticks()])
    elif orientation == 'horizontal':
        ax.set_ylim(0, 24)
        ax.yaxis.set_major_locator(MaxNLocator(8))
        ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))),
                            "%H").strftime("%I %p")
                            for ts in ax.get_yticks()])
class TriplePlot:
    def __init__(self):
        gs = gridspec.GridSpec(6, 6)
        self.ax1 = plt.subplot(gs[2:6, :4])
        self.ax2 = plt.subplot(gs[2:6, 4:6], sharey=self.ax1)
        plt.setp(self.ax2.get_yticklabels(), visible=False)
        self.ax3 = plt.subplot(gs[:2, :4])
        plt.setp(self.ax3.get_xticklabels(), visible=False)

    def plot(self, df, color='darkblue', alpha=0.8, markersize=0.5,
             yr_bin=0.1, hr_bin=0.5):
        # plot_todo_vs_year and plot_number_perday_per_year are companion
        # helpers (time-of-day vs year scatter, and emails per day per year)
        # defined analogously to plot_number_perdhour_per_year
        plot_todo_vs_year(df, self.ax1, color=color, s=markersize)
        plot_number_perdhour_per_year(df, self.ax2, dt=hr_bin, color=color,
                                      alpha=alpha, orientation='horizontal')
        self.ax2.set_xlabel('Average emails per hour')
        plot_number_perday_per_year(df, self.ax3, dt=yr_bin, color=color,
                                    alpha=alpha)
        self.ax3.set_ylabel('Average emails per day')
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

plt.figure(figsize=(12, 12))
tpl = TriplePlot()
counts = dfs.dayofweek.value_counts(sort=False)
counts.plot(kind='bar')
addrs = received['from'].value_counts()
addrs[0:4]
plt.figure(figsize=(12, 12))
tpl = TriplePlot()
labels = []
colors = ['C{}'.format(ii) for ii in range(9)]
idx = np.array([1, 2, 3, 7])
for ct, addr in enumerate(addrs.index[idx]):
    tpl.plot(dfs[dfs['from'] == addr], color=colors[ct], alpha=0.3,
             yr_bin=0.5, markersize=1.0)
    labels.append(mpatches.Patch(color=colors[ct], label=addr, alpha=0.5))
plt.legend(handles=labels, bbox_to_anchor=[1.4, 0.9], fontsize=12,
           shadow=True)
plt.figure(figsize=(8, 5))
ax = plt.subplot(111)
for ct, dow in enumerate(dfs['dayofweek'].cat.categories):
    df_r = received[received['dayofweek'] == dow]
    weights = np.ones(len(df_r)) / len(received)
    wfun = lambda x: weights
    plot_number_perdhour_per_year(df_r, ax, dt=1, smooth=True, color=f'C{ct}',
                                  alpha=0.8, lw=3, label=dow, weight_fun=wfun)
    df_s = sent[sent['dayofweek'] == dow]
    weights = np.ones(len(df_s)) / len(sent)
    wfun = lambda x: weights
    plot_number_perdhour_per_year(df_s, ax, dt=1, smooth=True, color=f'C{ct}',
                                  alpha=0.8, lw=2, label=dow, ls='--', weight_fun=wfun)
ax.set_ylabel('Fraction of weekly emails per hour')
plt.legend(loc='upper left')

plt.figure(figsize=(25, 15))
plt.imshow(wordcloud, interpolation='bilinear')  # wordcloud built from the email text
plt.axis("off")
plt.margins(x=0, y=0)
Result:
Thus, the exploratory data analysis with the email data set was executed and verified.
Ex. 3 Working with NumPy arrays, Pandas data frames, basic plots
using Matplotlib
Aim:
To work with NumPy arrays, Pandas data frames and basic plots using Matplotlib.
Packages:
#Numpy:
# Import module
import numpy as np
# Creating the array
sample_array_1 = np.array([[0, 4, 2]])
sample_array_2 = np.array([0.2, 0.4, 2.4])
# display data type
print("Data type of the array 1 :",sample_array_1.dtype)
print("Data type of array 2 :",sample_array_2.dtype)
#Pandas DataFrame:
Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns); data is
aligned in a tabular fashion in rows and columns. A DataFrame consists of three
principal components: the data, the rows, and the columns.
Row selection: Pandas provides unique methods to retrieve rows from a DataFrame.
The loc[] method retrieves rows by label, and rows can also be selected by passing
an integer location to the iloc[] method.
import pandas as pd
# "nba.csv" is the sample dataset used in this example
data = pd.read_csv("nba.csv", index_col="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
This indexer selects data by the label of the rows and columns. The loc
indexer selects data in a different way than just the indexing operator: it can select
subsets of rows or columns, and it can also simultaneously select subsets of rows and
columns.
import pandas as pd
data = pd.read_csv("nba.csv", index_col="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
import pandas as pd
# making a data frame from a csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving rows by the iloc method
row2 = data.iloc[3]
print(row2)
#Matplotlib
Basic plots in matplotlib:
Matplotlib comes with a wide variety of plots. Plots help to understand trends and
patterns, and to make correlations; they are typically instruments for reasoning about
quantitative information.
Line Plot:
from matplotlib import pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.plot(x, y)
plt.show()
Bar plot:
from matplotlib import pyplot as plt
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.bar(x, y)
plt.show()
Histogram:
from matplotlib import pyplot as plt
y = [10, 5, 8, 4, 2]
plt.hist(y)
plt.show()
Scatter plot:
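No code accompanies this plot in the source; a minimal sketch in the same style as the plots above:

```python
from matplotlib import pyplot as plt

x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
# Draw one marker per (x, y) pair
plt.scatter(x, y)
plt.show()
```

Unlike plt.plot(), plt.scatter() draws unconnected markers, which makes it suitable for inspecting the relationship between two variables.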
Result:
Thus, working with NumPy arrays, Pandas data frames and basic
plots using Matplotlib was completed and verified.
Ex. 4 Explore various variables and row filters in R for cleaning data.
Apply various plot features in R on sample data sets and visualize
Aim:
To explore various variables and row filters in R for cleaning data, and to apply various
plot features in R on sample data sets and visualize them.
Algorithm:
• Assign numeric, integer, character, logical, and complex values to variables.
• Perform basic arithmetic operations such as addition, subtraction, multiplication,
division, exponentiation, and modulus.
• Create sequences using the : operator and the seq() function.
• Create vectors of different types including numeric, integer, character, and logical.
• Create matrices using the matrix() function.
• Combine vectors into matrices using cbind() function.
• Create lists containing different types of data.
• Create data frames using the data.frame() function.
• Read datasets from CSV files using the read.csv() function.
• Perform basic operations on datasets such as finding the mean of a column.
• Define functions without arguments and with arguments.
• Demonstrate the use of the return statement in functions.
• Plot different types of charts including bar plots, dot charts, pie charts, histograms, and
box plots using functions like plot(), barplot(), dotchart(), pie(), hist(), and boxplot().
Program:
# Basic Operations in R
a1 = 2 > 1    # greater than
a2 = 2 < 1    # less than
a3 = 2 <= 2   # less than or equal to
a4 = 2 >= 1   # greater than or equal to
a5 = 2 != 2   # not equal to
a6 = 12 == 12 # equal to
# sequence
# possibility 1
seq1 = 1:10
# possibility 2
seq2 = seq(1, 10)
#Data Structures
#Vectors
num_vec = c(12,21,12)
int_vec = c(12L,12L,12L)
char_vec = c('mohan','gopi','lucifer')
bool_vec = c(T,T,F)
mix_vec = c('mohan',T,2,2L)
#character,numeric,integer,logical
# matrix
matric = matrix(1:10, nrow = 2, ncol = 5)
#combine two vectors in matrices
c_bind = cbind(num_vec,char_vec)
#store different types of data
diff_data = list(c(1,2,3),'modafinil',1:10)
# data frame creation
plain_df = data.frame(names = c('mohan', 'gopi'),
                      id = c(2, 12))   # 'id' is a placeholder column name
# read a dataset such as a CSV file
# dataset <- read.csv(file.choose())
dataset = read.csv("/Users/91965/Documents/supermarket_sales - Sheet1.csv")  # adjust to your file
# finding the mean of a column
col_mean = mean(dataset[, 7])
#conditional statement (if,else,elseif)
x=12
y=25
#if condition if(condition){expression}
if(x==y){
print("2nd year AI&DS are good")}
#if else condition if(condition){expression}else{expression}
if(x>=y){
print("hello world")
}else{
print("hello hell")
}
if(x>=y){print("parley")
}else if(x<=y){
print("kolo")
}
#for loop
#for(condition){expression}
for(i in 1:5){print("hell")
print("mohan")}
vec = c(21,12,12)
for(value in c_bind){print(value)}
#while loop
#while (condition){expression}
value =1
while (value<5){print("hello")
value = value +1}
#infinity while loop
#functions
#without argument
m = function(){print("mohan")}
#with argument
g = function(arg){arg**2}
l = function(arg1,arg2){
loc = arg1+arg2
return(loc)
}
# Various plots in R
# bar plot
# plot(iris$Sepal.Length, iris$Sepal.Width, type = "l")
barplot(iris$Sepal.Length, ylim = c(0, 8), xlab = "iris-sepal-length",
        col = c(2, 3, 7), main = "BAR PLOT")
# dot chart
dotchart(iris$Sepal.Width, ylim = c(0, 100), xlab = "iris-sepal-width",
         main = "DOTCHARTS", pch = 2)
# pie chart
pie(iris$Sepal.Length, col = rainbow(9), main = "PIE CHARTS")
# histogram
hist(iris$Petal.Width, xlab = "Petal-Width", main = "HISTOGRAM", col = rainbow(9))
# boxplot
boxplot(iris$Petal.Length, iris$Petal.Width,
        names = c("Petal-Length", "Petal-Width"),
        main = "BOX PLOT", col = rainbow(9))
Result:
Thus, the various variables and row filters in R for cleaning data were executed, and
various plot features in R were applied on sample data sets and visualized.
Ex. 5 Perform Time Series Analysis and apply the various visualization
techniques
Aim:
To perform Time Series Analysis and apply the various visualization techniques
Algorithm:
• Define a vector x containing weekly COVID-19 positive cases data.
• Import the lubridate library to utilize the decimal_date() function for date
conversion.
• Create a time series object mts using the ts() function with the appropriate start
date and frequency.
• Plot the time series graph using the plot() function with labels and title.
• Save the plot as a PNG file using the png() and dev.off() functions.
Program:
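The Program section is blank in the source; below is a minimal R sketch that follows the algorithm above. The weekly case counts in x and the start date are illustrative values, not real data:

```r
# Weekly COVID-19 positive cases (illustrative values)
x <- c(580, 7813, 28266, 59287, 75700, 87820, 95314, 126214,
       218843, 471497, 936851, 1508725, 2072113)

library(lubridate)   # for decimal_date()

# Weekly series starting from 22 January 2020 (365.25/7 observations per year)
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)

# Plot the time series graph
plot(mts, xlab = "Weekly Data", ylab = "Total Positive Cases",
     main = "COVID-19 Pandemic", col.main = "darkgreen")

# Save the plot as a PNG file
png(file = "timeSeries.png")
plot(mts, xlab = "Weekly Data", ylab = "Total Positive Cases",
     main = "COVID-19 Pandemic", col.main = "darkgreen")
dev.off()
```

decimal_date() converts the calendar start date into the fractional-year form that ts() expects for its start argument.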
Result:
Thus, Time Series Analysis was performed and the various
visualization techniques were applied.
Ex. 6 Perform Data Analysis and representation on a Map using various
Map data sets with Mouse Rollover effect, user interaction, etc
Aim:
To perform data analysis and representation on a map using various map datasets with
mouse rollover effect, user interaction, etc.
Program:
import folium
from folium import plugins
import pandas as pd
# Sample data
data = {
'City': ['New York', 'San Francisco', 'Chicago', 'Los Angeles'],
'Population': [8175133, 884363, 2716000, 3792621],
'Latitude': [40.7128, 37.7749, 41.8781, 34.0522],
'Longitude': [-74.0060, -122.4194, -87.6298, -118.2437],
}
df = pd.DataFrame(data)
Ex. 7 Build cartographic visualization for India using map data sets
Aim:
To build a cartographic visualization of India using GeoPandas.
Algorithm:
• Import the required libraries including numpy, pandas, [Link], seaborn, and
geopandas.
• Define the file path (fp) of the shapefile containing the polygon data of India.
• Use the gpd.read_file() function from GeoPandas to read the shapefile into a
GeoDataFrame (map_df).
• Use the plot() function on the GeoDataFrame (map_df) to plot the map of India.
• Set the figure size to (12, 12) to adjust the size of the plot.
Program:
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
fp = r'/kaggle/input/map-with-python/india_shapefile.shp'  # hypothetical filename; point this at your shapefile
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.head()
Plotting map of India
map_df.plot(figsize=(12,12))
Result:
Thus, the cartographic visualization for India is executed and verified.
Ex. 8 Perform EDA on Wine Quality Data Set.
Aim:
To perform exploratory data analysis (EDA) on the wine quality data set.
Algorithm:
• Use pd.read_csv() to read the CSV file containing the wine quality dataset into
a DataFrame (df).
• Print the first few rows of the DataFrame using df.head() to inspect the data.
• Print the shape, information, and descriptive statistics of the DataFrame using
df.shape, df.info(), and df.describe().
• Print the count of each unique value in the 'quality' column using
df['quality'].value_counts().
• Plot boxplots for each feature using sns.boxplot() in a loop over the columns.
• Plot distribution plots (histograms with kernel density estimation) for each
feature using sns.histplot() in a loop over the columns.
• Print the skewness and kurtosis of each feature using df.skew() and df.kurtosis().
Program:
sns.set_style('whitegrid')
df = pd.read_csv('winequality-red.csv')   # adjust the path to your copy of the dataset
l = df.columns.values
number_of_columns = 4
number_of_rows = len(l) // number_of_columns + 1
plt.figure(figsize=(2*number_of_columns, 5*number_of_rows))
for i in range(0, len(l)):
    plt.subplot(number_of_rows + 1, number_of_columns, i + 1)
    sns.boxplot(y=df[l[i]], color='green', orient='v')
plt.tight_layout()
plt.figure(figsize=(2*number_of_columns, 5*number_of_rows))
for i in range(0, len(l)):
    plt.subplot(number_of_rows + 1, number_of_columns, i + 1)
    sns.histplot(df[l[i]], kde=True)
print("Skewness \n ", df.skew())
print("\n Kurtosis \n ", df.kurtosis())
Result:
Thus, exploratory data analysis on the wine quality data set was executed and verified.
Aim:
To use a data set, apply the various EDA and visualization techniques, and
present an analysis report.
Algorithm:
• Import required libraries for EDA including Pandas, NumPy, Seaborn, and Matplotlib
for data manipulation, analysis, and visualization.
• Display the first and last 5 rows of the dataset to get a quick overview.
• Visualize outliers using boxplots for numerical features (e.g., Price, HP, Cylinders).
• Plot the number of cars by make using a bar plot to visualize the distribution of car
makes.
• Display the generated plots to analyze and interpret the data visually.
Program:
Importing the required libraries for EDA
import pandas as pd
import numpy as np
import seaborn as sns #visualization
import matplotlib.pyplot as plt  # visualization
%matplotlib inline
sns.set(color_codes=True)
df = pd.read_csv("../input/cardataset/data.csv")  # adjust the filename to your copy
# To display the top 5 rows
df.head(5)
# To display the bottom 5 rows
df.tail(5)
df.dtypes
df.shape
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
df.count()
df = df.drop_duplicates()
df.head(5)
df.count()
# Dropping the missing or null values
print(df.isnull().sum())
df = df.dropna()  # Dropping the missing values.
df.count()
print(df.isnull().sum())
Detecting Outliers
sns.boxplot(x=df['Price'])
sns.boxplot(x=df['HP'])
sns.boxplot(x=df['Cylinders'])
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
plt.figure(figsize=(10, 5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
Scatterplot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()
Result:
Thus, the various EDA and visualization techniques were applied
and the analysis report was presented.
Ex. 9 TIME SERIES ANALYSIS
Aim:
To demonstrate basic time series analysis and visualization techniques using synthetic time
series data. The program generates synthetic time series data, performs basic analysis to understand
the data's structure, and visualizes the data using line plots, histograms, and trend analysis.
Algorithm:
• Define the date range for the time series data, specifying the start date, end date, and frequency
of observations.
• Generate random values to represent the time series data. In this case, we'll use a random walk
pattern where each value is the cumulative sum of random Gaussian noise.
• Create a DataFrame to store the synthetic time series data, with columns for dates and
corresponding values.
• Set the date column as the index of the DataFrame to facilitate time-based indexing and
visualization.
• Print the first few rows of the synthetic dataset to provide an overview of the data structure.
• Plot the synthetic time series data using Matplotlib, with dates on the x-axis and corresponding
values on the y-axis.
• Visualize the distribution of values in the synthetic time series data using a histogram. This plot
helps understand the spread and shape of the data distribution.
• Compute the rolling mean of the time series data to visualize the trend over time.
• Plot the original time series data along with the rolling mean trend line to observe any
underlying trends.
• Display the generated plots to the user for visual analysis and interpretation.
PROGRAM :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate synthetic time series data
np.random.seed(0)
date_range = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D')
values = np.random.randn(len(date_range)).cumsum()
df = pd.DataFrame({'Date': date_range, 'Value': values})
# Set the date column as the index
df.set_index('Date', inplace=True)
# Display the first few rows of the dataset
print("Sample Data:")
print(df.head())
# Plot the time series
df['Value'].plot(figsize=(10, 5), title='Synthetic Time Series')
plt.show()
# Distribution of the values
sns.histplot(df['Value'], kde=True)
plt.show()
# Rolling mean to visualize the trend
df['Value'].plot(figsize=(10, 5), label='Original')
df['Value'].rolling(window=30).mean().plot(label='30-day rolling mean')
plt.legend()
plt.show()
RESULT:
Thus, the time series analysis and visualization were performed on the synthetic time series data.