EX.NO:1
DATE:
Install the Data Analysis and Visualization Tool: R / Python / Tableau Public / Power BI
AIM:
To install and set up a data analysis and visualization environment using
Python (with Jupyter Notebook) or other tools such as R, Tableau Public, or
Power BI, enabling the user to perform data analysis, visualization, and basic
data science operations.
ALGORITHM / PROCEDURE:
A. Installation of Python (Anaconda Distribution)
1. Download Anaconda:
Visit https://www.anaconda.com/products/distribution
Choose the installer for your operating system (Windows / macOS /
Linux).
2. Install Anaconda:
Run the installer and follow on-screen instructions.
Optionally select the option to add Anaconda to the system PATH (the installer marks this as not recommended; Anaconda Navigator and the Anaconda Prompt work without it).
3. Launch Jupyter Notebook:
Open Anaconda Navigator → Click on Jupyter Notebook.
Create a new Python notebook (.ipynb).
4. Verify Installation:
Import key data science libraries such as NumPy, Pandas, and Matplotlib, as in the verification sketch below.
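A minimal verification sketch, assuming a default Anaconda install (the library names are standard; the printed version numbers will vary by distribution):

# Run in a new notebook cell to confirm the core libraries load.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)

# Quick end-to-end check: build a small Series and plot it.
pd.Series(np.random.randn(50).cumsum()).plot(title='Installation check')
plt.show()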
B. Installation of R (optional alternative)
1. Download and install R from https://cran.r-project.org/.
2. Download and install RStudio (IDE) from
https://posit.co/download/rstudio-desktop/.
3. Launch RStudio and test with a simple script.
C. Installation of Tableau Public / Power BI
Tableau Public: Download from https://public.tableau.com/en-us/s/ and
install.
Power BI: Download from Microsoft Store or
https://powerbi.microsoft.com/.
Load a sample dataset and verify visualization functionality.
RESULT:
The data analysis and visualization environment was successfully
installed and configured using Python (Anaconda Distribution).
Basic data science libraries such as NumPy, Pandas, Matplotlib, and
Seaborn were verified and tested with a sample visualization.
EX.NO:2
DATE:
Perform Exploratory Data Analysis (EDA) with an Email Dataset — Import, Visualize, and Derive Insights
AIM:
To perform Exploratory Data Analysis (EDA) on an email dataset by
importing exported email data into a Pandas DataFrame, cleaning and
exploring it, visualizing patterns, and deriving meaningful insights about the
data.
ALGORITHM / PROCEDURE:
Data Collection:
Export email data (e.g., from Gmail using Google Takeout or from
Outlook) in a .csv or .xlsx format.
The dataset may include columns like Sender, Receiver, Subject, Date,
Time, Message Length, Folder (Inbox/Sent/Spam), etc.
Import the Dataset:
Load the dataset into a Pandas DataFrame using
pd.read_csv() or pd.read_excel().
Data Cleaning:
Check for missing values, duplicates, and invalid entries.
Convert date/time columns to datetime format.
Exploratory Data Analysis (EDA):
Display dataset information (.info(), .describe()).
Check distributions and relationships between variables.
Analyze patterns such as:
Number of emails sent/received per day.
Most frequent senders/receivers.
Common keywords in subjects.
Visualization:
Use Matplotlib and Seaborn for data visualization.
Create plots such as:
Bar charts for most frequent senders.
Line charts for emails per day.
Pie chart for email categories.
Word cloud for subject keywords (optional).
Derive Insights:
Summarize findings from the EDA and visualizations (a short insights sketch follows the program below).
PROGRAM:
# Importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
# Step 1: Load the dataset
data = pd.read_csv('emails.csv')
# Step 2: Display basic info
print("Dataset Info:")
print(data.info())
print("\nFirst 5 Records:")
print(data.head())
# Step 3: Data Cleaning
data.drop_duplicates(inplace=True)
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')
# Step 4: Basic EDA
print("\nSummary Statistics:")
print(data.describe())
# Number of emails per day
emails_per_day = data['Date'].dt.date.value_counts().sort_index()
# Step 5: Visualization
# 1. Emails per day
plt.figure(figsize=(10,5))
emails_per_day.plot(kind='line', color='blue')
plt.title('Number of Emails per Day')
plt.xlabel('Date')
plt.ylabel('Count')
plt.grid(True)
plt.show()
# 2. Top 10 Senders
plt.figure(figsize=(10,5))
top_senders = data['From'].value_counts().head(10)
sns.barplot(x=top_senders.index, y=top_senders.values, palette='viridis')
plt.title('Top 10 Senders')
plt.xlabel('Sender')
plt.ylabel('Email Count')
plt.xticks(rotation=45)
plt.show()
# 3. Word Cloud for Subject Keywords
text = " ".join(str(subj) for subj in data['Subject'].dropna())
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Email Subjects')
plt.show()
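The procedure's final step, deriving insights, can also be scripted. A minimal sketch continuing the program above, assuming the same DataFrame and column names (Date, From, Subject):

# Step 6: Derive simple insights from the cleaned data
from collections import Counter

busiest_day = emails_per_day.idxmax()
print("Busiest day:", busiest_day, "with", emails_per_day.max(), "emails")
print("Most frequent sender:", data['From'].value_counts().idxmax())

# Most common subject words (crude whitespace tokenization; refine as needed)
words = " ".join(data['Subject'].dropna().astype(str)).lower().split()
print("Top subject words:", Counter(words).most_common(5))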
OUTPUT:
1. Dataset Info
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1000 non-null datetime64[ns]
1 From 1000 non-null object
2 To 1000 non-null object
3 Subject 980 non-null object
4 Folder 1000 non-null object
dtypes: datetime64[ns](1), object(4)
2. First 5 Records
Date From To Subject Folder
3. Summary Statistics
Summary Statistics:
Date
count 1000
unique 300
top 2023-07-15
freq 15
Name: Date, dtype: object
RESULT:
Exploratory Data Analysis (EDA) was successfully performed on the email
dataset.
Different visualizations helped identify:
The most frequent senders and active dates.
Distribution of emails over time.
Common keywords in email subjects.
This experiment demonstrates how EDA techniques reveal trends and
patterns in textual and time-based data.
EX.NO:3
DATE:
Working with NumPy Arrays, Pandas DataFrames, and Basic Plots using Matplotlib
AIM:
To understand and demonstrate the creation, manipulation, and analysis
of NumPy arrays and Pandas DataFrames, and to visualize data using basic
Matplotlib plots such as line plots, bar charts, histograms, and scatter plots.
ALGORITHM / PROCEDURE:
Import Required Libraries
Import the numpy, pandas, and matplotlib.pyplot libraries.
Create and Manipulate NumPy Arrays
Create 1D and 2D arrays using np.array().
Perform basic operations: addition, multiplication, slicing, and reshaping.
Create and Explore Pandas DataFrames
Create a DataFrame using a dictionary or CSV file.
Display rows and columns using .head(), .tail(), .info(),
.describe().
Perform column operations and basic statistical analysis.
Visualize Data Using Matplotlib
Create different types of plots:
Line plot
Bar chart
Histogram
Scatter plot
Add titles, labels, legends, and grid for better readability.
Interpret the Output
Analyze the visualizations and draw conclusions.
PROGRAM:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Part 1: Working with NumPy
# Creating arrays
arr1 = np.array([10, 20, 30, 40, 50])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("1D Array:", arr1)
print("2D Array:\n", arr2)
# Array operations
print("Array Sum:", arr1.sum())
print("Array Mean:", arr1.mean())
print("Array Slicing:", arr1[1:4])
print("Reshaped 2D Array:\n", arr2.reshape(3, 2))
# Part 2: Working with Pandas
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, 29],
    'Marks': [88, 92, 95, 70, 85]
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)
# Display basic info
print("\nDataFrame Info:")
print(df.info())
print("\nStatistical Summary:\n", df.describe())
# Part 3: Data Visualization using Matplotlib
# Line Plot
plt.figure(figsize=(6,4))
plt.plot(df['Name'], df['Marks'], marker='o', color='green')
plt.title('Line Plot - Student Marks')
plt.xlabel('Name')
plt.ylabel('Marks')
plt.grid(True)
plt.show()
# Bar Chart
plt.bar(df['Name'], df['Age'], color='orange')
plt.title('Bar Chart - Student Ages')
plt.xlabel('Name')
plt.ylabel('Age')
plt.show()
# Histogram
plt.hist(df['Marks'], bins=5, color='skyblue', edgecolor='black')
plt.title('Histogram - Marks Distribution')
plt.xlabel('Marks')
plt.ylabel('Frequency')
plt.show()
# Scatter Plot
plt.scatter(df['Age'], df['Marks'], color='red')
plt.title('Scatter Plot - Age vs Marks')
plt.xlabel('Age')
plt.ylabel('Marks')
plt.grid(True)
plt.show()
OUTPUT:
1. NumPy Array Output
1D Array: [10 20 30 40 50]
2D Array:
[[1 2 3]
[4 5 6]]
Array Sum: 150
Array Mean: 30.0
Array Slicing: [20 30 40]
Reshaped 2D Array:
[[1 2]
[3 4]
[5 6]]
2. Pandas DataFrame Output
Name Age Marks
0 Alice 24 88
1 Bob 27 92
2 Charlie 22 95
3 David 32 70
4 Eva 29 85
RESULT:
Successfully demonstrated the creation and manipulation of NumPy arrays and
Pandas DataFrames, along with visualization of data using Matplotlib.
The experiment highlights how Python’s core data science libraries enable
numerical computation, tabular data handling, and effective graphical
representation of information.
EX.NO:4
DATE:
Data Cleaning and Visualization: Exploring Filters and Plot Features in R
AIM:
To understand and demonstrate how to filter variables (columns) and rows in R
for data cleaning purposes, and to apply various plotting features on sample
datasets for visual analysis.
ALGORITHM / PROCEDURE:
Load Required Libraries
Use the tidyverse package for data manipulation (dplyr) and visualization
(ggplot2).
Load Sample Dataset
Use built-in datasets such as mtcars, iris, or diamonds.
Explore and Inspect Data
View the structure, summary, and first few rows of the dataset using functions like
head(), str(), summary().
Apply Variable (Column) Filters
Select specific columns using the select() function from dplyr.
Remove unnecessary columns for cleaning.
Apply Row Filters
Use the filter() function to include or exclude rows based on specific conditions.
Example: select only cars with mpg > 20 from the mtcars dataset.
Handle Missing or Invalid Data (if any)
Use na.omit() or is.na() to remove or identify missing values.
Visualization using ggplot2
Create various plots using ggplot():
Bar Plot
Histogram
Scatter Plot
Box Plot
Add titles, axis labels, and color themes for clarity.
Analyze and Interpret the Output (a Pandas comparison sketch follows this procedure).
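For cross-reference with the Python experiments above, the same column and row filtering can be sketched in Pandas on a small hand-built sample (the experiment itself uses R and dplyr; the sample values below mirror the first rows of mtcars):

import pandas as pd

# Tiny stand-in for the mtcars columns used in this experiment
cars = pd.DataFrame({
    'mpg': [21.0, 21.0, 22.8, 21.4, 18.7],
    'cyl': [6, 6, 4, 6, 8],
    'hp': [110, 110, 93, 110, 175],
    'wt': [2.620, 2.875, 2.320, 3.215, 3.440],
})

selected = cars[['mpg', 'cyl', 'hp', 'wt']]   # column filter, like dplyr::select()
filtered = selected[selected['mpg'] > 20]     # row filter, like dplyr::filter()
cleaned = filtered.dropna()                   # like na.omit()
print(cleaned)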
PROGRAM (R):
# Load required libraries
library(dplyr)
library(ggplot2)
# Step 1: Load sample dataset
data("mtcars")
# Step 2: Explore dataset
head(mtcars)
str(mtcars)
summary(mtcars)
# Step 3: Variable (Column) Filtering
selected_data <- select(mtcars, mpg, cyl, hp, wt)
print("Selected Columns:")
print(head(selected_data))
# Step 4: Row Filtering
filtered_data <- filter(selected_data, mpg > 20)
print("Filtered Rows where mpg > 20:")
print(filtered_data)
# Step 5: Handling Missing Values (if any)
cleaned_data <- na.omit(filtered_data)
# Step 6: Visualization
# (a) Scatter Plot - Horsepower vs. MPG
ggplot(cleaned_data, aes(x = hp, y = mpg)) +
geom_point(color = "blue", size = 3) +
ggtitle("Scatter Plot: Horsepower vs. Mileage (mpg)") +
xlab("Horsepower (hp)") +
ylab("Miles per Gallon (mpg)") +
theme_minimal()
# (b) Bar Plot - Count of Cylinders
ggplot(cleaned_data, aes(x = factor(cyl))) +
geom_bar(fill = "orange") +
ggtitle("Bar Plot: Number of Cars by Cylinders") +
xlab("Cylinders") +
ylab("Count") +
theme_bw()
# (c) Box Plot - Weight vs. MPG
ggplot(cleaned_data, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
geom_boxplot() +
ggtitle("Box Plot: Weight Distribution by Cylinders") +
xlab("Cylinders") +
ylab("Weight (wt)") +
theme_classic()
OUTPUT:
1. Data Inspection
'data.frame': 32 obs. of 11 variables:
$ mpg: Miles per gallon (numeric)
$ cyl: Number of cylinders (numeric)
$ disp: Displacement (numeric)
$ hp : Horsepower (numeric)
$ drat: Rear axle ratio (numeric)
$ wt : Weight (numeric)
2. Filtered Data Example
Selected Columns:
                   mpg cyl  hp    wt
Mazda RX4         21.0   6 110 2.620
Mazda RX4 Wag     21.0   6 110 2.875
Datsun 710        22.8   4  93 2.320
Hornet 4 Drive    21.4   6 110 3.215
Hornet Sportabout 18.7   8 175 3.440
Valiant           18.1   6 105 3.460
Filtered Rows (mpg > 20):
   mpg cyl  hp    wt
1 21.0   6 110 2.620
2 21.0   6 110 2.875
3 22.8   4  93 2.320
4 21.4   6 110 3.215
... (14 rows in total satisfy mpg > 20)
RESULT:
Successfully explored row and variable filtering techniques in R for data
cleaning using dplyr functions such as select() and filter().
Additionally, different visualization techniques were applied using ggplot2,
providing insights into relationships among key variables.
The experiment demonstrates the use of R as a powerful tool for data wrangling
and visualization.
EX.NO:5
DATE:
Perform Time Series Analysis and Apply Various Visualization Techniques
AIM:
To perform time series analysis on a dataset and apply various visualization
techniques to understand trends, seasonality, and patterns using R.
ALGORITHM :
Start the R environment (RStudio or R GUI).
Load the necessary libraries for time series analysis and visualization:
ggplot2 for plotting.
forecast for time series analysis and forecasting.
tseries for additional time series functions.
Import or load a time series dataset:
Use a built-in dataset like AirPassengers, or import your own dataset
using read.csv().
Convert the dataset into a time series object using the ts() function if it is
not already in time series format.
Display the dataset information:
Use functions such as head(), summary(), and str() to understand
the structure and statistics of the data.
Plot the original time series using the plot() function to visualize the overall
trend and fluctuations over time.
Decompose the time series into its components:
Use the decompose() or stl() functions to separate the data into
trend, seasonal, and random (residual) components (a Python comparison sketch follows this algorithm).
Plot the decomposed components to study their behavior.
Apply visualization techniques using ggplot2 and forecast package
functions:
autoplot() for enhanced plots.
ggseasonplot() to visualize seasonal patterns across years.
ggsubseriesplot() to identify monthly or periodic variations.
(Optional) Perform forecasting:
Use auto.arima() or ets() to fit forecasting models.
Predict future values using the forecast() function.
Visualize the forecast with autoplot().
Interpret the visualizations to identify:
Trends (increasing or decreasing behavior over time).
Seasonality (repeating patterns at regular intervals).
Residuals (random fluctuations).
Stop the program and record the observations and results.
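For cross-reference with the Python experiments above, the same decomposition idea can be sketched with pandas and statsmodels (a synthetic monthly series stands in for AirPassengers; the experiment itself uses R):

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly data with an upward trend and a mid-year seasonal peak
idx = pd.date_range('1949-01-01', periods=144, freq='MS')
values = [112 + 2 * i + (30 if i % 12 in (5, 6, 7) else 0) for i in range(144)]
series = pd.Series(values, index=idx)

# Separate trend, seasonal, and residual components, then plot all of them
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
plt.show()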
PROGRAM (in R):
# Load required libraries
library(ggplot2)
library(forecast)
library(tseries)
# Load a sample time series dataset
data("AirPassengers")
# Display dataset information
print(head(AirPassengers))
summary(AirPassengers)
# Basic time series plot
plot(AirPassengers,
main = "Monthly Air Passengers Data",
ylab = "Number of Passengers",
xlab = "Year",
col = "blue")
# Decompose the time series
decomposed <- decompose(AirPassengers)
plot(decomposed)
# Seasonal plot using ggplot2 (forecast package)
autoplot(AirPassengers) +
ggtitle("Time Series Plot of Air Passengers") +
xlab("Year") + ylab("Passengers")
# Seasonal pattern visualization
ggseasonplot(AirPassengers, year.labels = TRUE, year.labels.left = TRUE) +
ggtitle("Seasonal Plot: Air Passengers")
# Subseries plot
ggsubseriesplot(AirPassengers) +
ggtitle("Subseries Plot: Air Passengers")
# Optional: Forecasting
fit <- auto.arima(AirPassengers)
forecast_values <- forecast(fit, h = 12)
autoplot(forecast_values)
OUTPUT:
1. Display of Dataset (First Few Records):
> head(AirPassengers)
[1] 112 118 132 129 121 135
2. Summary of the Dataset:
> summary(AirPassengers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0
RESULT:
Time series analysis was successfully performed on the AirPassengers dataset.
The analysis revealed a strong upward trend and clear seasonal patterns in the
data.
Various visualization techniques such as decomposition, seasonal, and
forecasting plots effectively demonstrated these time-based behaviors.
EX.NO:6
DATE:
Perform Data Analysis and Representation on a Map Using Various Map Datasets with Mouse Rollover Effect and User Interaction
AIM:
To analyze spatial data and represent it visually on an interactive map using
R.
The experiment demonstrates mouse rollover effects, user interactions (zoom,
pan, popups), and data visualization using different map datasets.
ALGORITHM:
Start the R environment (RStudio or R GUI).
Install and load the required libraries:
leaflet – for interactive map visualization.
dplyr – for data manipulation.
sf – for handling spatial data.
maps or rnaturalearth – for world or country-level geographic data.
Load a map dataset:
Use built-in world/country shapefiles or import custom data using
st_read().
Example: Use rnaturalearth to get a world map or maps for U.S.
states.
Perform data analysis:
Analyze attributes such as population, density, or GDP associated with
geographic regions.
Use dplyr to summarize or group the data as needed.
Merge the geographic data with analytical results to create a data frame
containing spatial and statistical data.
Create an interactive map using leaflet:
Initialize the map with leaflet() and addTiles().
Add polygons or markers representing each region.
Use color palettes (colorNumeric or colorBin) to visualize data
intensity.
Add interactivity:
Use addPopups() or addLabelOnlyMarkers() for tooltips and
rollovers.
Add legends, layers, and zoom controls for better user experience.
Display the map:
Render the interactive map in the RStudio Viewer or a web browser (a Python comparison sketch follows this algorithm).
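For cross-reference with the Python experiments above, the same rollover-and-zoom interactivity can be sketched with folium, Python's wrapper around the same Leaflet library (hypothetical sample cities and density figures; the experiment itself uses R):

import folium

# Hypothetical sample data: (name, latitude, longitude, people per km²)
cities = [
    ('Delhi', 28.61, 77.21, 11300),
    ('Beijing', 39.90, 116.40, 1300),
    ('Sydney', -33.87, 151.21, 430),
]

m = folium.Map(location=[20, 0], zoom_start=2)   # world view; zoom and pan built in
for name, lat, lon, density in cities:
    folium.CircleMarker(
        location=[lat, lon],
        radius=8,
        fill=True,
        tooltip=f"{name}: {density} people/km²",  # shown on mouse rollover
    ).add_to(m)
m.save('map.html')  # open in a browser to zoom, pan, and hover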
PROGRAM (in R):
# Load necessary libraries
library(leaflet)
library(dplyr)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
# Load world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Create a sample dataset (population density; the Natural Earth table has no
# area column, so compute one from the geometry first)
data <- world %>%
  mutate(area_km2 = as.numeric(st_area(geometry)) / 1e6,
         pop_density = pop_est / area_km2)
# Define color palette based on population density
pal <- colorNumeric(palette = "YlOrRd", domain = data$pop_density)
# Create interactive map
leaflet(data = data) %>%
addTiles() %>%
addPolygons(
fillColor = ~pal(pop_density),
weight = 1,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 3,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE
),
label = ~paste0(name, ": ", round(pop_density, 2), " people/km²"),
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto"
)
) %>%
addLegend(
pal = pal,
values = ~pop_density,
opacity = 0.7,
title = "Population Density",
position = "bottomright"
)
OUTPUT:
1. Map Visualization Output:
o A world map appears in the RStudio Viewer or default web browser.
o Each country is filled with a color gradient representing population
density (calculated as population per square kilometer).
2. Color Coding:
o Countries with higher population density appear in dark red, while
those with lower density appear in light yellow.
3. Mouse Rollover Effect:
o When the mouse pointer is moved over a country:
The country region highlights with a bold border.
A tooltip label appears showing:
Country Name: Population Density
Example:
India: 420.52 people/km²
China: 380.11 people/km²
Australia: 3.24 people/km²
4. User Interaction Features:
The user can:
o Zoom in and out of the map using the zoom control buttons.
o Pan or drag the map to view different regions.
o Hover over countries to view their details dynamically.
5. Legend Output:
A color legend is displayed at the bottom right corner of the map.
It indicates how color intensity corresponds to population density:
Yellow → Low Density
Orange → Moderate Density
Red → High Density
Sample Console Output (if any):
> library(leaflet)
> library(rnaturalearth)
> library(dplyr)
> leaflet(data = data)
(the map widget renders in the RStudio Viewer; leaflet prints no further console output)
RESULT:
An interactive world map was successfully generated using the Leaflet library in
R.
The map displayed population density for each country with color-coded
visualization, along with mouse rollover effects, zooming, and user interaction
controls, enabling effective and interactive spatial data analysis.
EX.NO:7
DATE:
Perform Exploratory Data Analysis (EDA) on the Wine Quality Dataset
AIM:
To perform Exploratory Data Analysis (EDA) on the Wine Quality dataset to
understand its structure, identify patterns, detect missing values, and analyze
relationships between various chemical properties and wine quality using R.
ALGORITHM:
Start RStudio or R environment.
Load required libraries:
ggplot2 for visualization.
dplyr for data manipulation.
corrplot for correlation analysis.
readr for reading CSV data.
Import the dataset:
Load the Wine Quality Dataset (e.g., winequality-red.csv or
winequality-white.csv) using read.csv() or
readr::read_csv().
Inspect the dataset:
Use functions such as head(), str(), summary(), and dim() to
understand data structure and summary statistics.
Check for missing values:
Use sum(is.na(data)) to find missing or null values.
Perform univariate analysis:
Plot histograms, boxplots, or density plots for numerical variables to study
their distributions.
Perform bivariate analysis:
Use scatter plots and correlation matrices to analyze relationships between
predictors and wine quality.
Calculate correlation matrix:
Compute correlations using cor() and visualize them using corrplot() (compare the Pandas sketch after this algorithm).
Perform feature analysis:
Identify features most strongly correlated with quality.
Draw inferences and observations based on the visualizations and statistical
results.
End.
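For cross-reference with the Python experiments above, the first inspection steps of this EDA can be sketched in Pandas as well (assumes the UCI red-wine file, which is semicolon-separated; the experiment itself uses R):

import pandas as pd

wine = pd.read_csv('winequality-red.csv', sep=';')

print(wine.shape)                # expect (1599, 12) for the red-wine file
print(wine.isna().sum().sum())   # total missing values (0 for this dataset)
# Features most strongly correlated with quality, strongest first
print(wine.corr()['quality'].sort_values(ascending=False))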
PROGRAM (in R):
# Load required libraries
library(ggplot2)
library(dplyr)
library(corrplot)
library(readr)
# Import the dataset (the UCI copy of this file is semicolon-separated; in
# that case use read_delim("winequality-red.csv", delim = ";") instead)
wine <- read_csv("winequality-red.csv")
# View the structure and summary of the dataset
str(wine)
summary(wine)
dim(wine)
# Check for missing values
sum(is.na(wine))
# Univariate Analysis
ggplot(wine, aes(x = quality)) +
geom_bar(fill = "skyblue") +
ggtitle("Distribution of Wine Quality") +
xlab("Quality") + ylab("Count")
# Boxplot for alcohol vs quality
ggplot(wine, aes(x = as.factor(quality), y = alcohol, fill = as.factor(quality))) +
geom_boxplot() +
ggtitle("Alcohol Content vs Wine Quality") +
xlab("Wine Quality") + ylab("Alcohol (%)")
# Correlation matrix
corr_matrix <- cor(wine %>% select(-quality))
corrplot(corr_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45)
# Scatter plot: alcohol vs density
ggplot(wine, aes(x = alcohol, y = density, color = as.factor(quality))) +
geom_point(alpha = 0.6) +
ggtitle("Alcohol”)
OUTPUT:
1. Dataset Summary:
> dim(wine)
[1] 1599 12
> head(wine)
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides ...
7.4 0.70 0.00 1.9 0.076
...
2. Missing Values Check:
> sum(is.na(wine))
[1] 0
RESULT:
Exploratory Data Analysis (EDA) was successfully performed on the Wine
Quality Dataset using R.
The analysis revealed that alcohol, sulphates, and citric acid are positively
correlated with wine quality, while volatile acidity negatively
affects quality.
The dataset is clean, with no missing values, and visualizations effectively
represent important data patterns.