AD3301-Data Exploration and Visualization
List of Experiments
S.No  Date  Experiment Name  Page No  Marks
1. Installation of the data analysis and visualization tool Power BI
2. Implementation of exploratory data analysis (EDA) on email data set
3. Implementation of NumPy arrays, Pandas data frames, basic plots using Matplotlib
4. Implementation of various variable and row filters in R and various plot features in R
5. Implementation of Time Series Analysis and various visualization techniques
6. Implementation of Data Analysis and representation on a Map using various Map data sets
7. Implementation of cartographic visualization for multiple datasets
8. Implementation of EDA on Wine Quality Data Set
9. Case study on a data set: apply the various EDA and visualization techniques and present an analysis report
Total
Staff Incharge
Ex.No: 01  Installation of the Data Analysis and Visualization Tool Power BI
AIM
ALGORITHM
PROGRAM
1. Download and Install Power BI Desktop
1. Download Power BI Desktop:
o Go to the Power BI Desktop download page.
o Click on “Download Free” to start the download.
2. Run the Installer:
o Locate the downloaded installer file (usually in your
Downloads folder).
o Double-click the file to start the installation process.
o Follow the on-screen instructions to complete the installation.
You may need to accept the license agreement and choose an
installation location.
3. Launch Power BI Desktop:
o After installation, open Power BI Desktop from the Start menu
or desktop shortcut.
2. Initial Configuration
1. Configure Initial Settings:
o When Power BI Desktop opens for the first time, you may be
prompted to sign in. You can sign in with a Microsoft account,
but this is optional for local use.
o Choose any initial settings based on your preferences or leave
them as defaults.
3. Import Data
1. Load Data:
o Click on the “Home” tab in the ribbon.
o Click “Get Data” to open the data source options.
o Choose the type of data source (e.g., Excel, CSV, SQL Server)
and follow the prompts to connect to your data source.
2. Transform and Clean Data:
o After importing the data, you may need to perform data
cleaning or transformation.
o Use the “Transform Data” button to open Power Query Editor,
where you can clean and transform the data (e.g., remove
duplicates, filter rows).
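For comparison with the scripted experiments later in this record, the two Power Query operations named above (remove duplicates, filter rows) can be sketched in pandas. This is only an illustration: the `sales` table and its column names are hypothetical, and Power Query itself performs these steps through its own interface rather than through this code.

```python
import pandas as pd

# Hypothetical sample table; the column names are illustrative only.
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "region": ["North", "South", "South", "East", "North"],
    "amount": [120.0, 80.0, 80.0, 0.0, 200.0],
})

# "Remove duplicates": drop rows that repeat an earlier row exactly.
deduplicated = sales.drop_duplicates()

# "Filter rows": keep only orders with a positive amount.
cleaned = deduplicated[deduplicated["amount"] > 0].reset_index(drop=True)

print(cleaned)
```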
4. Create Visualizations
1. Add Visualizations:
o Once your data is loaded and cleaned, you can start creating
visualizations.
o In the “Report” view, drag and drop fields from your data onto
the report canvas.
o Choose from various visualization types (e.g., bar charts, line
charts, pie charts) from the “Visualizations” pane.
2. Customize Visualizations:
o Click on a visualization to configure its properties (e.g., axis
titles, colors).
o Use the “Format” pane to adjust visual settings.
3. Arrange Visualizations:
o Arrange and resize visualizations on the canvas to create a
meaningful report layout.
5. Publish and Share Reports
1. Publish to Power BI Service (Optional):
o Click the “Publish” button on the “Home” tab.
o Sign in with your Microsoft account if required.
o Choose a workspace in Power BI Service where you want to
publish the report.
2. Share Reports:
o After publishing, you can share the report with others by
providing them access through the Power BI Service.
o Use sharing options available in Power BI Service to control
who can view or interact with the report.
RESULT
Ex.No: 02  Implementation of exploratory data analysis (EDA) on email data set
AIM
ALGORITHM
PROGRAM
import pandas as pd
# Load the email dataset
df = pd.read_csv('emails.csv')
# Display the first few rows of the dataset
print(df.head())
# Display basic information about the dataset
print(df.info())
# Check for missing values
print(df.isnull().sum())
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Fill missing values in the 'Sender' and 'Subject' columns, if any
# (assign the result back; inplace=True on a single column is unreliable in recent pandas)
df['Sender'] = df['Sender'].fillna('Unknown')
df['Subject'] = df['Subject'].fillna('No Subject')
# Check for any remaining missing values
print(df.isnull().sum())
import matplotlib.pyplot as plt
# Plot the number of emails over time
plt.figure(figsize=(10, 6))
df['Date'].groupby(df['Date'].dt.to_period('M')).count().plot(kind='bar')
plt.title('Number of Emails Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Emails')
plt.show()
# Top 10 senders
top_senders = df['Sender'].value_counts().head(10)
# Plotting the top senders
plt.figure(figsize=(10, 6))
top_senders.plot(kind='bar', color='skyblue')
plt.title('Top 10 Senders')
plt.xlabel('Sender')
plt.ylabel('Number of Emails')
plt.show()
from wordcloud import WordCloud
# Generate a word cloud from the subject lines
# astype(str) guards against non-string subjects when joining
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(' '.join(df['Subject'].astype(str)))
# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Email Subjects')
plt.show()
# Extract the hour from the 'Date' column
df['Hour'] = df['Date'].dt.hour
# Plot the number of emails by hour
plt.figure(figsize=(10, 6))
df['Hour'].value_counts().sort_index().plot(kind='bar', color='orange')
plt.title('Emails Received by Hour of the Day')
plt.xlabel('Hour')
plt.ylabel('Number of Emails')
plt.show()
# Calculate the length of each email body
df['Body_Length'] = df['Body'].apply(lambda x: len(str(x).split()))
# Plot the distribution of email lengths
plt.figure(figsize=(10, 6))
df['Body_Length'].plot(kind='hist', bins=30, color='green')
plt.title('Distribution of Email Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Number of Emails')
plt.show()
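The program above assumes that emails.csv has the columns Date, Sender, Subject and Body. As a minimal sketch of how the cleaning and counting steps behave, the following uses a small in-memory table in place of the file. All rows and addresses are hypothetical, and the plots are omitted so that only the intermediate values are shown.

```python
import pandas as pd

# Hypothetical sample rows standing in for emails.csv.
sample = pd.DataFrame({
    "Date": ["2023-01-05 09:15:00", "2023-01-20 14:30:00",
             "not a date", "2023-02-02 09:45:00"],
    "Sender": ["alice@example.com", None, "bob@example.com", "alice@example.com"],
    "Subject": ["Meeting", "Invoice", None, "Meeting notes"],
    "Body": ["See you at ten", "Invoice attached", "", "Notes from the meeting"],
})

# Unparseable dates become NaT instead of raising an error.
sample["Date"] = pd.to_datetime(sample["Date"], errors="coerce")

# Fill missing senders and subjects by assigning the result back.
sample["Sender"] = sample["Sender"].fillna("Unknown")
sample["Subject"] = sample["Subject"].fillna("No Subject")

# Count emails per month, using only rows with a valid date.
valid = sample.dropna(subset=["Date"])
emails_per_month = valid.groupby(valid["Date"].dt.to_period("M")).size()

# Word count per body, the same measure as Body_Length in the program.
sample["Body_Length"] = sample["Body"].apply(lambda text: len(str(text).split()))

print(emails_per_month)
print(sample[["Sender", "Subject", "Body_Length"]])
```

The row with the invalid date stays in the table for the sender and subject analysis but is left out of the monthly counts.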
OUTPUT
RESULT
Ex.No: 03  Implementation of NumPy arrays, Pandas data frames, Basic plots using Matplotlib
AIM
ALGORITHM
PROGRAM
1. NumPy Arrays
# Install first from a terminal (not inside Python): pip install numpy
import numpy as np
# 1D Array
arr_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr_1d)
# 2D Array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr_2d)
# Array with specific shape and values
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
identity = np.eye(3)
print("Zeros Array:\n", zeros)
print("Ones Array:\n", ones)
print("Identity Matrix:\n", identity)
# Element-wise operations
arr_add = arr_1d + 10
arr_mult = arr_2d * 2
print("Array after addition:\n", arr_add)
print("Array after multiplication:\n", arr_mult)
# Matrix operations
matrix_mult = np.dot(arr_2d, np.array([[1, 0], [0, 1], [1, 0]]))
print("Matrix Multiplication:\n", matrix_mult)
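The difference between the element-wise operations and the matrix product used above comes down to shapes. The sketch below repeats the same two arrays under the illustrative names `a` and `b` and uses the `@` operator, which gives the same result as `np.dot` for 2D arrays.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)
b = np.array([[1, 0], [0, 1], [1, 0]])    # shape (3, 2)

# Element-wise: every entry is doubled and the shape stays (2, 3).
doubled = a * 2

# Matrix product: the inner dimensions (3 and 3) must match; the result is (2, 2).
product = a @ b

print(doubled.shape, product.shape)
print(product)
```

Each entry of `product` is a row of `a` multiplied term by term with a column of `b` and summed, for example 1*1 + 2*0 + 3*1 = 4 for the top-left entry.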
2. Pandas DataFrames
# Install first from a terminal (not inside Python): pip install pandas
import pandas as pd
# Creating DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("DataFrame:\n", df)
# Creating DataFrame from a CSV file
# df = pd.read_csv('file.csv')  # Uncomment and use if you have a CSV file
# Basic statistics
print("Statistics:\n", df.describe(include='all'))
# Accessing specific columns
print("Names:\n", df['Name'])
# Filtering rows
filtered_df = df[df['Age'] > 28]
print("Filtered DataFrame:\n", filtered_df)
# Adding a new column
df['Country'] = 'USA'
print("DataFrame with New Column:\n", df)
3. Basic Plots Using Matplotlib
# Install first from a terminal (not inside Python): pip install matplotlib
import matplotlib.pyplot as plt
# Line Plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Sine Wave', color='blue', linestyle='-')
plt.title('Line Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.legend()
plt.grid(True)
plt.show()
# Scatter Plot
x = np.random.rand(50)
y = np.random.rand(50)
plt.figure(figsize=(10, 6))
plt.scatter(x, y, color='red', alpha=0.5)
plt.title('Scatter Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.grid(True)
plt.show()
# Histogram
data = np.random.randn(1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# Bar Plot
categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.figure(figsize=(10, 6))
plt.bar(categories, values, color='green')
plt.title('Bar Plot')
plt.xlabel('Category')
plt.ylabel('Value')
plt.grid(True)
plt.show()
OUTPUT
RESULT
Ex.No: 04  Implementation of various variable and row filters in R and various plot features in R
AIM
ALGORITHM
PROGRAM
Step 1: Data Loading and Initial Inspection
# Load a sample dataset (e.g., the 'mtcars' dataset)
data(mtcars)
# View the first few rows of the dataset
head(mtcars)
# Get a summary of the dataset
summary(mtcars)
Step 2: Variable Filtering
# Select specific columns (e.g., 'mpg', 'hp', 'wt')
selected_vars <- mtcars[, c('mpg', 'hp', 'wt')]
# View the filtered dataset
head(selected_vars)
Step 3: Row Filtering
# Filter rows where mpg (miles per gallon) is greater than 20
filtered_rows <- mtcars[mtcars$mpg > 20, ]
# Filter rows where 'cyl' (cylinders) equals 4
filtered_rows_cyl <- mtcars[mtcars$cyl == 4, ]
# View the filtered dataset
head(filtered_rows)
head(filtered_rows_cyl)
Step 4: Data Cleaning
# Remove rows with missing values (if any)
cleaned_data <- na.omit(mtcars)
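# Rename columns for clarity (mtcars has 11 columns, so 11 names are required;
# 'Engine_Shape' names the 'vs' column)
colnames(cleaned_data) <- c('Miles_Per_Gallon', 'Cylinders', 'Displacement',
                            'Horsepower', 'Rear_Axle_Ratio', 'Weight',
                            'Quarter_Mile_Time', 'Engine_Shape', 'Transmission',
                            'Gears', 'Carburetors')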
# View the cleaned dataset
head(cleaned_data)
Step 5: Data Visualization
# Load necessary libraries
library(ggplot2)
# Scatter plot of Horsepower vs Miles Per Gallon
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  labs(title = "Horsepower vs Miles Per Gallon",
       x = "Horsepower", y = "Miles Per Gallon")
# Histogram of Miles Per Gallon
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = 'green', color = 'black') +
  labs(title = "Distribution of Miles Per Gallon",
       x = "Miles Per Gallon", y = "Frequency")
# Boxplot of Miles Per Gallon by Cylinder
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = 'orange', color = 'black') +
  labs(title = "Miles Per Gallon by Cylinder",
       x = "Number of Cylinders", y = "Miles Per Gallon")
OUTPUT
RESULT
Ex.No: 05  Implementation of Time Series Analysis and various visualization techniques
AIM
ALGORITHM
PROGRAM
# Install first from a terminal (not inside Python): pip install pandas numpy matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a time series dataset
date_rng = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
data = pd.DataFrame(date_rng, columns=['date'])
# Sine trend plus Gaussian noise; the parentheses let the expression span two lines
data['value'] = (np.sin(np.linspace(0, 10, len(date_rng)))
                 + np.random.normal(0, 0.5, len(date_rng)))
# Set the date column as the index
data.set_index('date', inplace=True)
print(data.head())
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Value')
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
# Compute rolling statistics
data['rolling_mean'] = data['value'].rolling(window=30).mean()
data['rolling_std'] = data['value'].rolling(window=30).std()
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Value', alpha=0.5)
plt.plot(data.index, data['rolling_mean'], label='Rolling Mean', color='red')
plt.plot(data.index, data['rolling_std'], label='Rolling Std Dev',
color='orange')
plt.title('Rolling Statistics')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(12, 6))
autocorrelation_plot(data['value'])
plt.title('Autocorrelation Plot')
plt.grid(True)
plt.show()
# Extract month from the index
data['month'] = data.index.month
# Plot by month
plt.figure(figsize=(12, 6))
for month in range(1, 13):
    monthly_data = data[data['month'] == month]
    plt.plot(monthly_data.index, monthly_data['value'], label=f'Month {month}')
plt.title('Seasonal Plot by Month')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
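The rolling statistics in the program above use a 30-day window, so the rolling mean and standard deviation curves only begin once a full window of data is available. A minimal sketch of that behaviour on a short, hypothetical series with a window of 3 (the `min_periods` variant is an optional alternative that the program does not use):

```python
import pandas as pd

# Hypothetical short series so the window effect is easy to see.
values = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0],
                   index=pd.date_range("2023-01-01", periods=5, freq="D"))

# With window=3, the first two positions lack a full window and are NaN.
rolling_mean = values.rolling(window=3).mean()

# min_periods=1 starts averaging as soon as one value is available.
rolling_mean_partial = values.rolling(window=3, min_periods=1).mean()

print(rolling_mean)
print(rolling_mean_partial)
```

With `window=30`, the same rule leaves the first 29 days of the rolling curves empty in the plot.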
OUTPUT
RESULT
Ex.No: 06  Implementation of Data Analysis and representation on a Map using various Map data sets
AIM
ALGORITHM
PROGRAM
# Install first from a terminal (not inside Python): pip install folium geopandas plotly pandas
import folium
import geopandas as gpd
import pandas as pd
# Step 2: Load Geographic Data
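# Example dataset: Load a sample GeoDataFrame
# Note: gpd.datasets was removed in GeoPandas 1.0. On newer versions, download the
# Natural Earth 110m "Admin 0 - Countries" data and pass its path to gpd.read_file
# (its column names may differ from the ones used below).
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))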
# Load some data for analysis (e.g., population or any other metric)
# For this example, we'll use a simple dataset:
data = pd.DataFrame({
'iso_a3': ['USA', 'CAN', 'MEX'],
'population': [331002651, 37742154, 126190788]
})
# Merge the data with the GeoDataFrame
world = world.merge(data, how='left', on='iso_a3')
# Step 3: Create a map with Folium
m = folium.Map(location=[20, 0], zoom_start=2)
# Step 4: Add Mouse Rollover and User Interaction
# Add a Choropleth map
folium.Choropleth(
geo_data=world,
name="choropleth",
data=world,
columns=["iso_a3", "population"],
key_on="feature.properties.iso_a3",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Population by Country",
).add_to(m)
# Add GeoJson for mouse rollover effect
folium.GeoJson(
world,
name="Population Info",
tooltip=folium.GeoJsonTooltip(
fields=["name", "population"],
aliases=["Country", "Population"],
localize=True,
),
).add_to(m)
# Step 5: Save and Display the Map
m.save("interactive_map.html")
# Display map in Jupyter Notebook (if using Jupyter)
m
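The merge step in the program above uses `how='left'`, which keeps every country in the map data even when no population figure is supplied for it. A minimal sketch of that effect follows, using a plain pandas table as a hypothetical stand-in for the attribute columns of the world GeoDataFrame (the Brazil row is added only for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the attribute table of the world GeoDataFrame.
countries = pd.DataFrame({
    "iso_a3": ["USA", "CAN", "MEX", "BRA"],
    "name": ["United States of America", "Canada", "Mexico", "Brazil"],
})

# Population figures copied from the program above.
population = pd.DataFrame({
    "iso_a3": ["USA", "CAN", "MEX"],
    "population": [331002651, 37742154, 126190788],
})

# how='left' keeps every country; unmatched rows get NaN in 'population'.
merged = countries.merge(population, how="left", on="iso_a3")

# Countries that would appear without data on the choropleth.
unmatched = merged.loc[merged["population"].isna(), "name"].tolist()

print(merged)
print(unmatched)
```

Checking the unmatched list before plotting helps tell countries that truly have no data apart from ones whose codes simply failed to match.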
OUTPUT
RESULT
Ex.No: 07  Implementation of cartographic visualization for multiple datasets
AIM
ALGORITHM
PROGRAM
# Install first from a terminal (not inside Python): pip install geopandas folium matplotlib
import geopandas as gpd
import folium
import matplotlib.pyplot as plt
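# Load the world map dataset
# Note: gpd.datasets was removed in GeoPandas 1.0. On newer versions, download the
# Natural Earth 110m "Admin 0 - Countries" data and pass its path to gpd.read_file
# (its column names may differ from the ones used below).
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))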
# Display the first few rows
print(world.head())
# Plot the world map
world.plot(figsize=(15, 10), edgecolor='black')
plt.title('World Map')
plt.show()
# Create a base map centered around the world
m = folium.Map(location=[20, 0], zoom_start=2)
# Add countries as GeoJson to the map
folium.GeoJson(world).add_to(m)
# Save the map as an HTML file
m.save("world_map.html")
# Example: Adding population data to the world map
# Add a column to the world dataset for population (you may need to add real data here)
world['pop_est'] = world['pop_est'].fillna(0) # Fill NA values with 0
# Plot using matplotlib with color based on population
world.plot(column='pop_est', figsize=(15, 10), legend=True,
cmap='OrRd', edgecolor='black')
plt.title('World Population Map')
plt.show()
# Display in Jupyter Notebook (if using Jupyter)
m # m is the folium map object
OUTPUT
RESULT
Ex.No: 08  Implementation of EDA on Wine Quality Data Set
AIM
ALGORITHM
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the wine quality dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine_data = pd.read_csv(url, sep=';')
# Display the first few rows of the dataset
print(wine_data.head())
# Basic information about the dataset
print(wine_data.info())
# Summary statistics
print(wine_data.describe())
# Check for missing values
print(wine_data.isnull().sum())
# Distribution of wine quality
plt.figure(figsize=(8, 6))
sns.countplot(x='quality', data=wine_data, palette='viridis')
plt.title('Distribution of Wine Quality Ratings')
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = wine_data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Wine Quality Data')
plt.show()
# Pairplot of selected features
sns.pairplot(wine_data, vars=['fixed acidity', 'volatile acidity', 'citric acid',
'quality'], hue='quality', palette='viridis')
plt.show()
# Alcohol content vs quality
plt.figure(figsize=(10, 6))
sns.boxplot(x='quality', y='alcohol', data=wine_data, palette='viridis')
plt.title('Alcohol Content vs Wine Quality')
plt.show()
# Fixed acidity vs quality
plt.figure(figsize=(10, 6))
sns.boxplot(x='quality', y='fixed acidity', data=wine_data,
palette='viridis')
plt.title('Fixed Acidity vs Wine Quality')
plt.show()
# Example of handling imbalance (if necessary)
from sklearn.utils import resample
# Separate the minority and majority classes
df_majority = wine_data[wine_data.quality >= 6]
df_minority = wine_data[wine_data.quality < 6]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # to match majority class
                                 random_state=42)             # reproducible results
# Combine majority class with upsampled minority class
wine_data_balanced = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
print(wine_data_balanced['quality'].value_counts())
# 3D Scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(wine_data['sulphates'], wine_data['alcohol'],
wine_data['quality'], c=wine_data['quality'], cmap='viridis')
ax.set_xlabel('Sulphates')
ax.set_ylabel('Alcohol')
ax.set_zlabel('Quality')
plt.show()
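The upsampling step in the program above relies on `sklearn.utils.resample`. As a rough sketch of the same idea without scikit-learn, the following uses `DataFrame.sample` with replacement on a small, hypothetical set of quality scores. It also determines which group is the majority from the counts rather than assuming it; the names `scores` and `is_good` are illustrative.

```python
import pandas as pd

# Hypothetical quality scores; the real dataset has far more rows.
scores = pd.DataFrame({"quality": [5, 5, 5, 5, 5, 6, 6, 7, 4, 5]})

# Binary label: True when quality is 6 or higher.
scores["is_good"] = scores["quality"] >= 6

# Decide which label is the majority from the counts, not by assumption.
counts = scores["is_good"].value_counts()
majority_label = counts.idxmax()
majority = scores[scores["is_good"] == majority_label]
minority = scores[scores["is_good"] != majority_label]

# Upsample the minority with replacement until it matches the majority size.
upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, upsampled], ignore_index=True)

print(counts)
print(balanced["is_good"].value_counts())
```

Upsampling duplicates existing rows, so it balances the class counts without adding new information about the minority class.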
OUTPUT
RESULT
Ex.No: 09  Case study on a data set and apply the various EDA and visualization techniques and present an analysis report.
AIM
ALGORITHM
PROGRAM
The Titanic dataset is a classic dataset often used to demonstrate various
data analysis techniques. This dataset provides information on the
passengers aboard the Titanic, including whether they survived the
disaster, their age, gender, class, fare, and other details. This case study
will apply various EDA and visualization techniques to uncover insights
from this data and present an analysis report.
Dataset Overview
Dataset: Titanic passenger data.
Features:
o Survived: Survival (0 = No, 1 = Yes)
o Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
o Sex: Sex
o Age: Age in years
o SibSp: Number of siblings/spouses aboard the Titanic
o Parch: Number of parents/children aboard the Titanic
o Fare: Passenger fare
o Embarked: Port of Embarkation (C = Cherbourg; Q =
Queenstown; S = Southampton)
Step 1: Data Loading and Inspection
import pandas as pd
# Load the Titanic dataset
titanic_data = pd.read_csv(
    'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
# Display the first few rows of the dataset
print(titanic_data.head())
# Basic information about the dataset
print(titanic_data.info())
# Summary statistics
print(titanic_data.describe())
Observation:
The dataset has 891 entries and several features, some of which
have missing values (e.g., Age, Cabin, and Embarked).
Step 2: Data Cleaning
# Handle missing values by filling or dropping
# (assign the result back; inplace=True on a single column is unreliable in recent pandas)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])
# Drop Cabin due to too many missing values
titanic_data = titanic_data.drop(columns=['Cabin'])
# Verify missing values
print(titanic_data.isnull().sum())
Observation:
Missing values have been handled: Age filled with median,
Embarked filled with mode, and Cabin dropped.
Step 3: Univariate Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Plotting the distribution of survival
sns.countplot(x='Survived', data=titanic_data, palette='pastel')
plt.title('Survival Distribution')
plt.show()
# Distribution of Age
sns.histplot(titanic_data['Age'], bins=30, kde=True, color='blue')
plt.title('Age Distribution of Passengers')
plt.show()
Observation:
The majority of passengers did not survive.
Most passengers are between 20 and 40 years old.
Step 4: Bivariate Analysis
# Survival rate by class
sns.barplot(x='Pclass', y='Survived', data=titanic_data, palette='viridis')
plt.title('Survival Rate by Passenger Class')
plt.show()
# Survival rate by gender
sns.barplot(x='Sex', y='Survived', data=titanic_data, palette='viridis')
plt.title('Survival Rate by Gender')
plt.show()
# Age distribution by survival
sns.boxplot(x='Survived', y='Age', data=titanic_data, palette='pastel')
plt.title('Age Distribution by Survival')
plt.show()
Observation:
Passengers in 1st class had a higher survival rate.
Females had a significantly higher survival rate than males.
Younger passengers were more likely to survive.
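The bar plots above show group means of the 0/1 `Survived` column, and the same rates can be read off numerically with a `groupby`. This is a minimal sketch on a hypothetical eight-row table. The rows are made up for illustration and are not Titanic records, so the printed rates only demonstrate the calculation.

```python
import pandas as pd

# Hypothetical passengers; these rows are made up and are not Titanic records.
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_class = passengers.groupby("Pclass")["Survived"].mean()
rate_by_sex = passengers.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

Applied to `titanic_data`, the same two lines would give the exact rates behind the bar heights.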
Step 5: Multivariate Analysis
# Survival rate by class and gender
sns.catplot(x='Pclass', hue='Sex', col='Survived', kind='count',
data=titanic_data, palette='pastel')
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
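# numeric_only=True skips text columns such as Name, Sex and Ticket,
# which would otherwise raise an error in pandas 2.x
corr = titanic_data.corr(numeric_only=True)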
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Observation:
The survival rate is highest for females in 1st class.
There is a moderate negative correlation (roughly -0.34) between Pclass and Survived,
indicating that higher-class passengers were more likely to survive.
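A correlation matrix can only be computed over numeric columns, so text columns have to be excluded or encoded first. The sketch below shows both options on a hypothetical table. The rows are made up and are not Titanic records, and `Sex_code` is an illustrative name for the encoded column.

```python
import pandas as pd

# Hypothetical passengers; these rows are made up and are not Titanic records.
frame = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Option 1: restrict the correlation to numeric columns.
corr = frame.corr(numeric_only=True)

# Option 2: encode the text column as 0/1 so it can take part as well.
frame["Sex_code"] = (frame["Sex"] == "female").astype(int)
corr_with_sex = frame.corr(numeric_only=True)

print(corr)
print(corr_with_sex.loc["Sex_code", "Survived"])
```

A negative value for Pclass against Survived means that a larger class number (3rd class) goes with a lower survival rate.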
Insights
1. Class and Survival: Passengers in higher classes had better
survival chances, with 1st class being the safest.
2. Gender and Survival: Females had a much higher survival rate,
especially in the 1st and 2nd classes.
3. Age Factor: Younger passengers had a better survival rate, with
children particularly having a higher chance of survival.
4. Embarkation Point: Passengers who embarked at Cherbourg (C)
had a higher survival rate compared to other embarkation points.
5. Multivariate Interaction: The combination of gender, class, and
age played a crucial role in determining the survival of a passenger.
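Insights that combine two factors, or that involve a column not plotted earlier such as `Embarked`, can be checked with a pivot table and a `groupby`. The following is a minimal sketch on a hypothetical table. The rows are made up and are not Titanic records, so the numbers illustrate the method only.

```python
import pandas as pd

# Hypothetical passengers; these rows are made up and are not Titanic records.
passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Embarked": ["C", "C", "S", "S", "Q", "S", "S", "C"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate for every combination of sex (rows) and class (columns).
survival_table = passengers.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean")

# Survival rate by port of embarkation.
rate_by_port = passengers.groupby("Embarked")["Survived"].mean()

print(survival_table)
print(rate_by_port)
```

Running the same two statements on `titanic_data` would give the figures needed to support insights 2 and 4 in the analysis report.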
OUTPUT
RESULT