Ex.no: 01 Install the Data Analysis and Visualization Tool: R
Date:
Aim:
Install R, a data analysis and visualization tool, to leverage its capabilities for
statistical analysis and graphical representation of data.
Procedure:
1. Download and install the latest version of R from the official R website.
2. Optionally, install an integrated development environment (IDE) such as
RStudio for a user-friendly interface.
3. Use the R console or an IDE to execute R scripts and commands for data
analysis and visualization.
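One way to confirm that step 1 succeeded is to check that the Rscript command-line front end is reachable. The sketch below is only an illustration and assumes R was added to the system PATH during installation; the helper names find_r_executable and r_version are hypothetical.

```python
import shutil
import subprocess


def find_r_executable():
    """Return the full path to Rscript if it is on PATH, otherwise None."""
    return shutil.which("Rscript")


def r_version(rscript_path):
    """Ask Rscript for its version string.

    Depending on the R release, the text is written to stdout or stderr,
    so both streams are checked.
    """
    completed = subprocess.run(
        [rscript_path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (completed.stdout or completed.stderr).strip()


rscript = find_r_executable()
if rscript is None:
    print("Rscript not found on PATH; R may be missing or PATH may need updating.")
else:
    print(r_version(rscript))
```

If the message about PATH appears even though R is installed, reopening the terminal or adding the R bin directory to PATH is usually the next thing to try.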
Installation/Output:
Result:
Access to a powerful data analysis and visualization tool, enabling the
exploration and representation of data through statistical methods and graphs
in the R environment.
Ex.no: 02 Perform exploratory data analysis (EDA) with
email data set.
Date:
Aim:
To perform exploratory data analysis (EDA) on a dataset such as an email data
set: export all your emails as a dataset, import them into a pandas DataFrame,
visualize them, and draw insights from the data.
Procedure:
Exporting Email Data:
Export your email data in a suitable format (e.g., CSV, JSON, or plain text).
Importing Data into Pandas:
Use Pandas to import your email data into a DataFrame. You can use
pd.read_csv(), pd.read_json(), or other relevant functions depending on the data
format.
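As a minimal sketch of the import step, the snippet below reads a small CSV export from an in-memory string. The column names (date, sender, recipient, subject) and the example.com addresses are assumptions made for illustration; a real export will have its own layout.

```python
import io

import pandas as pd

# Hypothetical export: three emails in CSV form (column names are assumptions).
csv_text = """date,sender,recipient,subject
2024-01-05 09:15:00,alice@example.com,bob@example.com,Meeting
2024-01-05 10:02:00,bob@example.com,alice@example.com,Re: Meeting
2024-01-06 14:30:00,carol@example.com,alice@example.com,Report
"""

# read_csv accepts a file path or any file-like object;
# parse_dates converts the named column to datetime values.
emails = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(emails.dtypes)
print(emails.head())
```

With a file on disk, the io.StringIO(...) argument would simply be replaced by the path to the exported file.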
Exploratory Data Analysis:
Start by examining the basic properties of the dataset. Use functions like info(),
head(), and describe() to get an overview.
Data Cleaning:
Clean the data by handling missing values, removing duplicates, and addressing
any other data quality issues.
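The cleaning step might look like the following sketch, which counts missing values, drops rows without a sender, and removes exact duplicates. The sample rows are hypothetical, and which columns must be non-null is a choice that depends on the analysis.

```python
import pandas as pd

# Hypothetical sample with one duplicate row and one missing sender.
emails = pd.DataFrame({
    "sender": ["alice@example.com", "bob@example.com", "bob@example.com", None],
    "subject": ["Meeting", "Re: Meeting", "Re: Meeting", "Report"],
})

# Missing values per column, useful for deciding what to drop or fill.
print(emails.isna().sum())

cleaned = (
    emails.dropna(subset=["sender"])  # drop rows that have no sender
          .drop_duplicates()          # drop rows identical in every column
          .reset_index(drop=True)     # renumber the remaining rows from 0
)
print(cleaned)
```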
Visualization:
Utilize libraries like Matplotlib or Seaborn to create visualizations. For email
data, you might want to create:
- Histograms of email frequencies over time.
- Bar charts showing the most frequent senders or recipients.
- Word clouds to identify common words or phrases in emails.
- Time series plots for email activity.
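Most of these charts start from a simple count. The sketch below, using hypothetical timestamps and senders, prepares the daily email counts and the sender ranking that a histogram, time series plot, or bar chart would then display.

```python
import pandas as pd

# Hypothetical sample: six emails over four days.
emails = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-05 09:15", "2024-01-05 10:02", "2024-01-06 14:30",
        "2024-01-08 08:00", "2024-01-08 17:45", "2024-01-08 18:10",
    ]),
    "sender": [
        "alice@example.com", "bob@example.com", "alice@example.com",
        "carol@example.com", "alice@example.com", "bob@example.com",
    ],
})

# Emails per calendar day; days without mail appear with a count of 0.
per_day = emails.set_index("date").resample("D").size()

# Senders ranked by number of emails sent.
top_senders = emails["sender"].value_counts().head(3)

print(per_day)
print(top_senders)
```

Either result can be passed to Matplotlib through the pandas plotting interface, for example per_day.plot() for a time series or top_senders.plot(kind="bar") for a bar chart.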
Program:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
"Sender": ["[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"],
"Receiver": ["[email protected]", "[email protected]",
"[email protected]", "[email protected]", "[email protected]"],
"Subject": ["Meeting", "Meeting", "Meeting", "Meeting", "Meeting"],
"Body": ["3pm Yes, Let's meet at 4", "OK, 3 pm works for me",
"OK, let's meet at 5", "Let's meet?", "Sure, I'll be there"],
}
# Create a DataFrame
df = pd.DataFrame(data)
# Count the number of emails sent by each sender
sender_counts = df["Sender"].value_counts()
# Plot the counts
sender_counts.plot(kind="bar")
plt.xlabel("Sender")
plt.ylabel("Number of emails")
plt.title("Number of emails sent by each sender")
plt.show()
Output:
Result:
Thus, the given program to perform exploratory data analysis (EDA) on an email
data set has been executed successfully.
Ex.no:03 Working with NumPy arrays, Pandas data frames,
Basic plots using Matplotlib
Date:
Aim:
To write a Python program that works with NumPy arrays and Pandas data
frames and draws basic plots using Matplotlib.
Procedure:
1. Start the program.
2. Import the libraries NumPy, Pandas, and Matplotlib.
3. Create an array and a DataFrame.
4. Plot the values of the array.
5. Stop.
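Before plotting, it can help to see how an array and a DataFrame relate. The following sketch, with illustrative values only, shows element-wise array arithmetic, building a DataFrame from arrays, and selecting rows by a condition.

```python
import numpy as np
import pandas as pd

# A NumPy array supports element-wise (vectorized) arithmetic.
values = np.arange(1, 6)   # array([1, 2, 3, 4, 5])
squares = values ** 2      # each element squared

# Arrays become DataFrame columns directly.
df = pd.DataFrame({"Values": values, "Squares": squares})

# Boolean indexing keeps only the rows that satisfy the condition.
large = df[df["Squares"] > 5]

print(df)
print("Mean of Values:", values.mean())
print("Sum of Squares:", df["Squares"].sum())
print(large)
```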
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Creating a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Creating a Pandas DataFrame
df = pd.DataFrame({'Values': data})
# Plotting the data
plt.plot(df['Values'])
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Simple Line Plot')
plt.show()
Output:
Result:
Thus, the given program using NumPy arrays, Pandas data frames, and
basic Matplotlib plots has been executed successfully.
Ex.no:04 Explore various variable and row filters in R for
cleaning data
Date:
Aim:
To explore various variable and row filters in R for cleaning data. Apply
various plot features in R on sample data sets and visualize.
Algorithm:
1. Start the program.
2. Declare a dataset with several variables.
3. Clean the given dataset.
4. Visualize the variables using the ggplot2 library or base R
graphics.
5. Stop.
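For readers who also work in Python, the same cleaning steps can be sketched with pandas. This is an illustrative equivalent only, not part of the R exercise; it mirrors the sample data frame used in the program and the variable names are chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Same sample data as the R data frame (NA becomes NaN).
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 32, 29, np.nan, 27],
    "Score": [85, 92, 78, 64, 89],
})

# Drop rows with missing values, then drop duplicate rows.
cleaned = data.dropna().drop_duplicates()

# Row filter: keep rows where Age is greater than 30.
older = cleaned[cleaned["Age"] > 30]

# Variable (column) filter: keep only two of the columns.
name_score = cleaned[["Name", "Score"]]

mean_age = data["Age"].mean()          # NaN is skipped by default
median_score = data["Score"].median()
print("Mean Age:", mean_age)
print("Median Score:", median_score)
```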
Program:
# Sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, 32, 29, NA, 27),
Score = c(85, 92, 78, 64, 89)
)
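# Remove rows with missing values
data_cleaned <- na.omit(data)
# Remove duplicate rows from the cleaned data
data_cleaned <- unique(data_cleaned)
# Row filter: keep only rows where Age is greater than 30
data_filtered <- data_cleaned[data_cleaned$Age > 30, ]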
library(ggplot2)
# Scatter plot
ggplot(data, aes(x = Age, y = Score)) +
geom_point() +
labs(x = "Age", y = "Score") +
ggtitle("Scatter Plot of Age vs. Score")
# Mean Age
mean_age<- mean(data$Age, na.rm = TRUE)
cat("Mean Age:", mean_age, "\n")
# Median Score
median_score<- median(data$Score)
cat("Median Score:", median_score, "\n")
# Histogram of Age
ggplot(data, aes(x = Age)) +
geom_histogram() +
labs(x = "Age", y = "Frequency") +
ggtitle("Histogram of Age")
Output:
Warning message:
Removed 1 rows containing missing values (geom_point).
Mean Age: 28.25
Median Score: 85
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
Removed 1 rows containing non-finite values (stat_bin).
[Execution complete with exit code 0]
Result:
Thus, the R program to clean and visualize the data has been
executed successfully and the output is verified.
Ex.no:05 Perform Time Series Analysis and apply the
various visualization techniques
Date:
Aim:
To perform Time Series Analysis and apply the various visualization
techniques.
Algorithm:
1. Start the program.
2. Load the required libraries and the Air Passengers dataset.
3. Plot the time series.
4. Decompose the series into trend, seasonal, and residual components.
5. Plot the rolling statistics and the autocorrelation (ACF and PACF) functions.
6. Stop.
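The idea behind rolling statistics and seasonality can be tried on a small synthetic series before loading real data. The sketch below assumes a made-up monthly series with a linear trend and a 12-month cycle; it does not use the Air Passengers data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month seasonal cycle.
months = pd.date_range("2020-01-01", periods=36, freq="MS")
t = np.arange(36)
series = pd.Series(100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12), index=months)

# A 12-month rolling mean averages out the seasonal cycle and exposes the trend.
# The first 11 values are NaN because the window is not yet full.
rolling_mean = series.rolling(window=12).mean()

# Differencing at lag 12 removes the seasonal cycle and leaves the
# year-over-year change (here a constant, because the trend is linear).
seasonal_diff = series.diff(12)

print(rolling_mean.dropna().head())
print(seasonal_diff.dropna().head())
```

On real data the differenced series will not be constant, but a roughly flat result after differencing is a common informal sign that the seasonal pattern has been removed.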
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
# Load the Air Passengers dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')
# Display the first few rows of the dataset
print(df.head())
# Visualize the time series
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Passengers'], label='Passenger Count')
plt.title('Airline Passengers Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
# Time Series Decomposition
result = seasonal_decompose(df['Passengers'], model='multiplicative',
period=12)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 8), sharex=True)
ax1.plot(result.trend, label='Trend')
ax1.set_title('Trend Component')
ax2.plot(result.seasonal, label='Seasonal')
ax2.set_title('Seasonal Component')
ax3.plot(result.resid, label='Residual')
ax3.set_title('Residual Component')
ax4.plot(result.observed, label='Observed')
ax4.set_title('Observed')
plt.tight_layout()
plt.show()
# Rolling Statistics (12-month rolling mean and standard deviation)
rolling_mean = df['Passengers'].rolling(window=12).mean()
rolling_std = df['Passengers'].rolling(window=12).std()
plt.figure(figsize=(12, 6))
plt.plot(df['Passengers'], label='Original')
plt.plot(rolling_mean, label='Rolling Mean (12 months)')
plt.plot(rolling_std, label='Rolling Std (12 months)')
plt.title('Rolling Mean & Standard Deviation')
plt.legend()
plt.show()
# Seasonal-Trend decomposition using LOESS (STL)
from statsmodels.tsa.seasonal import STL
stl = STL(df['Passengers'], seasonal=13)
result_stl = stl.fit()
plt.figure(figsize=(12, 6))
plt.plot(result_stl.trend, label='Trend')
plt.plot(result_stl.seasonal, label='Seasonal')
plt.plot(result_stl.resid, label='Residual')
plt.title('Seasonal-Trend decomposition using LOESS (STL)')
plt.legend()
plt.show()
# Autocorrelation and Partial Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Pass ax so each plot is drawn on the sized figure instead of a new one
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Autocorrelation Function (ACF)')
plt.show()
fig, ax = plt.subplots(figsize=(12, 6))
plot_pacf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Partial Autocorrelation Function (PACF)')
plt.show()
Output:
Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
Result:
Thus, the given Python program to perform Time Series Analysis and apply
the various visualization techniques has been executed successfully.
Ex.no:06 Perform Data Analysis and representation on a
Map using various Map data sets with Mouse
Rollover effect, user interaction, etc
Date:
Aim:
Create an interactive map using Folium to display markers for specified
cities, showcasing their populations with tooltips and popups.
Algorithm:
1. Start the program.
2. Load the required libraries (Folium, Pandas) and prepare the city data.
3. Create a Folium map centered at a chosen location.
4. Add a marker with a tooltip and a popup for each city.
5. Save the map to an HTML file.
6. Stop.
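The data preparation behind such a map can be sketched with pandas alone. The snippet below is an illustration under stated assumptions: it derives the map centre from the average coordinates instead of hard-coding it, and builds HTML popup text for each city. The Popup column is a name chosen for this sketch.

```python
import pandas as pd

# Same sample cities as in the exercise.
cities = pd.DataFrame({
    "City": ["New York", "San Francisco", "Los Angeles"],
    "Population": [8175133, 884363, 3906772],
    "Latitude": [40.7128, 37.7749, 34.0522],
    "Longitude": [-74.0060, -122.4194, -118.2437],
})

# Centre the map on the mean of the marker coordinates.
map_center = [cities["Latitude"].mean(), cities["Longitude"].mean()]

# Popup text as HTML: <br> gives a line break, and ":," adds
# thousands separators to the population figure.
cities["Popup"] = [
    f"<b>{city}</b><br>Population: {population:,}"
    for city, population in zip(cities["City"], cities["Population"])
]

print(map_center)
print(cities[["City", "Popup"]])
```

The resulting map_center list and Popup strings could then be supplied as the location argument of folium.Map and the popup argument of folium.Marker.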
Program:
import folium
import pandas as pd
# Sample data (replace with your own dataset)
data = {
'City': ['New York', 'San Francisco', 'Los Angeles'],
'Population': [8175133, 884363, 3906772],
'Latitude': [40.7128, 37.7749, 34.0522],
'Longitude': [-74.0060, -122.4194, -118.2437]
}
df = pd.DataFrame(data)
# Create a folium map centered at a specific location
map_center = [37.7749, -122.4194]
map_obj = folium.Map(location=map_center, zoom_start=5)
# Add markers to the map with mouseover text
for index, row in df.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        # Popups are rendered as HTML, so use <br> for a line break
        popup=f"City: {row['City']}<br>Population: {row['Population']}",
        tooltip=row['City']
    ).add_to(map_obj)
# Save the map to an HTML file
map_obj.save('interactive_map.html')
Output:
Result:
Thus, an interactive map showing city markers with tooltips and popups
has been created and saved as 'interactive_map.html'.
Ex.no:07 Build cartographic visualization for multiple
datasets involving various countries of the
world; states and districts in India etc
Date:
Aim:
Develop cartographic visualizations for multiple datasets,
encompassing countries worldwide, and regions like states and districts
in India.
Algorithm:
1. Start the program.
2. Load the required libraries (GeoPandas, Shapely, Matplotlib, Pandas).
3. Create GeoDataFrames for the world and for India.
4. Merge the population data with the map data.
5. Plot the world and India maps side by side.
6. Stop.
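A frequent pitfall in the merge step is that country names in the statistics table do not match the names in the map layer, so the affected countries are left without data. The sketch below uses plain pandas and hypothetical stand-in tables (map_names and stats) to show how the indicator option of merge can reveal such mismatches.

```python
import pandas as pd

# Stand-in for the 'name' column of a map layer (hypothetical subset).
map_names = pd.DataFrame({
    "name": ["United States of America", "China", "India"],
})

# Statistics table that abbreviates one country name.
stats = pd.DataFrame({
    "Country": ["USA", "China", "India"],
    "Population": [331002651, 1444216107, 1380004385],
})

# indicator=True adds a _merge column telling where each row was found.
merged = map_names.merge(
    stats, left_on="name", right_on="Country", how="left", indicator=True
)

# Rows marked 'left_only' exist in the map layer but found no statistics.
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print("Map entries without data:", unmatched)
```

Renaming the mismatched entries in the statistics table before merging is one simple way to resolve what this check reports.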
Program:
import geopandas as gpd
from shapely.geometry import Polygon
import matplotlib.pyplot as plt
import pandas as pd
# Create a GeoDataFrame for the world
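# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0;
# on newer versions, download the Natural Earth file and pass its path instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))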
# Create a GeoDataFrame for India
# Note: For simplicity, using a small polygon for India
india_geometry = gpd.GeoSeries([Polygon([(75, 20), (80, 20),
(80, 25), (75, 25)])], crs='EPSG:4326')
india = gpd.GeoDataFrame(geometry=india_geometry)
# Sample data for demonstration
# Country names must match the 'name' column of the map data
world_data = {'Country': ['United States of America', 'China', 'India'],
              'Population': [331002651, 1444216107, 1380004385]}
world_df = pd.DataFrame(world_data)
india_data = {'State': ['Maharashtra', 'Uttar Pradesh', 'Tamil Nadu'],
              'Population': [123144223, 223897418, 77841267]}
india_df = pd.DataFrame(india_data)
# Merge world and India data with the map data
world = world.merge(world_df, left_on='name',
right_on='Country', how='left')
india['Population'] = india_df['Population']
# Plot world map
fig, ax = plt.subplots(1, 2, figsize=(15, 7))
world.plot(column='Population', cmap='OrRd', ax=ax[0],
legend=True, legend_kwds={'label': "Population by Country"})
ax[0].set_title('World Population')
# Plot India map
india.plot(column='Population', cmap='OrRd', ax=ax[1],
legend=True, legend_kwds={'label': "Population by State"})
ax[1].set_title('India Population')
plt.show()
Output:
Result:
Thus, informative and visually engaging maps representing diverse
datasets for global countries and specific Indian regions have been produced.
Ex.no:08 Perform EDA on Wine Quality Data Set
Date:
Aim:
To perform EDA on Wine Quality Data Set.
Algorithm:
1. Start the program.
2. Import libraries such as pandas, matplotlib, and seaborn.
3. Perform initial data exploration and compute summary statistics.
4. Visualize the dataset for further analysis.
5. Stop.
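The exploration step can be tried on a tiny table first. The sketch below uses a hypothetical five-row frame with a text column and one missing value, which are the two things that most often need attention before computing correlations; it is not the wine quality file itself.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a wine table: one text column, one missing value.
wine = pd.DataFrame({
    "type": ["white", "white", "red", "red", "white"],
    "alcohol": [8.8, 9.5, 10.1, np.nan, 9.9],
    "quality": [6, 6, 5, 5, 7],
})

# Count missing values per column before any analysis.
missing = wine.isna().sum()

# Correlation is only defined for numeric columns, so select them explicitly.
corr = wine.select_dtypes(include="number").corr()

# A categorical column is better summarized by group, e.g. mean quality per type.
mean_quality = wine.groupby("type")["quality"].mean()

print(missing)
print(corr)
print(mean_quality)
```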
Program:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("winequalityN.csv")
print("First few rows of the dataset:")
print(data.head())
print("Summary statistics of the dataset:")
print(data.describe())
data.hist(bins=30, figsize=(12, 8))
plt.suptitle("Histograms of Features", y=1.02)
plt.show()
# Use numeric columns only; the text column 'type' cannot be correlated
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm",
linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
plt.figure(figsize=(12, 8))
sns.boxplot(data=data, width=0.5)
plt.xticks(rotation=45)
plt.title("Box Plots of Features")
plt.show()
plt.figure(figsize=(12, 6))
sns.histplot(data["alcohol"], kde=True)
plt.title("Alcohol Content Distribution")
plt.show()
Output:
First few rows of the dataset:
type fixed acidity volatile acidity ... sulphates alcohol quality
0 white 7.0 0.27 ... 0.45 8.8 6
1 white 6.3 0.30 ... 0.49 9.5 6
2 white 8.1 0.28 ... 0.44 10.1 6
3 white 7.2 0.23 ... 0.40 9.9 6
4 white 7.2 0.23 ... 0.40 9.9 6
[5 rows x 13 columns]
Summary statistics of the dataset:
fixed acidity volatile acidity ... alcohol quality
count 6487.000000 6489.000000 ... 6497.000000 6497.000000
mean 7.216579 0.339691 ... 10.491801 5.818378
std 1.296750 0.164649 ... 1.192712 0.873255
min 3.800000 0.080000 ... 8.000000 3.000000
25% 6.400000 0.230000 ... 9.500000 5.000000
50% 7.000000 0.290000 ... 10.300000 6.000000
75% 7.700000 0.400000 ... 11.300000 6.000000
max 15.900000 1.580000 ... 14.900000 9.000000
[8 rows x 12 columns]
Result:
Thus, the Python program to perform EDA on the Wine Quality Data Set has
been executed successfully.
Ex.no:09 Use a case study on a data set and apply EDA,
visualization techniques and present an
analysis report.
Date:
Aim:
Use a case study on a data set and apply the various EDA and visualization
techniques and present an analysis report.
Algorithm:
1. Start the program.
2. Import the necessary libraries: pandas, matplotlib, seaborn, and
sklearn.datasets for data loading.
3. Load the dataset and explore the data.
4. Visualize the dataset and present the analysis report.
5. Stop.
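Because the target column of the iris data holds numeric codes, a report is easier to read when the codes are mapped to species names. The sketch below uses a few hypothetical measurements (not rows from the real dataset) together with the standard iris class order setosa, versicolor, virginica.

```python
import pandas as pd

# Class order used by the iris dataset: code 0, 1, 2.
target_names = ["setosa", "versicolor", "virginica"]

# Hypothetical petal lengths, two per class, for illustration only.
sample = pd.DataFrame({
    "petal length (cm)": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "target": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
})

# Convert the float codes to integers, then map each code to its name.
code_to_name = dict(enumerate(target_names))
sample["species"] = sample["target"].astype(int).map(code_to_name)

# Per-species averages are a compact way to back up statements in a report.
mean_petal = sample.groupby("species")["petal length (cm)"].mean()
print(mean_petal)
```

Passing the species column as the hue or x variable in the plots would then label the legend and axes with names instead of codes.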
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['target'])
print("First few rows of the dataset:")
print(data.head())
print("\nData Information:")
print(data.info())
print("\nSummary Statistics:")
print(data.describe())
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data['sepal length (cm)'], kde=True)
plt.title("Distribution of Sepal Length")
plt.subplot(1, 2, 2)
sns.histplot(data['sepal width (cm)'], kde=True)
plt.title("Distribution of Sepal Width")
plt.show()
sns.set_style("whitegrid")
sns.pairplot(data, hue='target', markers=["o", "s", "D"])
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x='target', y='petal length (cm)', data=data)
plt.title("Petal Length Boxplot by Species")
plt.show()
correlation_matrix = data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
print("\nAnalysis Report:")
print("- The dataset contains three species of iris flowers: setosa, versicolor, "
      "and virginica.")
print("- The features vary in their distributions, with sepal length and sepal "
      "width showing different patterns.")
print("- The pairplot shows how the features are correlated and how they can "
      "be used to distinguish between species.")
print("- The petal length is a strong predictor for species differentiation, with "
      "setosa having the shortest petals and virginica the longest.")
print("- The correlation heatmap confirms that petal length is highly "
      "correlated with the target variable, making it an important feature for "
      "classification.")
Output:
First few rows of the dataset:
sepal length (cm) sepal width (cm) ... petal width (cm) target
0 5.1 3.5 ... 0.2 0.0
1 4.9 3.0 ... 0.2 0.0
2 4.7 3.2 ... 0.2 0.0
3 4.6 3.1 ... 0.2 0.0
4 5.0 3.6 ... 0.2 0.0
[5 rows x 5 columns]
Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null float64
dtypes: float64(5)
memory usage: 6.0 KB
None
Summary Statistics:
sepal length (cm) sepal width (cm) ... petal width (cm) target
count 150.000000 150.000000 ... 150.000000 150.000000
mean 5.843333 3.057333 ... 1.199333 1.000000
std 0.828066 0.435866 ... 0.762238 0.819232
min 4.300000 2.000000 ... 0.100000 0.000000
25% 5.100000 2.800000 ... 0.300000 0.000000
50% 5.800000 3.000000 ... 1.300000 1.000000
75% 6.400000 3.300000 ... 1.800000 2.000000
max 7.900000 4.400000 ... 2.500000 2.000000
[8 rows x 5 columns]
Result:
Thus, the case study applying various EDA and visualization techniques to a
data set has been carried out and an analysis report has been presented.