Data Visualization Techniques Lab Manual
List of Experiments
1. Acquiring and plotting data.
2. Statistical Analysis – such as Multivariate Analysis, PCA, LDA, Correlation, Regression, and Analysis of Variance.
3. Financial analysis using Clustering, Histogram, and HeatMap.
4. Time-series analysis – Stock market.
5. Visualization of various massive datasets – Finance, Healthcare, Census, Geospatial.
6. Visualization on Streaming datasets (Stock market dataset, weather forecasting).
7. Market-Basket Data analysis and visualization.
8. Text visualization using web analytics.
EXPERIMENT NO 1: Acquiring and plotting data
AIM: To acquire data from various sources and plot it using Python/R.
Explanation:
Data Acquisition:
Data can be acquired from various sources like CSV files, Excel files, databases, or APIs.
For this example, we will create a small dataset manually using pandas.
Data Cleaning and Processing:
Ensure that the data is clean (no missing values, duplicates, etc.).
Convert data to a suitable format for plotting.
Plotting:
Use the matplotlib library to create a line plot.
Label the x-axis, y-axis, and provide a meaningful title.
Use markers to highlight the points on the graph.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Create sample data
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [10, 20, 15, 25, 30]
})

# Step 2: Plot the data
plt.plot(data['x'], data['y'], marker='o', linestyle='-', color='blue', label='Line Plot')

# Step 3: Add labels and title
plt.title('Sample Data Plot')
plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')

# Step 4: Display the plot
plt.legend()
plt.show()
Output:
👉 Line Plot:
X-Axis: Values → 1, 2, 3, 4, 5
Y-Axis: Values → 10, 20, 15, 25, 30
Line Color: Blue
Markers: Circular points at each data point
Title: "Sample Data Plot"
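The explanation above also mentions acquiring data from CSV files and cleaning it (removing missing values and duplicates) before plotting. A minimal sketch of that path, using an in-memory CSV string purely for illustration in place of a real file or URL:

```python
import io
import pandas as pd

# Simulate a CSV source; in practice this would be a file path or URL
csv_text = "x,y\n1,10\n2,20\n3,\n3,\n4,25\n5,30\n"
data = pd.read_csv(io.StringIO(csv_text))

# Basic cleaning: drop duplicate rows, then rows with missing values
data = data.drop_duplicates().dropna()
print(data)
```

The cleaned DataFrame can then be passed to `plt.plot` exactly as in the program above.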
Experiment 2: Statistical Analysis
Aim:
To perform Principal Component Analysis (PCA) and visualize the reduced dimensions.
Explanation:
1. Purpose of PCA:
o PCA reduces the dimensionality of data while preserving the variance.
o It helps visualize high-dimensional data in a lower-dimensional space.
2. Steps Involved:
o Load the sample data.
o Apply PCA to reduce the data to 2 principal components.
o Plot the reduced components using a scatter plot.
3. Why PCA is Useful:
o Reduces complexity in data.
o Highlights the most important patterns and relationships.
o Helps in clustering and visualization of high-dimensional data.
Program:
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Sample data (5 points, 3 dimensions)
data = np.array([[2, 3, 4], [3, 4, 6], [4, 5, 5], [5, 6, 8], [6, 7, 7]])

# Step 1: Apply PCA to reduce the data to 2 principal components
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

# Step 2: Plot PCA result
plt.scatter(transformed[:, 0], transformed[:, 1], color='red', marker='o')
plt.title('PCA Analysis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Step 3: Display the plot
plt.grid(True)
plt.show()
Output:
👉 PCA Scatter Plot:
X-axis → Principal Component 1
Y-axis → Principal Component 2
Red points → Reduced 2D data points
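The explanation states that PCA preserves variance while reducing dimensionality. This can be checked numerically with `explained_variance_ratio_`, which reports the fraction of total variance each component keeps (shown here on a small illustrative 3-dimensional sample):

```python
from sklearn.decomposition import PCA
import numpy as np

# Small sample: 5 points in 3 dimensions (illustrative values)
data = np.array([[2, 3, 4], [3, 4, 6], [4, 5, 5], [5, 6, 8], [6, 7, 7]])

pca = PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance each component preserves
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the sum is close to 1.0, the 2D plot is a faithful summary of the original data.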
Experiment 3: Financial Analysis using Clustering, Histogram, and
Heatmap
Aim:
To analyze financial data using clustering, histogram, and heatmap.
Explanation:
1. Financial Data Overview:
o Financial data includes information like stock prices, trading volumes, and
returns.
o Analyzing this data helps understand market trends, correlations, and patterns.
2. Clustering:
o Clustering groups similar financial patterns together (e.g., similar stock
behaviors).
o KMeans clustering algorithm is commonly used.
3. Histogram:
o A histogram shows the distribution of financial values (e.g., stock prices).
o Helps identify the frequency of different value ranges.
4. Heatmap:
o A heatmap shows correlations between financial variables.
o High correlation values (near +1 or -1) indicate strong relationships between
variables.
Program :
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 1: Create sample financial data
data = pd.DataFrame({
    'Price': np.random.rand(10) * 100,
    'Volume': np.random.rand(10) * 1000
})

# Step 2: Clustering using KMeans
kmeans = KMeans(n_clusters=2, n_init=10)
data['Cluster'] = kmeans.fit_predict(data[['Price', 'Volume']])

# Step 3: Plot histogram of Price
sns.histplot(data['Price'], bins=5, color='skyblue')
plt.title('Price Distribution')
plt.show()

# Step 4: Plot heatmap of the Price/Volume correlation
sns.heatmap(data[['Price', 'Volume']].corr(), annot=True, cmap='coolwarm')
plt.title('Heatmap of Financial Data')
plt.show()
Output:
👉 Histogram:
X-axis → Price range
Y-axis → Frequency of occurrence
Color → Sky blue
👉 Heatmap:
Shows correlation between Price and Volume
Strong positive/negative correlation will appear in shades of red/blue
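The program above computes cluster labels but does not display them. A sketch (on the same kind of random sample data) that colours each point by its KMeans cluster label, making the grouping visible:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(2)
data = pd.DataFrame({
    'Price': np.random.rand(10) * 100,
    'Volume': np.random.rand(10) * 1000
})

kmeans = KMeans(n_clusters=2, n_init=10)
data['Cluster'] = kmeans.fit_predict(data[['Price', 'Volume']])

# Colour each point by its cluster label to see the grouping
plt.scatter(data['Price'], data['Volume'], c=data['Cluster'], cmap='viridis')
plt.title('Price vs Volume by Cluster')
plt.xlabel('Price')
plt.ylabel('Volume')
plt.show()
```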
Experiment 4: Time-Series Analysis – Stock Market
Aim:
To perform time-series analysis on stock market data and visualize trends over time.
Explanation:
1. Time-Series Data:
o Time-series data is a sequence of data points collected or recorded at specific time
intervals (e.g., daily stock prices).
o Analyzing time-series data helps identify patterns, trends, and seasonal effects.
2. Key Concepts:
o Trend: Long-term increase or decrease in values over time.
o Seasonality: Repeating patterns over fixed time intervals (e.g., quarterly cycles).
o Noise: Random variations in the data.
3. Steps Involved:
o Generate sample stock price data over a period of time.
o Plot the trend using a line graph.
o Highlight trends and patterns.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Create sample time-series data (stock prices)
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=30)  # 30 days of data
prices = 100 + np.cumsum(np.random.randn(30) * 2)  # random-walk price around 100 with noise

# Step 2: Create DataFrame
data = pd.DataFrame({'Date': dates, 'Stock Price': prices})

# Step 3: Plot time-series data
plt.figure(figsize=(10, 5))
plt.plot(data['Date'], data['Stock Price'], marker='o', linestyle='-', color='green', label='Stock Price')
plt.title('Stock Market Trend')
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.xticks(rotation=45)

# Step 4: Add legend and grid
plt.legend()
plt.grid(True)
plt.show()
Output:
👉 Time-Series Line Plot:
X-Axis: Dates (Jan 1, 2023 → Jan 30, 2023)
Y-Axis: Stock Prices
Line Color: Green
Markers: Circular points showing daily values
Trend: Random-walk movement of the price over the month
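The explanation distinguishes trend from noise. A common way to separate the two is a rolling mean, sketched here on the same kind of simulated random-walk price series:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=30)
prices = 100 + np.cumsum(np.random.randn(30) * 2)  # random-walk price series
data = pd.DataFrame({'Date': dates, 'Stock Price': prices})

# A 7-day rolling mean smooths the noise and leaves the underlying trend
data['Trend'] = data['Stock Price'].rolling(window=7).mean()

plt.figure(figsize=(10, 5))
plt.plot(data['Date'], data['Stock Price'], color='lightgray', label='Daily Price')
plt.plot(data['Date'], data['Trend'], color='green', label='7-Day Rolling Mean')
plt.legend()
plt.xticks(rotation=45)
plt.show()
```

The first 6 values of the rolling mean are undefined (NaN) because a full 7-day window is not yet available.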
Experiment 5: Visualization of Various Massive Datasets – Finance,
Healthcare, Census, and Geospatial
Aim:
To visualize large datasets using bar plots and scatter plots for analysis in finance, healthcare,
census, and geospatial data.
Explanation:
1. Massive Datasets:
o Large datasets may have millions of records.
o Visualization helps to identify trends, patterns, and outliers.
2. Types of Data:
o Finance: Stock prices, market trends, returns.
o Healthcare: Patient records, disease outbreaks, drug trials.
o Census: Population distribution, demographics, and migration.
o Geospatial: Location-based data, traffic patterns, environmental data.
3. Visualization Techniques:
o Bar Plot: For categorical and comparative data.
o Scatter Plot: For relationship analysis between two variables.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Create sample dataset for the four domains
data = pd.DataFrame({
    'Category': ['Finance', 'Healthcare', 'Census', 'Geospatial'],
    'Value': [120, 150, 80, 100]
})

# Step 2: Bar plot
plt.figure(figsize=(8, 5))
sns.barplot(x='Category', y='Value', data=data, palette='viridis')
plt.title('Data Value Across Different Domains')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()

# Step 3: Scatter plot for geospatial data
x = np.random.rand(50) * 100
y = np.random.rand(50) * 100
plt.figure(figsize=(8, 5))
plt.scatter(x, y, color='blue', edgecolors='black')
plt.title('Sample Geospatial Data')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.grid(True)
plt.show()
Output:
👉 Bar Plot:
X-axis → Category
Y-axis → Value
Color → Viridis color palette
👉 Scatter Plot:
X-axis → Latitude
Y-axis → Longitude
Data points → Random geospatial locations
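The explanation notes that massive datasets may have millions of records; plotting every one is slow and produces an unreadable figure. A common remedy, sketched here on simulated location data, is to plot a random sample:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(3)
# Simulate a "massive" dataset: one million location records
big = pd.DataFrame({
    'Latitude': np.random.rand(1_000_000) * 100,
    'Longitude': np.random.rand(1_000_000) * 100
})

# A random sample preserves the overall pattern while keeping the figure responsive
sample = big.sample(n=5_000, random_state=1)

plt.figure(figsize=(8, 5))
plt.scatter(sample['Latitude'], sample['Longitude'], s=2, alpha=0.3)
plt.title('Sampled Geospatial Data (5,000 of 1,000,000 points)')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.show()
```

Small marker sizes (`s=2`) and transparency (`alpha=0.3`) help reveal point density in the sample.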
Experiment 6: Visualization on Streaming Dataset (Stock Market
Dataset, Weather Forecasting)
Aim:
To visualize a real-time streaming dataset for stock market data and weather data.
Explanation:
1. Streaming Data:
o Data generated continuously over time (e.g., stock prices, weather updates).
o Requires real-time processing and visualization.
2. Types of Streaming Data:
o Stock Market Data: Stock prices, trading volumes, market indices.
o Weather Data: Temperature, humidity, wind speed, and precipitation.
3. Visualization Techniques:
o Line Plot: To track changes over time.
o Scatter Plot: To find relationships between variables.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Create sample streaming stock data (simulated)
np.random.seed(1)
time = pd.date_range(start='2023-01-01', periods=50, freq='D')
stock_prices = 100 + np.cumsum(np.random.randn(50) * 2)  # simulated random-walk stock prices
temperature = np.random.rand(50) * 15 + 20  # simulated temperature in Celsius

# Step 2: Create DataFrame
data = pd.DataFrame({'Date': time, 'Stock Price': stock_prices, 'Temperature': temperature})

# Step 3: Plot stock price as line plot
plt.figure(figsize=(10, 5))
plt.plot(data['Date'], data['Stock Price'], color='blue', label='Stock Price')
plt.title('Stock Market Trend (Streaming Data)')
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.xticks(rotation=45)
plt.legend()
plt.show()

# Step 4: Plot temperature vs stock price as scatter plot
plt.figure(figsize=(8, 5))
plt.scatter(data['Stock Price'], data['Temperature'], color='red', label='Temperature vs Stock Price')
plt.title('Temperature vs Stock Price')
plt.xlabel('Stock Price')
plt.ylabel('Temperature')
plt.legend()
plt.grid(True)
plt.show()
Output:
👉 Stock Price Line Plot:
X-Axis: Date
Y-Axis: Stock Price
Trend: Blue line showing price changes over time
👉 Temperature vs Stock Price Scatter Plot:
X-Axis: Stock Price
Y-Axis: Temperature
Color: Red data points
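The program above plots the whole series at once; truly streaming data arrives one point at a time. One common pattern, sketched here with a simulated price stream (the data and update loop are illustrative assumptions, not a real market feed), is to redraw the figure inside a loop using matplotlib's interactive mode:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
prices = []

plt.ion()  # interactive mode so the figure can refresh inside the loop
fig, ax = plt.subplots()

for step in range(20):
    # Simulate one new price "arriving" from the stream
    new_price = (prices[-1] if prices else 100) + np.random.randn() * 2
    prices.append(new_price)

    # Redraw the growing series
    ax.clear()
    ax.plot(prices, color='blue')
    ax.set_title('Streaming Stock Price (simulated)')
    plt.pause(0.01)  # brief pause lets the figure redraw

plt.ioff()
print(len(prices))
```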
Experiment 7: Market Basket Data Analysis and Visualization
Aim:
To analyze and visualize market basket data using association rules and itemset frequency.
Explanation:
1. Market Basket Analysis:
o Technique used to identify patterns in customer purchases.
o Finds combinations of products frequently bought together.
2. Apriori Algorithm:
o Identifies frequent itemsets based on minimum support and confidence.
o Support → Frequency of item combinations.
o Confidence → Likelihood of purchasing items together.
3. Steps Involved:
o Create sample transaction data.
o Use Apriori algorithm to find frequent itemsets.
o Visualize the association rules using a heatmap or graph.
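The support and confidence definitions in step 2 can be checked by hand. Counting directly from the Milk and Bread columns of the sample transaction data used below (Milk appears in 4 of 6 transactions, Milk and Bread together in 3):

```python
# Milk  = [1, 0, 1, 1, 0, 1]
# Bread = [1, 1, 1, 0, 1, 1]
n_transactions = 6
count_milk = 4            # transactions containing Milk
count_milk_and_bread = 3  # transactions containing both Milk and Bread

support = count_milk_and_bread / n_transactions   # how often the pair occurs
confidence = count_milk_and_bread / count_milk    # P(Bread | Milk)
print(support, confidence)
```

So the rule Milk → Bread has support 0.5 and confidence 0.75, which is what the Apriori output below should reproduce.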
Program:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Create sample transaction data (1 = item bought, 0 = not bought)
data = {
    'Milk': [1, 0, 1, 1, 0, 1],
    'Bread': [1, 1, 1, 0, 1, 1],
    'Butter': [0, 1, 1, 1, 1, 0],
    'Eggs': [1, 1, 0, 0, 1, 1],
    'Cheese': [0, 0, 1, 1, 0, 1]
}
df = pd.DataFrame(data).astype(bool)  # mlxtend expects boolean columns

# Step 2: Find frequent itemsets using Apriori
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# Step 3: Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Step 4: Visualize association rules
plt.figure(figsize=(8, 5))
sns.heatmap(rules[['support', 'confidence', 'lift']], cmap='coolwarm', annot=True)
plt.title('Association Rules Heatmap')
plt.show()
Output:
👉 Heatmap:
X-Axis: Support, Confidence, Lift metrics; Y-Axis: individual association rules
Color: Strength of the relationship between items
Interpretation: Higher values indicate stronger associations
Experiment 8: Text Visualization Using Web Analytics
Aim:
To visualize textual data from web analytics using word clouds and frequency plots.
Explanation:
1. Textual Data:
o Data collected from web logs, search queries, and user comments.
o Often unstructured and needs preprocessing.
2. Text Visualization:
o Word cloud → Displays the most frequent words in a dataset.
o Frequency plot → Shows the top words and their counts.
3. Steps Involved:
o Create sample web log data.
o Preprocess text (remove stopwords, punctuation).
o Generate a word cloud and frequency plot.
Program:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns

# Step 1: Create sample text data from web logs
text_data = """
User clicked on product page, user viewed details, user added to cart, user searched for mobile phones,
user purchased item, user viewed related items, user clicked on ads, user left feedback, user searched for tablets,
user returned item, user added item to wishlist, user rated product.
"""

# Step 2: Preprocess text data
words = text_data.replace(',', '').replace('.', '').lower().split()
word_counts = Counter(words)

# Step 3: Generate word cloud
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate_from_frequencies(word_counts)

# Step 4: Display word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud from Web Logs')
plt.show()

# Step 5: Generate frequency plot of the 10 most common words
top_words = word_counts.most_common(10)
plt.figure(figsize=(8, 5))
sns.barplot(x=[w for w, _ in top_words], y=[c for _, c in top_words], palette='viridis')
plt.title('Top 10 Most Frequent Words in Web Logs')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.show()
Output:
👉 Word Cloud:
Displays the most common terms from the data.
Size of the word = Frequency of occurrence.
👉 Frequency Plot:
Top 10 most frequent terms.
X-Axis → Words
Y-Axis → Frequency
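The explanation calls for removing stopwords during preprocessing, which the program above does not do (so generic words like "user" dominate the counts). A sketch of the filtering step, using a small hand-made stopword list for illustration; in practice a library list (e.g. NLTK's English stopwords) would be used:

```python
from collections import Counter

text_data = "user clicked on product page user viewed details user added to cart"

# Hand-made stopword list for illustration only
stopwords = {'user', 'on', 'to', 'the', 'a', 'an', 'for'}

words = [w for w in text_data.lower().split() if w not in stopwords]
word_counts = Counter(words)
print(word_counts.most_common(5))
```

After filtering, the word cloud and frequency plot highlight content words (products and actions) rather than filler terms.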