Data Visualization Techniques and Tools

This document outlines a module on Data Visualization and Data Exploration, emphasizing the importance of visualizing data for better understanding and insight extraction. It covers data wrangling processes, various visualization tools and libraries, and statistical concepts relevant to data analysis. Additionally, it introduces Python libraries like NumPy and pandas for data manipulation and visualization techniques, along with practical exercises for applying these concepts.

Uploaded by

Divyaraj

Data Visualization and Data Exploration

Module 4
Syllabus
• Introduction: Data Visualization, Importance of Data
Visualization, Data Wrangling, Tools and Libraries for
Visualization; Comparison Plots: Line Chart, Bar Chart
and Radar Chart; Relation Plots: Scatter Plot, Bubble
Plot, Correlogram and Heatmap; Composition Plots: Pie
Chart, Stacked Bar Chart, Stacked Area Chart, Venn
Diagram; Distribution Plots: Histogram, Density Plot,
Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth
Map, Connection Map; What Makes a Good
Visualization?
Textbooks

2. Data Visualization Workshop, Tim Grobmann and Mario Dobler, Packt Publishing, ISBN 9781800568112
Course Outcome

CO4. Evaluate data visualization tools and libraries and plot graphs.
Introduction

• People find it hard to make sense of lots of

numbers and random information, but they are
good at understanding pictures and visuals.

• Showing data in a visual way helps us

understand it better.
Introduction
• Python is a popular tool for working with data.

• It can help turn messy data into a clear format,

analyze it, and make it look good with charts and
graphs.

• This module will teach you how to use Python and some
special tools (like NumPy, pandas, Matplotlib, seaborn,
and geoplotlib) to create clear and useful pictures of
data.
Introduction to Data
Visualization
• Computers and phones store data like names and
numbers in a digital way.
• Data representation means how we store, process,
and share this data.
• Showing data visually, like with charts or graphs,
helps tell a story and highlight important findings.
• If we don't present data well, it loses its value.
• Good representations make information clear and
easy to understand.
• Data by itself isn't the same as useful information.
• Visual representations help us turn raw data into
insights that are easy to grasp and use.
The Importance of Data
Visualization
Visualizing data has many advantages, such as the following:

• Complex data can be easily understood.
• A simple visual representation of outliers, target audiences, and futures markets can be created.
• Storytelling can be done using dashboards and animations.
• Data can be explored through interactive visualizations.

Data Wrangling

• Data wrangling is the process of

transforming raw data into a
suitable representation for various
tasks.
The following steps explain the
flow of the data wrangling process:
1. First, the Employee Engagement data is in its raw form.

2. Then, the data gets imported as a DataFrame and is

later cleaned.

3. The cleaned data is then transformed into graphs, from

which findings can be derived.

4. Finally, we analyze this data to communicate the final

results.
Steps involved in Data
Wrangling
• Raw Data

• Importing Data

• Cleaning Data

• Transforming Data into Graphs

• Analysing Data

• Communicating Results
Raw Data:

• You start with raw data, which is unprocessed

and often messy.

• For example, let's say you have data about

employee engagement collected from surveys.
Importing Data
• This raw data is then imported into a tool or
program.

• In Python, we use a structure called a DataFrame

(from a library like pandas) to store and work with
the data.

• Imagine putting all your survey data into a big table

where each row is a survey response and each column
is a question or a piece of information about the
Cleaning Data

• The imported data often has errors, missing

values, or inconsistencies.

• Cleaning involves fixing these issues.

• For example, you might remove duplicate

responses, fill in missing answers, or correct
typos.
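The importing and cleaning steps can be sketched with pandas as follows. This is only an illustration: the column names and the small inline CSV are made up, and filling gaps with the median is one possible choice among many.

```python
import io
import pandas as pd

# Hypothetical survey export; in practice this would be read from a file.
raw_csv = io.StringIO(
    "respondent,department,engagement\n"
    "1,Sales,4\n"
    "2,Sales,\n"
    "2,Sales,\n"
    "3,IT,5\n"
    "4,it,3\n"
)

# Importing: each row is a survey response, each column a question or attribute.
survey = pd.read_csv(raw_csv)

# Cleaning: remove duplicate responses, fix inconsistent labels, fill missing answers.
survey = survey.drop_duplicates()
survey["department"] = survey["department"].str.upper()
survey["engagement"] = survey["engagement"].fillna(survey["engagement"].median())

print(survey)
```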
Transforming Data into Graphs:

• Once the data is clean, you can transform it into

visual formats like charts or graphs.

• This makes it easier to see patterns and trends.

• For instance, you might create a bar chart showing

the average engagement score for different
departments.
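A minimal sketch of the bar chart mentioned above, assuming Matplotlib is installed. The departments and scores are made-up values, and the Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, already cleaned survey data.
survey = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR"],
    "engagement": [4, 3, 5, 4, 2],
})

# Average engagement score per department.
avg_by_dept = survey.groupby("department")["engagement"].mean()

fig, ax = plt.subplots()
ax.bar(avg_by_dept.index, avg_by_dept.values)
ax.set_xlabel("Department")
ax.set_ylabel("Average engagement score")
```

Calling fig.savefig(...) or plt.show() afterwards would write or display the chart.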
Analyzing Data

• With your graphs ready, you analyze them to draw

conclusions.

• You look at the patterns and trends to understand

what they mean.

• For example, you might notice that one department

has significantly higher engagement scores than
others.
Communicating Results

• Finally, you communicate your findings.

• This could be through reports, presentations, or

dashboards.

• The goal is to clearly convey the insights you’ve gained

from the data, such as which areas need improvement
in employee engagement.
Data wrangling process to measure
employee engagement
Tools and Libraries for
Visualization
• Coding Tools
• PYTHON
• R
• MATLAB
• Non Coding Tools
• TABLEAU
NOT PART OF THE SYLLABUS
• MATLAB (https://www.mathworks.com/products/matlab.html)
• R (https://www.r-project.org)
• Tableau (https://www.tableau.com)
Overview of Statistics
• Statistics is a combination of the analysis, collection,
interpretation, and representation of numerical data.

• Probability is a measure of the likelihood that an event will

occur and is quantified as a number between 0 and 1.

• A probability distribution is a function that provides the

probability for every possible event.

• A probability distribution is frequently used for statistical

analysis.
Overview of Statistics cont..
• The higher the probability, the more likely the
event is going to occur.
• There are two types of probability distributions
• Discrete
• Continuous
• A discrete probability distribution describes
occurrences that have countable or finite
outcomes.
• A continuous probability distribution is one in
which the random variable X can take on any value
in a continuous range.
Overview of Statistics cont..
Measures of Central Tendency
Mean
• The arithmetic average is computed by summing up all
measurements and dividing the sum by the number of
observations.
Overview of Statistics cont..
Measures of Central Tendency
Median
• It is the middle value of the ordered dataset.

• If there is an even number of observations, the

median will be the average of the two middle values.

• The median is less sensitive to outliers than the mean, where outliers are values that are distinct from the rest of the data.
Overview of Statistics cont..
Measures of Central Tendency
Mode
• The mode is defined as the most frequent
value.
• There may be more than one mode in cases
where multiple values are equally frequent.

For example:
• A die was rolled 10 times, and we got the
following numbers: 4, 5, 4, 3, 4, 2, 1, 1, 2, and
1.
• The modes are 1 and 4, since each of them
occurs three times.
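The three measures of central tendency can be checked on the die-roll numbers with NumPy. This is a small sketch; NumPy has no dedicated mode function, so the mode is derived here from value counts.

```python
import numpy as np

rolls = np.array([4, 5, 4, 3, 4, 2, 1, 1, 2, 1])

mean = np.mean(rolls)      # sum of all values divided by their count
median = np.median(rolls)  # average of the two middle values (even count)

# Mode: the value(s) with the highest count; there may be more than one.
values, counts = np.unique(rolls, return_counts=True)
modes = values[counts == counts.max()]

print(mean, median, modes)
```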
Overview of Statistics cont..
Measures of Dispersion
• Dispersion, also called variability, is the extent to
which a probability distribution is stretched or
squeezed.

• The different measures of dispersion are as follows:

• Variance
• Standard Deviation
• Range
Overview of Statistics cont..
Variance

• Variance is a measure of how spread out the numbers in a

data set are.

• It tells you how much the numbers differ from the average
(mean) value of the data set. If the variance is small, the
numbers are close to the mean. If the variance is large, the
numbers are spread out over a wider range.

• Variance is calculated as follows:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

• Where \mu is the mean.
• N is the number of data points.
• x_i is each individual data point.
Overview of Statistics cont..
Standard deviation
• This is the square root of the variance.
Range
• This is the difference between the largest and smallest
values in a dataset.
Interquartile range
• The Interquartile Range (IQR) is a measure of the spread of
the middle 50% of a data set.
• Also called the midspread or middle 50%, this is the
difference between the 75th and 25th percentiles, or between
the upper and lower quartiles.
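The dispersion measures above can be computed with NumPy as in the following sketch. The data is illustrative. Note that np.var and np.std divide by the number of data points N by default, and np.percentile interpolates linearly between neighboring values by default.

```python
import numpy as np

data = np.array([4, 5, 4, 3, 4, 2, 1, 1, 2, 1])

variance = np.var(data)                   # mean squared distance from the mean
std_dev = np.std(data)                    # square root of the variance
value_range = np.max(data) - np.min(data) # largest minus smallest value

# Interquartile range: 75th percentile minus 25th percentile.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(variance, std_dev, value_range, iqr)
```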
Overview of Statistics cont..
Correlation
• Correlation describes the statistical relationship
between two variables.

• In a positive correlation, both variables move in the

same direction.

• In a negative correlation, the variables move in

opposite directions.

• In zero correlation, the variables are not related.
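As a sketch of the three cases, the Pearson correlation coefficient can be computed with np.corrcoef. The arrays below are made up so that the coefficient comes out as exactly +1, -1, and 0.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_positive = 2 * x + 1                 # rises whenever x rises
y_negative = -3 * x                    # falls whenever x rises
y_none = np.array([1, -1, 0, -1, 1])   # no linear relationship with x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print(r_positive, r_negative, r_none)
```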

Overview of Statistics cont..
Types of Data
Types of data
Nominal Data:

• Nominal data is used to label variables without any

quantitative value.

• It is simply a way of categorizing data.

• Examples:
• Gender: Male, Female

• Marital Status: Single, Married, Divorced

• Types of Fruits: Apple, Banana, Orange

Types of data
Ordinal Data:

• Ordinal data is similar to nominal data, but the categories

have a meaningful order or ranking.

• Examples:
• Education Level: High School, Bachelor's, Master's, PhD

• Customer Satisfaction: Very Unsatisfied, Unsatisfied, Neutral,

Satisfied, Very Satisfied
• Movie Ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
Types of data
Temporal Data:

• Temporal data is related to time.

• It deals with any information that is recorded with

reference to specific times or dates.

• Examples:
• Dates and times of events (like a log of user activities over time).
• Time series data (like daily temperatures or stock prices over a year).
Types of data
Spatial Data:

• Spatial data is related to space. It deals with information

about the locations and shapes of objects, and their
relationships in space.

• Examples:
• Geographic locations (like the coordinates of cities on a map).

• Maps and routes (like the path of a delivery truck).

Types of data
• The following table gives an overview of which measure of
central tendency is best suited to a particular type of data.
Numpy
• NumPy is used to apply basic mathematical and
statistical operations to data while
working with multidimensional arrays.

• It provides support for large n-dimensional
arrays and has built-in support for many
high-level mathematical and statistical
operations.
Note on numpy:

• Before NumPy, there was a library called
Numeric.

• However, it's no longer used, because NumPy offers
more features and better performance,
especially when dealing with large and
high-dimensional arrays.
ndarray

• NumPy introduced the ndarray, which is a

powerful data structure for handling large
and multi-dimensional arrays.

• It allows efficient storage and manipulation

of numerical data.
Example
import numpy as np

# Creating arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

• Here, a and b are created using np.array().
• This function converts the given lists [1, 2, 3] and [4, 5, 6] into ndarray objects.
• These ndarray objects allow for efficient and optimized numerical operations, which are performed next.

# Adding arrays
c = a + b
print(c) # Output: [5 7 9]

• The addition operation a + b is performed element-wise.
• This means that the first element of a is added to
the first element of b, the second element of a is
added to the second element of b, and so on.
• The result is stored in c, which is also an ndarray.
Numpy
• NumPy provides implementations of all the
mathematical operations we covered in this section
• Built in methods supported by Numpy are:
Mean:
np.mean(dataset) # mean value for the whole dataset
np.mean(dataset[0]) # mean value of the first row
np.mean(dataset[:, 0]) # mean value of the whole first column
np.mean(dataset[1, 0:10]) # mean value of the first 10 elements of the second row
Numpy
Median
np.median(dataset) # median value for the whole dataset
np.median(dataset[-1]) # median value of the last row, using negative indexing
np.median(dataset[5:, 0]) # median of the first column for rows from index 5 onwards

• dataset[5:, 0] selects all rows from the 6th row onwards (since indexing starts from 0, index 5 refers to the 6th row) and takes only the first column of these rows.
Numpy
• Variance (Var): The variance describes how far a set of
numbers is spread out from their mean.

np.var(dataset) # variance value for the whole

dataset

np.var(dataset, axis=0) # axis used to get variance

per column

np.var(dataset, axis=1) # axis used to get variance

per row
Numpy

• Standard Deviation (std): One of the
advantages of the standard deviation is that
it is expressed on the same scale as the data.

• This means that the deviation
has the same unit as the data itself.
Numpy
np.std(dataset) # standard deviation for the whole dataset
np.std(dataset[:2, :2]) # standard deviation of the values in the first 2 rows and columns
np.std(dataset, axis=1) # axis used to get the standard deviation per row
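The calls listed in this section can be tried on a small array. The 2x3 dataset below is a made-up stand-in for the dataset variable used on the slides.

```python
import numpy as np

dataset = np.array([[1, 2, 3],
                    [4, 5, 6]])

overall_mean = np.mean(dataset)            # all six values
first_row_mean = np.mean(dataset[0])       # values 1, 2, 3
first_col_mean = np.mean(dataset[:, 0])    # values 1, 4
last_row_median = np.median(dataset[-1])   # values 4, 5, 6

col_variance = np.var(dataset, axis=0)     # one variance per column
row_std = np.std(dataset, axis=1)          # one standard deviation per row

print(overall_mean, first_row_mean, first_col_mean, last_row_median)
print(col_variance, row_std)
```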

Exercise 1.01:

• Loading a Sample Dataset and Calculating

the Mean using NumPy
Activity 1.01:

• Using NumPy to Compute the Mean,

Median, Variance, and Standard Deviation
of a Dataset
Numpy
Basic NumPy Operations.
• Indexing

• Slicing

• Splitting

• Iterating
Indexing

• Indexing elements in a NumPy array, at

a high level, works the same as with
built-in Python lists.
• Elements can be indexed in multi-
dimensional matrices.
Indexing
Example:
dataset[0] # index single element in outermost dimension
dataset[-1] # index in reversed order in outermost dimension
dataset[1, 1] # index single element in two-dimensional data
dataset[-1, -1] # index in reversed order in two-dimensional data
Indexing
• Indexing in NumPy arrays works similarly to Python
lists but extends to multi-dimensional data.

• Using positive and negative indices, you can access

elements in both regular and reversed order.

• This makes NumPy arrays versatile for data manipulation

and retrieval in multi-dimensional datasets.
Slicing

• Slicing has also been adapted from
Python's lists.

• We can easily slice parts of arrays into new
ndarrays, which is very helpful when
handling large amounts of data.
Example:
dataset[1:3] # rows 1 and 2
dataset[:2, :2] # 2x2 subset of the data
dataset[-1, ::-1] # last row with elements reversed
dataset[-5:-1, :6:2] # the 4 rows before the last row, every other element up to (not including) index 6
Note:

• The indexing operation dataset[:2, :2] actually selects the

first two rows and the first two columns of the dataset.

• In NumPy, slicing follows the pattern [start:stop], where

stop is exclusive.

• So :2 means "up to, but not including, index 2", which

includes indices 0 and 1.
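A short sketch of indexing and slicing on a concrete array. The 3x4 dataset is made up for illustration.

```python
import numpy as np

# Rows: [0 1 2 3], [4 5 6 7], [8 9 10 11]
dataset = np.arange(12).reshape(3, 4)

first_row = dataset[0]             # index in the outermost dimension
last_row = dataset[-1]             # negative index counts from the end
center = dataset[1, 1]             # row 1, column 1
corner = dataset[-1, -1]           # last row, last column

rows_1_and_2 = dataset[1:3]        # stop index 3 is exclusive
top_left = dataset[:2, :2]         # first two rows and first two columns
reversed_last = dataset[-1, ::-1]  # last row, elements in reverse order

print(top_left)
print(reversed_last)
```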
Splitting
• Splitting data can be helpful in many situations, from
plotting only half of your time series data to
separating test and training data for machine
learning algorithms.

• There are two ways of splitting your data: horizontally and vertically.

• Horizontal splitting can be done with the hsplit function.

• Vertical splitting can be done with the vsplit function.
Example:

np.hsplit(dataset, 3) # split horizontally into 3 equal parts (column-wise)

np.vsplit(dataset, 2) # split vertically into 2 equal parts (row-wise)
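A sketch of both splitting directions on a made-up 2x6 array. The number of columns (or rows) has to be divisible by the number of requested parts, otherwise NumPy raises an error.

```python
import numpy as np

# Rows: [0 1 2 3 4 5], [6 7 8 9 10 11]
dataset = np.arange(12).reshape(2, 6)

# Horizontal split: cut along the columns into 3 parts of 2 columns each.
left, middle, right = np.hsplit(dataset, 3)

# Vertical split: cut along the rows into 2 parts of 1 row each.
top, bottom = np.vsplit(dataset, 2)

print(left)
print(bottom)
```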
Iterating
• Iterating the NumPy data structures, ndarrays, is
also possible.
• It steps over the whole list of data one after
another, visiting every single element in the
ndarrays once.
nditer & ndenumerate
• The nditer is a multi-dimensional iterator
object that iterates over one or more given
arrays.

• The ndenumerate provides both the index
(in tuple form, for example (0, 1)) and the
value at that index.
nditer
# iterating over the whole dataset (each value in each row)
for x in np.nditer(dataset):
    print(x)
ndenumerate

# iterating over the whole dataset with indices matching the position in the dataset
for index, value in np.ndenumerate(dataset):
    print(index, value)
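A runnable sketch of both iterators on a made-up 2x2 array, collecting the results into lists so they are easy to inspect.

```python
import numpy as np

dataset = np.array([[1, 2],
                    [3, 4]])

# nditer visits every element once, regardless of the number of dimensions.
values = [int(x) for x in np.nditer(dataset)]

# ndenumerate yields (index tuple, value) pairs.
indexed = [(index, int(value)) for index, value in np.ndenumerate(dataset)]

print(values)
print(indexed)
```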
Advanced NumPy Operations
• Filtering
• Sorting
• Combining
• Reshaping
FILTERING
• Filtering in NumPy refers to selecting elements from an array
based on certain conditions.

dataset[dataset > 10] # values bigger than 10
np.extract((dataset < 3), dataset) # alternative: values smaller than 3
dataset[(dataset > 5) & (dataset < 10)] # values bigger than 5 and smaller than 10
np.where(dataset > 5) # indices of values bigger than 5
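The filtering calls above, applied to a small made-up array so that the results can be read off directly.

```python
import numpy as np

dataset = np.array([[1, 7, 12],
                    [3, 15, 9]])

# A boolean mask selects the matching values as a flat array.
bigger_than_10 = dataset[dataset > 10]
smaller_than_3 = np.extract(dataset < 3, dataset)

# Conditions are combined with & (and each one needs parentheses).
between_5_and_10 = dataset[(dataset > 5) & (dataset < 10)]

# np.where returns one index array per dimension (rows, columns).
rows, cols = np.where(dataset > 5)

print(bigger_than_10, smaller_than_3, between_5_and_10)
print(rows, cols)
```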

SORTING
• Sorting each row of a dataset can be really useful.

• Using NumPy, we are also able to sort on other dimensions,

such as columns.

• argsort returns the indices that would
sort the array.

np.sort(dataset) # values sorted on last axis

np.sort(dataset, axis=0) # values sorted on axis 0
np.argsort(dataset) # indices of values in sorted list
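A sketch of the three sorting calls on a made-up 2x3 array.

```python
import numpy as np

dataset = np.array([[9, 1, 5],
                    [3, 7, 2]])

sorted_rows = np.sort(dataset)          # sorts along the last axis (within each row)
sorted_cols = np.sort(dataset, axis=0)  # sorts along axis 0 (within each column)
order = np.argsort(dataset)             # indices that would sort each row

print(sorted_rows)
print(sorted_cols)
print(order)
```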
Combining

• Stacking rows and columns onto an existing

dataset can be helpful when you have two datasets of
the same dimension saved to different files.
vstack & hstack
• Use vstack to stack dataset_1 on top of dataset_2.

• It will give us a combined dataset with all the rows
from dataset_1, followed by all the rows from
dataset_2.

• With hstack, we stack our datasets "next to each other,"
meaning that the elements from the first row of
dataset_1 will be followed by the elements of the
first row of dataset_2.
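A sketch of both stacking directions. The two 2x2 arrays stand in for dataset_1 and dataset_2 and are made up for illustration.

```python
import numpy as np

dataset_1 = np.array([[1, 2],
                      [3, 4]])
dataset_2 = np.array([[5, 6],
                      [7, 8]])

# vstack: rows of dataset_1 followed by rows of dataset_2.
stacked_vertically = np.vstack([dataset_1, dataset_2])

# hstack: each row of dataset_1 is extended by the matching row of dataset_2.
stacked_horizontally = np.hstack([dataset_1, dataset_2])

print(stacked_vertically)
print(stacked_horizontally)
```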
Reshaping
• Reshaping in NumPy refers to changing the shape or dimensions of an
array without altering its data.

• This can be crucial for various algorithms that require data to be in a

specific shape.

• For instance, reshaping can help in reducing dimensionality for easier

visualization or preparing data for machine learning models.

dataset.reshape(-1, 2) # reshape dataset to two columns; the number of rows is inferred

np.reshape(dataset, (1, -1)) # reshape dataset to one row; the number of columns is inferred

• -1 is an unknown dimension that NumPy identifies automatically.
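A sketch of reshaping a made-up array of six values, showing how NumPy fills in the dimension given as -1.

```python
import numpy as np

dataset = np.arange(6)  # [0 1 2 3 4 5]

two_columns = dataset.reshape(-1, 2)     # 6 values / 2 columns -> 3 rows
one_row = np.reshape(dataset, (1, -1))   # 6 values / 1 row -> 6 columns

print(two_columns.shape, one_row.shape)
```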

Pandas
• The pandas Python library provides data
structures and methods for manipulating
different types of data, such as numerical and
temporal data (collected from the environment).

• These operations are easy to use and highly

optimized for performance.
Pandas

• Data formats, such as CSV and JSON, and

databases can be used to create DataFrames.

• DataFrames are the internal representation
of data and are very similar to tables, but are
more powerful since they allow you to efficiently
apply operations such as multiplications,
aggregations, and even joins.
Pandas
• For handling missing data, pandas provides built-in
solutions to clean up and augment your
data, meaning it can fill in missing values with
reasonable values.

• Integrated indexing and label-based slicing,
in combination with fancy indexing (what we
already saw with NumPy), make handling data
simple.
Pandas
• More complex techniques, such as reshaping,
pivoting, and melting data, together with the
possibility of easily joining and merging data,
provide powerful tooling so that you can handle your
data correctly.

• When working with time-series data, operations such
as date range generation, frequency
conversion, and moving window statistics are available.
Note:

• The installation instructions for pandas can be found
here: https://pandas.pydata.org/

• The book uses pandas v0.25.3; any v0.25.x
release should be suitable.
Note: Only for your understanding
• Reshaping: Refers to the ability to restructure the layout of data. This could involve converting
data between long and wide formats, rearranging rows and columns, or pivoting data from one
shape to another.

• Pivoting: Involves reshaping data from a long format to a wide format or vice versa. For
example, converting data where each row represents a single observation with multiple
attributes into a table where attributes become columns.

• Melting: Refers to converting data from a wide format to a long format. This is useful when you
want to reshape data so that each row represents a unique observation and variable.

• Joining and merging: Refers to combining data from different sources based on common columns
or indices. Pandas provides efficient methods to merge datasets similar to SQL joins, such as
inner join, outer join, left join, and right join.

• Time-series operations: Pandas provides extensive support for working with time-series data.
This includes generating date ranges, converting frequencies (e.g., from daily data to monthly
data), and calculating statistics over rolling or moving windows of time (e.g., moving averages).
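The operations described in this note can be sketched with pandas as follows. All table contents and column names are made up for illustration, and only one variant of each operation is shown.

```python
import pandas as pd

# Wide format: one column per year.
wide = pd.DataFrame({"city": ["A", "B"], "2020": [10, 20], "2021": [12, 18]})

# Melting: wide -> long, one row per (city, year) observation.
long_form = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Pivoting: long -> wide again.
pivoted = long_form.pivot(index="city", columns="year", values="sales")

# Merging: combine with another table on a common column (like a SQL left join).
regions = pd.DataFrame({"city": ["A", "B"], "region": ["North", "South"]})
merged = long_form.merge(regions, on="city", how="left")

# Time series: generate a daily date range and compute a 3-day moving average.
daily = pd.Series(range(1, 8), index=pd.date_range("2024-01-01", periods=7, freq="D"))
moving_avg = daily.rolling(window=3).mean()

print(long_form)
print(pivoted)
```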
Advantages of pandas over
NumPy
• High level of abstraction: Pandas provides a simpler
and more user-friendly interface for working with data.

• Less intuition: Many methods, such as joining, selecting, and
loading files, can be used without much intuition.

• This means you can perform complex operations

without needing to understand every technical
detail, while still leveraging the full power of
Pandas.
Advantages of pandas over
NumPy
• Faster processing: The internal representation
of DataFrames allows faster processing for
some operations. Of course, this always
depends on the data and its structure

• Easy DataFrame design: DataFrames are
designed for operations with and on large datasets.
Disadvantages of pandas

• Less applicable: Due to its higher abstraction, it's

generally less applicable than NumPy. Especially
when used outside of its scope, operations can get
complex.

• More disk space: Due to the internal representation

of DataFrames and the way pandas trades disk space for
a more performant execution, the memory usage of
complex operations can spike.
Disadvantages of pandas
• Performance problems: Especially when doing heavy
joins, which is not recommended, memory usage can
get critical and might lead to performance problems.
• Hidden complexity: Less experienced users often
tend to overuse methods and execute them several
times instead of reusing what they've already
calculated. The Hidden complexity makes users think
that the operations themselves are simple, which is not
the case.
Exercise 1.04

• Loading a Sample Dataset and Calculating

the Mean using Pandas.
Exercise 1.06:

• Indexing, Slicing, and Iterating Using

pandas
Activity 1.02:
• Forest Fire Size and Temperature Analysis
Visualization
• Comparison Plots : comparing multiple
variables over time.
Ex: line charts, bar charts, and radar charts.
• Relation Plots: Used for showing relationships
among variables.
• Scatter Plots: Used to show relationship
between two variables
• Bubble plots: Used to show relationships for
three variables
Visualization
• Correlograms: Mainly used to show relationship
for variable pairs
• Heat maps: It is mainly used for visualizing
multivariate data.
• Composition plots: mainly used to visualize
variables that are part of a whole.
• Pie charts
• Stacked bar charts
• Stacked area charts and Venn diagrams.
Visualization

• Distribution plots: Used to visualize the
distribution of variables (histograms,
density plots, box plots, and violin plots).

• Geoplots: Used mainly for visualizing
geospatial data (dot maps, connection maps,
and choropleth maps).
Comparison Plots
• Comparison plots include charts that are ideal for
comparing multiple variables or variables over time.

• Line charts are great for visualizing variables over

time.

• For comparison among items, bar charts (also called

column charts) are the best way to go.

• Radar charts or spider plots are great for visualizing

multiple variables for multiple groups.
Line Plots
• Line Chart : Line charts are used to display
quantitative values over a continuous time period
and show information as a series.

• A line chart is ideal for a time series that is connected

by straight-line segments.

• The value being measured is placed on the y-axis,

while the x-axis is the timescale.
The following diagram shows a trend of real estate prices (in
millions of US dollars) across two decades. Line charts are ideal for
showing data trends:
Line Plots
• Uses: Line charts are great for comparing multiple
variables and visualizing trends for both single as well as
multiple variables, especially if your dataset has many time
periods (more than 10).

• For smaller time periods, vertical bar charts might be the better
choice.

Design Practices:

• Avoid too many lines per chart.
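A minimal line chart sketch with Matplotlib, assuming it is installed. The two series and their values are made up; the Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up yearly values for two hypothetical regions (12 time periods).
years = list(range(2000, 2012))
region_a = [1.0, 1.1, 1.3, 1.2, 1.5, 1.7, 1.6, 1.9, 2.1, 2.0, 2.3, 2.5]
region_b = [0.8, 0.9, 0.9, 1.0, 1.1, 1.1, 1.3, 1.4, 1.4, 1.6, 1.7, 1.8]

fig, ax = plt.subplots()
ax.plot(years, region_a, label="Region A")  # time on the x-axis, value on the y-axis
ax.plot(years, region_b, label="Region B")
ax.set_xlabel("Year")
ax.set_ylabel("Price")
ax.legend()  # only two lines, so the chart stays readable
```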

The following figure is a multiple-variable line chart that compares the stock-closing
prices for Google, Facebook, Apple, Amazon, and Microsoft. A line chart is great for
comparing values and visualizing the trend of the stock. As we can see, Amazon shows the
highest growth:
Bar Charts
• Bar Chart: In this, the bar length encodes
the value.
• There are two variants of bar charts
• Vertical bar charts
• Horizontal bar charts.
• Both are used to compare numerical values
across categories.
• Vertical bar charts are sometimes used to
show a single variable over time.
Don'ts of Bar Charts:
• Don't confuse vertical bar charts with histograms.

• Bar charts compare different variables or categories,

while histograms show the distribution for a single
variable.

• Another common mistake is to use bar charts to show

central tendencies among groups or categories.

• Use box plots or violin plots to show statistical

measures or distributions in these cases.
Bar Charts
BAR CHARTS: Design Practices
1. Start at Zero: Always start the axis with numerical
values at zero to avoid misleading representations.

2. Use Horizontal Labels: Keep labels horizontal if
there are few bars and the chart isn’t too crowded.

3. Rotate Labels if Needed: If there’s not enough
space for horizontal labels, rotate them to fit better.
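The design practices above can be sketched with Matplotlib as follows. The categories and values are made up, and the 45-degree rotation is just one reasonable choice for long labels.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up categories with fairly long labels.
categories = ["Customer Support", "Research", "Marketing", "Engineering"]
values = [42, 55, 38, 61]

fig, ax = plt.subplots()
ax.bar(categories, values)               # bar length encodes the value
ax.set_ylim(bottom=0)                    # start the value axis at zero
ax.tick_params(axis="x", labelrotation=45)  # rotate labels that do not fit horizontally
ax.set_ylabel("Score")
```

For a horizontal bar chart, ax.barh(categories, values) can be used instead.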
Radar Chart

• Radar charts (also known as spider or web charts)

visualize multiple variables with each variable
plotted on its own axis, resulting in a polygon.

• All axes are arranged radially, starting at the

center with equal distances between one
another, and have the same scale
Uses

• Radar charts are great for comparing multiple

quantitative variables for a single group or
multiple groups.

• They are also useful for showing which variables

score high or low within a dataset, making them
ideal for visualizing performance
Example:
The following diagram shows a radar chart for a single
variable. This chart displays data about a student
scoring marks in different subjects
The following diagram shows a radar chart for two
variables/groups. Here, the chart explains the marks
that were scored by two students in different
subjects:
Design Practices

• Try to display 10 factors or fewer on a single

radar chart to make it easier to read.

• Use faceting (displaying each variable in a

separate plot) for multiple variables/ groups, as
shown in the preceding diagram, in order to
maintain clarity
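A radar chart can be sketched with Matplotlib's polar axes. The subjects and marks are made up; the first point is repeated at the end so that the polygon closes.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up marks of one student in five subjects.
subjects = ["Math", "Physics", "Chemistry", "Biology", "English"]
marks = [80, 65, 70, 90, 75]

# One axis per subject, arranged radially at equal angles.
angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False)

# Repeat the first point at the end to close the polygon.
closed_angles = np.append(angles, angles[0])
closed_marks = marks + marks[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(closed_angles, closed_marks)
ax.fill(closed_angles, closed_marks, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(subjects)
ax.set_ylim(0, 100)  # same scale on every axis
```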
Activity 2.01: Employee Skill
Comparison
• You are given scores of four employees (Alex, Alice,
Chris, and Jennifer) for five attributes: efficiency,
quality, commitment, responsible conduct, and
cooperation.

• Your task is to compare the employees and their

skills.
Activity 2.01: Employee Skill Comparison

1. Which charts are suitable for this task?

2. You are given the following bar and radar

charts. List the advantages and disadvantages of
both charts. Which is the better chart for this task
in your opinion, and why?

3. What could be improved in the respective

visualizations?
The following diagram shows a bar chart for the employee skills
The following diagram shows a radar chart
for the employee skills:
Relation Plot
• Relation plots are perfectly suited to showing relationships among
variables.

• A scatter plot visualizes the correlation between two variables for one
or multiple groups.

• Bubble plots can be used to show relationships between three

variables. The additional third variable is represented by the dot size.

• Heatmaps are great for revealing patterns or correlations between two

qualitative variables.

• A correlogram is a perfect visualization for showing the correlation between variable pairs.

Scatter Plot
• Scatter plots show data points for two numerical
variables, displaying a variable on both axes.

Uses :

• You can detect whether a correlation (relationship) exists

between two variables.

• They allow you to plot the relationship between multiple

groups or categories using different colors.
Example
The following diagram shows a scatter plot of height
and weight of persons belonging to a single group:
The following diagram shows the same data as in the
previous plot but differentiates between groups. In this
case, we have different groups: A, B, and C:
The following diagram shows the correlation between body mass
and the maximum longevity for various animals grouped by
their classes. There is a positive correlation between body mass
and maximum longevity:
Design Practices

• Start both axes at zero to represent data

accurately.
• Use contrasting colors for data points
and avoid using symbols for scatter
plots with multiple groups or categories.
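A scatter plot with two groups, sketched with Matplotlib. The height and weight values are made up; each group is drawn with its own color, and both axes start at zero as suggested above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up (height in cm, weight in kg) pairs for two groups.
groups = {
    "A": ([150, 160, 165, 172], [50, 58, 61, 70]),
    "B": ([155, 168, 175, 182], [62, 72, 80, 88]),
}

fig, ax = plt.subplots()
for name, (heights, weights) in groups.items():
    ax.scatter(heights, weights, label=name)  # one color per group

ax.set_xlim(left=0)    # start both axes at zero
ax.set_ylim(bottom=0)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.legend(title="Group")
```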
Variants: Scatter Plots with Marginal
Histograms

• In addition to the scatter plot, which visualizes the

correlation between two numerical variables, you can
plot the marginal distribution for each variable in
the form of histograms to give better insight into
how each variable is distributed
Examples
The following diagram shows the correlation between body mass
and the maximum longevity for animals in the Aves class.
Bubble Plot
• A bubble plot extends a scatter plot by introducing
a third numerical variable.

• The value of the variable is represented by the

size of the dots.

• The area of the dots is proportional to the value.

• A legend is used to link the size of the dot to an actual

numerical value
Use

• Bubble plots help to show a correlation

between three variables.
Example
The following diagram shows a bubble plot of the relationship
between the height and age of humans, where the weight
of each person is represented by the size of the bubble:
Design Practices

• The design practices for the scatter plot are

also applicable to the bubble plot.

• Don't use bubble plots for very large

amounts of data, since too many
bubbles make the chart difficult to read
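A bubble plot sketch with Matplotlib. The values are made up, and the scaling factor of 5 is an arbitrary choice for readability. In ax.scatter, the s parameter is the marker area, so passing values proportional to the weight keeps the area proportional to the value.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data for five people.
ages = [12, 25, 37, 48, 60]
heights = [150, 178, 170, 165, 172]
weights = [40, 75, 68, 80, 72]

fig, ax = plt.subplots()
# The third variable (weight) controls the bubble area.
bubbles = ax.scatter(ages, heights, s=[w * 5 for w in weights], alpha=0.5)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Height (cm)")
```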
Correlogram
• A correlogram is a combination of scatter plots and
histograms.

• Histograms will be discussed in detail later in this chapter.

• A correlogram or correlation matrix visualizes the

relationship between each pair of numerical variables
using a scatter plot.
Correlogram

• The diagonals of the correlation matrix represent the

distribution of each variable in the form of a histogram.

• You can also plot the relationship between multiple groups

or categories using different colors.

• A correlogram is a great chart for exploratory data

analysis to get a feel for your data, especially the
correlation between variable pairs.
Examples
The following diagram shows a correlogram
for the height, weight, and age of
humans. The diagonal plots show a
histogram for each variable. The off-
diagonal elements show scatter plots
between variable pairs
Design Practices

• Start both axes at zero to represent data

accurately.

• Use contrasting colors for data points

and avoid using symbols for scatter plots
with multiple groups or categories
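One way to sketch a correlogram is pandas' scatter_matrix helper, which draws scatter plots for each variable pair and histograms on the diagonal. The height, weight, and age values below are made up, and other libraries offer similar plots.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas as pd

# Made-up measurements for six people.
people = pd.DataFrame({
    "height": [150, 160, 165, 172, 178, 182],
    "weight": [50, 58, 61, 70, 75, 88],
    "age": [14, 22, 35, 41, 29, 53],
})

# Off-diagonal cells: scatter plots per variable pair; diagonal: histograms.
axes = pd.plotting.scatter_matrix(people, diagonal="hist")
```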
Note:
• Multivariate data refers to datasets with
multiple variables or attributes measured on
each individual or observation.

• Example: A dataset where each row

represents a person with attributes like age,
income, education level, and health status.
Heatmap
• A heatmap is a visualization where values contained
in a matrix are represented as colors or color
saturation.

• Heatmaps are great for visualizing multivariate data

(data in which analysis is based on more than two
variables per observation), where categorical
variables are placed in the rows and columns and
a numerical or categorical variable is represented
as colors or color saturation.
Use

• The visualization of multivariate data can

be done using heatmaps as they are great
for finding patterns in your data.
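As a rough sketch of the idea, the matrix behind a heatmap can be built from records with two categorical variables and one numerical value, then scaled to a color intensity between 0 and 1. The records and the helper names `build_matrix` and `to_intensity` are hypothetical; a plotting library would then map each intensity to a color.

```python
# Hypothetical records: (product, website, units sold); illustrative values only.
records = [
    ("Laptop", "Site A", 120), ("Laptop", "Site B", 80),
    ("Phone", "Site A", 200), ("Phone", "Site B", 150),
    ("Tablet", "Site A", 40), ("Tablet", "Site B", 60),
]

def build_matrix(records):
    """Place one categorical variable on the rows and the other on the columns."""
    rows = sorted({row for row, _, _ in records})
    cols = sorted({col for _, col, _ in records})
    lookup = {(row, col): value for row, col, value in records}
    matrix = [[lookup.get((row, col), 0) for col in cols] for row in rows]
    return rows, cols, matrix

def to_intensity(matrix):
    """Scale every cell to the range 0..1 (0 = lightest, 1 = darkest)."""
    flat = [value for row in matrix for value in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1  # avoid division by zero when all cells are equal
    return [[(value - lowest) / span for value in row] for row in matrix]

rows, cols, matrix = build_matrix(records)
intensity = to_intensity(matrix)
for name, row in zip(rows, intensity):
    print(name, [round(value, 2) for value in row])
```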
Examples

• The following diagram shows a heatmap for the most
popular products on the electronics category
page across various e-commerce websites, where the
color shows the number of units sold. Darker colors
represent more units sold, as shown in the key:
Variants:

Annotated Heatmaps

• Let's see the same example we saw

previously in an annotated heatmap, where
the color shows the number of units sold:
Design Practice

• Select colors and contrasts that will be easily visible

to individuals with vision problems so that your plots are
more inclusive.
Activity 2.02: Road Accidents
Occurring over Two Decades

Page 85
Composition Plots

• Composition plots are a type of data

visualization used to show how different
parts make up a whole.

• For static data, you can use pie charts,

stacked bar charts, or Venn diagrams.
Pie charts or donut charts
• Pie charts illustrate numerical proportions by dividing a
circle into slices.

• Each arc length represents a proportion of a

category.

• The full circle equates to 100%.

• For humans, it is easier to compare bars than arc

lengths; therefore, it is recommended to use bar
charts or stacked bar charts the majority of the time.
Use

• To compare items that are part
of a whole.
Examples: The following diagram shows
household water usage around the world:
Design Practices

• Arrange the slices according to their size

in increasing/decreasing order, either in
a clockwise or counterclockwise manner.

• Make sure that every slice has a different

color.
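The slice arithmetic can be sketched as follows, assuming hypothetical usage values and a helper named `pie_slices`. Slices are sorted by size, and each one receives a share of the 360-degree circle equal to its proportion of the total.

```python
def pie_slices(values):
    """Return slices ordered from largest to smallest with percent and angles."""
    total = sum(values.values())
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    slices = []
    start = 0.0
    for label, value in ordered:
        fraction = value / total
        sweep = 360.0 * fraction  # the full circle equates to 100%
        slices.append({
            "label": label,
            "percent": 100.0 * fraction,
            "start_deg": start,
            "end_deg": start + sweep,
        })
        start += sweep
    return slices

# Hypothetical household water usage in liters (illustrative values only).
usage = {"Bath": 20, "Toilet": 50, "Laundry": 30}
slices = pie_slices(usage)
for piece in slices:
    print(piece)
```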
Variants: Donut Chart
• An alternative to a pie chart is a donut chart.

• In contrast to pie charts, it is easier to compare the

size of slices, since the reader focuses more on
reading the length of the arcs instead of the area.

• Donut charts are also more space-efficient because

the center is cut out, so it can be used to display
information or further divide groups into subgroups.
The following diagram shows a basic donut
chart:
The following diagram shows a donut
chart with subgroups:
Design Practice

• Use the same color that's used for the

category for the subcategories.

• Use varying brightness levels for the

different subcategories.
Stacked Bar Chart
• Stacked bar charts are used to show how a category is
divided into subcategories and the proportion of the
subcategory in comparison to the overall category.

• You can either compare total amounts across each bar or

show a percentage of each group.

• The latter is also referred to as a 100% stacked bar chart

and makes it easier to see relative differences between
quantities in each group.
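A minimal sketch of the conversion from absolute values to a 100% stacked bar chart, using hypothetical sales figures and an assumed helper name `to_percent_stack`: each subcategory is divided by the total of its own bar.

```python
def to_percent_stack(groups):
    """Convert absolute subcategory values to percentages of each bar's total."""
    result = {}
    for group, parts in groups.items():
        total = sum(parts.values())
        result[group] = {name: 100.0 * value / total for name, value in parts.items()}
    return result

# Hypothetical daily sales split by customer type (illustrative values only).
sales = {
    "Thu": {"smoker": 20, "non_smoker": 60},
    "Fri": {"smoker": 50, "non_smoker": 50},
}
percent = to_percent_stack(sales)
for day, parts in percent.items():
    print(day, parts)
```

After the conversion every bar has the same height, so only the relative shares differ between groups.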
Use

• To compare variables that can
be divided into sub-variables.
Examples: The following diagram shows a generic
stacked bar chart with five groups:
The following diagram shows a 100% stacked bar
chart with the same data that was used in the
preceding diagram:
The following diagram illustrates the daily total
sales of a restaurant over several days. The daily
total sales of non-smokers are stacked on top of
the daily total sales of smokers:
Design Practices
• Use contrasting colors for stacked bars.

• Ensure that the bars are adequately spaced to

eliminate visual clutter. The ideal space guideline
between each bar is half the width of a bar.

• Categorize data alphabetically, sequentially, or

by value, to uniformly order it and make things
easier for your audience.
Stacked Area Chart
• Stacked area charts show trends for part-of-a-
whole relations.

• The values of several groups are illustrated by

stacking individual area charts on top of one
another.

• It helps to analyze both individual and overall

trend information.
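The stacking step can be sketched as a running sum: each group is drawn between the top of the previous group and that top plus its own values. The series values and the helper name `stack_series` are assumptions for illustration.

```python
def stack_series(series):
    """Return, per group, the (lower, upper) boundary of its band at each time step."""
    names = list(series)
    length = len(series[names[0]])
    baseline = [0] * length
    bands = {}
    for name in names:
        top = [base + value for base, value in zip(baseline, series[name])]
        bands[name] = (baseline, top)
        baseline = top  # the next group is stacked on top of this one
    return bands

# Hypothetical values for two groups over three time steps (illustrative only).
series = {"A": [1, 2, 3], "B": [2, 2, 2]}
bands = stack_series(series)
for name, (lower, upper) in bands.items():
    print(name, lower, upper)
```

The upper boundary of the last band is the overall trend, while the thickness of each band shows the individual trend.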
Use

• To show trends for time series

that are part of a whole.
Examples: The following diagram shows a stacked
area chart with the net profits of Google,
Facebook, Twitter, and Snapchat over a decade:
Design Practice

• Use transparent colors to improve

information visibility.

• This will help you to analyze the

overlapping data and you will also be able
to see the grid lines.
Activity 2.03: Smartphone Sales
Units

Refer Page 94
Venn Diagram

• Venn diagrams, also known as set diagrams, show

all possible logical relations between a finite
collection of different sets.

• Each set is represented by a circle.

• The circle size illustrates the importance of a group.

• The size of overlap represents the intersection

between multiple groups.
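The regions of a two-set Venn diagram correspond directly to set operations. A minimal sketch with hypothetical student names and an assumed helper `venn_regions`:

```python
def venn_regions(a, b):
    """Split two sets into the three regions of a two-circle Venn diagram."""
    return {
        "only_a": a - b,  # left circle without the overlap
        "both": a & b,    # the overlap (intersection)
        "only_b": b - a,  # right circle without the overlap
    }

# Hypothetical student groups (illustrative names only).
group_a = {"Ana", "Ben", "Cara", "Dev"}
group_b = {"Cara", "Dev", "Eli"}
regions = venn_regions(group_a, group_b)
for name, members in regions.items():
    print(name, sorted(members))
```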
Use

• To show overlaps for different
sets.
Example: The following diagram shows a Venn diagram
visualizing the intersection of students in two
groups taking the same class in a semester:
Design Practice

• It is not recommended to use Venn

diagrams if you have more than three
groups.

• It would become difficult to understand.

Distribution Plots

• Distribution plots give a deep insight into

how your data is distributed.

• For a single variable, a histogram is

effective.

• For multiple variables, you can either use a

box plot or a violin plot.
Histogram
• A histogram is used because it visually shows how
data is distributed.

• It helps to see where most values are

concentrated, detect patterns, and identify any
outliers.

• It's a simple and effective way to understand the

overall shape and spread of the data.
Histogram
• A histogram visualizes the distribution of a single numerical
variable.

• A histogram shows how often different values occur by

using bars.

• Each bar represents how many times values fall into a

specific range.

• It helps you see where most values are and spot any
unusual ones.

• You can use different colors to compare multiple sets of values.

Use

• Get insights into the underlying

distribution for a dataset.
The following diagram shows the
distribution of the Intelligence Quotient
(IQ) for a test group. The dashed lines
mark one standard deviation on each
side of the mean (the solid line).
A histogram is used here to show how IQ scores are spread out in a test group.
The bars help you see how many people have scores in different ranges. The
solid line shows the average IQ, and the dashed lines show how much scores
typically vary from the average. This helps you understand the overall pattern
and variation in IQ scores.
Design Practice

• Try different numbers of bins (data

intervals), since the shape of the histogram
can vary significantly.
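The effect of the bin count can be checked with a small sketch. The scores are hypothetical and the helper `histogram` uses equal-width bins between the smallest and largest value, which is one common convention and an assumption here.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins
    counts = [0] * bins
    for value in values:
        # The largest value belongs to the last bin rather than a new one.
        index = min(int((value - lowest) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical IQ scores for a small test group (illustrative values only).
scores = [85, 90, 95, 100, 100, 105, 110, 115, 130, 70]
coarse = histogram(scores, bins=3)
fine = histogram(scores, bins=6)
print(coarse)
print(fine)
```

The same ten scores give a different picture with three bins than with six, which is why it helps to try several bin counts.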
Density Plot
• A density plot shows the distribution of a numerical
variable.

• It is a variation of a histogram that uses kernel

smoothing, allowing for smoother distributions.

• One advantage over histograms is that density plots
show the shape of the distribution more reliably,
because the shape of a histogram depends heavily on
the number of bins (data intervals).
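Kernel smoothing can be sketched as placing a small bell curve on every data point and averaging them. The Gaussian kernel, the bandwidth value, the sample data, and the helper name `gaussian_kde` are all assumptions for illustration; plotting libraries usually choose the bandwidth automatically.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a density function built from one Gaussian kernel per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical measurements (illustrative values only).
samples = [1.0, 2.0, 2.5, 3.0, 4.0]
density = gaussian_kde(samples, bandwidth=0.5)

# Evaluate the smooth curve on a grid, as a plotting library would.
grid = [i * 0.5 for i in range(11)]
for x in grid:
    print(x, round(density(x), 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual points more closely, so the bandwidth plays a role similar to the bin width of a histogram.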
Use

• To compare the distribution of

several variables by plotting the
density on the same axis and using
different colors
Example: The following diagram shows a
basic density plot:
The following diagram shows a basic
multi-density plot:
Design Practice

• Use contrasting colors to plot

the density of multiple variables.
Box Plot
• Box: The main part of the plot. It shows where the middle
50% of the data lies.

• Median Line: A line inside the box that shows the middle
value of the data.

• Whiskers: Lines that extend from the box to show the range
of the data outside the middle 50%.

• Outliers: Points beyond the whiskers, shown as circles or

diamonds, representing unusually high or low values.
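The parts of a box plot can be computed as in the sketch below. The heights are hypothetical, and the rule of ending the whiskers at 1.5 times the interquartile range is a widely used convention assumed here, since the slides do not state one. Quartile definitions also differ between tools; this sketch uses the "inclusive" method of Python's `statistics.quantiles`.

```python
from statistics import quantiles

def box_plot_stats(values, whisker_factor=1.5):
    """Compute box, median, whiskers, and outliers for one variable."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box (middle 50% of the data)
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
    }

# Hypothetical heights in centimeters (illustrative values only).
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 210]
stats = box_plot_stats(heights)
print(stats)
```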
Use
• Compare statistical measures for
multiple variables or groups.
Examples
The following diagram shows a basic box plot that
shows the height of a group of people:
The following diagram shows a basic box plot for multiple
variables. In this case, it shows heights for two different groups –
adults and non-adults:
Violin Plot
• Violin plots are a combination of box plots and density

plots.

• Both the statistical measures and the distribution are visualized.

• The thick black bar in the center represents the

interquartile range, while the thin black line corresponds

to the whiskers in a box plot.

• The white dot indicates the median.

• On both sides of the centerline, the density is visualized

Examples
The following diagram shows a violin plot for a
single variable and shows how students have
performed in Math:
From the preceding diagram, we can
see that most of the students
scored around 40-60 in the
Math test.
The following diagram shows a violin plot for two
variables and shows the performance of students
in English and Math:
From the preceding diagram, we can
say that on average, the students
have scored more in English than in
Math, but the highest score was
secured in Math.
The following diagram shows a violin plot for a single
variable divided into three groups, and shows the
performance of three divisions of students in English
based on their score:
From the preceding diagram, we can note
that on average, division C has scored the
highest, division B has scored the lowest,
and division A is, on average, in between
divisions B and C.
Design Practice

• Scale the axes accordingly so that

the distribution is clearly visible and
not flat
Activity 2.04: Frequency of Trains
during Different Time Intervals

Page 104
Geoplots
• Geo plots are a great way to visualize
geospatial data.

• Choropleth maps can be used to compare

quantitative values for different countries,
states, and so on.

• If you want to show connections between

different locations, connection maps are the way
to go.
Dot Map

• In a dot map, each dot represents a certain number

of observations.

• Each dot has the same size and value (the number of
observations each dot represents).

• The dots are not meant to be counted; they are only

intended to give an impression of magnitude.
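Choosing a dot value amounts to dividing each region's count by that value. A minimal sketch, with hypothetical region names and counts and an assumed helper `dot_counts`:

```python
def dot_counts(observations, dot_value):
    """Number of dots per region when each dot stands for `dot_value` observations."""
    # Round to the nearest whole dot; regions below half a dot get none.
    return {
        region: int(count / dot_value + 0.5)
        for region, count in observations.items()
    }

# Hypothetical bus stop counts per region (illustrative values only).
bus_stops = {"Region A": 1200, "Region B": 260, "Region C": 40}
dots = dot_counts(bus_stops, dot_value=100)
print(dots)
```

A smaller dot value keeps sparse regions visible but produces more dots in dense regions, so the value is a trade-off.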
Use

• To visualize geospatial data.

The following diagram shows a dot map where
each dot represents a certain number of bus stops
throughout the world:
Design Practices
• Do not show too many locations.
• You should still be able to see the map to get a
feel for the actual location.
• Choose a dot size and value so that in dense
areas, the dots start to blend.
• The dot map should give a good impression of
the underlying spatial distribution.
Choropleth Map

• In a choropleth map, each tile is

colored to encode a variable.
• For example, a tile can represent a
geographic region such as a county or a
country.
Use

• To visualize geospatial data grouped into
geographical regions, for example, states or
countries.
Example: The following diagram shows a
choropleth map of a weather forecast in the USA:
Design Practices

• Use darker colors for higher values, as they are

perceived as being higher in magnitude.

• Limit the color gradation, since the human eye is limited

in how many colors it can easily distinguish between.
Seven color gradations should be enough.
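Limiting the color gradation means grouping the values into a small number of classes before coloring. The sketch below uses equal-width classes, which is only one of several possible classification schemes; the region names, the values, and the helper `classify` are hypothetical.

```python
def classify(values, classes=7):
    """Assign each region a class index from 0 (lightest) to classes - 1 (darkest)."""
    lowest = min(values.values())
    highest = max(values.values())
    width = (highest - lowest) / classes
    if width == 0:
        # All regions share one value, so they all get the same class.
        return {region: 0 for region in values}
    return {
        # The maximum value belongs to the last class rather than a new one.
        region: min(int((value - lowest) / width), classes - 1)
        for region, value in values.items()
    }

# Hypothetical forecast values per region (illustrative values only).
forecast = {"Region A": 0, "Region B": 35, "Region C": 70, "Region D": 69}
color_class = classify(forecast)
print(color_class)
```

Higher class indices would then be mapped to darker colors.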
Connection Map
• In a connection map, each line represents a certain number of
connections between two locations.

• The link between the locations can be drawn with a straight or

rounded line, representing the shortest distance between them.

• Each line has the same thickness and value (the number of
connections each line represents).

• The lines are not meant to be counted; they are only intended
to give an impression of magnitude.
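On a globe, the shortest path between two locations follows a great circle, which is why the line often appears rounded on a flat map. The haversine formula below is a standard way to compute that distance; the function name `great_circle_km` and the mean Earth radius of 6371 km are assumptions for this sketch.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Two points on the equator, a quarter of the way around the globe apart.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```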
Use

• To visualize connections.

Examples: The following diagram shows a
connection map of flight connections around the
world:
Design Practices

• Do not show too many connections as it will be

difficult for you to analyze the data.

• Choose a line thickness and value so that the lines

start to blend in dense areas. The connection map
should give a good impression of the underlying spatial
distribution
What Makes a Good Visualization?

There are multiple aspects to what makes a good

visualization:

• Most importantly, the visualization should be self-

explanatory and visually appealing.

• To make it self-explanatory, use a legend, descriptive

labels for your x-axis and y-axis, and titles.
What Makes a Good Visualization?
• A visualization should tell a story and be designed for
your audience.
• Before creating your visualization, think about your
target audience.
• Create simple visualizations for a non-specialist
audience and more technical detailed visualizations for
a specialist audience.
• Think about a story to tell with your visualization so that
your visualization leaves an impression on the audience.
Thank you
