Data Visualization Techniques and Tools

This document outlines a module on Data Visualization and Data Exploration, emphasizing the importance of visualizing data for better understanding and insight extraction. It covers data wrangling processes, various visualization tools and libraries, and statistical concepts relevant to data analysis. Additionally, it introduces Python libraries like NumPy and pandas for data manipulation and visualization techniques, along with practical exercises for applying these concepts.

Uploaded by

Divyaraj

Data Visualization

and Data Exploration

Module 4
Syllabus
• Introduction: Data Visualization, Importance of Data
Visualization, Data Wrangling, Tools and Libraries for
Visualization Comparison Plots: Line Chart, Bar Chart
and Radar Chart; Relation Plots: Scatter Plot, Bubble
Plot , Correlogram and Heatmap; Composition Plots: Pie
Chart, Stacked Bar Chart, Stacked Area Chart, Venn
Diagram; Distribution Plots: Histogram, Density Plot,
Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth
Map, Connection Map; What Makes a Good
Visualization?
Textbooks

2. The Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN 9781800568112
Course Outcome

CO4. Evaluate data visualization tools and libraries and plot graphs.
Introduction

• People find it hard to make sense of lots of numbers and random information, but they are good at understanding pictures and visuals.

• Showing data in a visual way helps us understand it better.
Introduction
• Python is a popular tool for working with data.

• It can help turn messy data into a clear format, analyze it, and make it look good with charts and graphs.

• This module will teach you how to use Python and some special tools (like NumPy, pandas, Matplotlib, seaborn, and geoplotlib) to create clear and useful pictures of data.
Introduction to Data
Visualization
• Computers and phones store data like names and
numbers in a digital way.
• Data representation means how we store, process,
and share this data.
• Showing data visually, like with charts or graphs,
helps tell a story and highlight important findings.
• If we don't present data well, it loses its value.
• Good representations make information clear and
easy to understand.
• Data by itself isn't the same as useful information.
• Visual representations help us turn raw data into
insights that are easy to grasp and use.
The Importance of Data
Visualization
Visualizing data has many advantages, such as the
following:

• Complex data can be easily understood.

• A simple visual representation of outliers, target audiences, and futures markets can be created.

• Storytelling can be done using dashboards and animations.

• Data can be explored through interactive visualizations.
Data Wrangling

• Data wrangling is the process of transforming raw data into a suitable representation for various tasks.
The following steps explain the flow of the data wrangling process:
1. First, the Employee Engagement data is in its raw form.

2. Then, the data gets imported as a DataFrame and is later cleaned.

3. The cleaned data is then transformed into graphs, from which findings can be derived.

4. Finally, we analyze this data to communicate the final results.
Steps involved in Data
Wrangling
• Raw Data

• Importing Data

• Cleaning Data

• Transforming Data into Graphs

• Analysing Data

• Communicating Results
Raw Data:

• You start with raw data, which is unprocessed and often messy.

• For example, let's say you have data about employee engagement collected from surveys.
Importing Data
• This raw data is then imported into a tool or
program.

• In Python, we use a structure called a DataFrame


(from a library like pandas) to store and work with
the data.

• Imagine putting all your survey data into a big table


where each row is a survey response and each column
is a question or a piece of information about the
Cleaning Data

• The imported data often has errors, missing values, or inconsistencies.

• Cleaning involves fixing these issues.

• For example, you might remove duplicate responses, fill in missing answers, or correct typos.
Transforming Data into Graphs:

• Once the data is clean, you can transform it into visual formats like charts or graphs.

• This makes it easier to see patterns and trends.

• For instance, you might create a bar chart showing the average engagement score for different departments.
Analyzing Data

• With your graphs ready, you analyze them to draw conclusions.

• You look at the patterns and trends to understand what they mean.

• For example, you might notice that one department has significantly higher engagement scores than others.
Communicating Results

• Finally, you communicate your findings.

• This could be through reports, presentations, or dashboards.

• The goal is to clearly convey the insights you've gained from the data, such as which areas need improvement in employee engagement.
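The wrangling steps described above can be sketched in pandas. This is only a minimal illustration, not the module's own exercise: the survey table, its column names (respondent, department, score), and its values are invented for the example.

```python
import io
import pandas as pd

# Hypothetical raw survey data (invented for illustration):
# it contains one duplicated response and one missing score.
raw_csv = io.StringIO(
    "respondent,department,score\n"
    "1,Sales,4\n"
    "2,Sales,\n"
    "3,IT,5\n"
    "3,IT,5\n"
    "4,IT,3\n"
)

# Importing: load the raw data into a DataFrame.
df = pd.read_csv(raw_csv)

# Cleaning: drop duplicate responses and rows without a score.
clean = df.drop_duplicates().dropna(subset=["score"])

# Transforming / analyzing: average engagement score per department.
avg_score = clean.groupby("department")["score"].mean()
print(avg_score)
```

Dropping incomplete rows is just one cleaning choice; filling missing answers with a reasonable value would be an equally valid alternative.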
Data wrangling process to measure
employee engagement
Tools and Libraries for
Visualization
• Coding Tools
• PYTHON
• R
• MATLAB
• Non Coding Tools
• TABLEAU
NOT PART OF THE SYLLABUS
• MATLAB(https://www.mathworks.com/products/matlab.html)
• R (https://www.r-project.org)
• Tableau (https://www.tableau.com)
Overview of Statistics
• Statistics is a combination of the analysis, collection,
interpretation, and representation of numerical data.

• Probability is a measure of the likelihood that an event will occur and is quantified as a number between 0 and 1.

• A probability distribution is a function that provides the probability for every possible event.

• A probability distribution is frequently used for statistical analysis.
Overview of Statistics cont..
• The higher the probability, the more likely the
event is going to occur.
• There are two types of probability distributions
• Discrete
• Continuous
• A discrete probability distribution describes a random variable with countable or finite outcomes.
• A continuous probability distribution describes a random variable X that can take on any value within a range.
Overview of Statistics cont..
Measures of Central Tendency
Mean
• The arithmetic average is computed by summing up all
measurements and dividing the sum by the number of
observations.
Overview of Statistics cont..
Measures of Central Tendency
Median
• It is the middle value of the ordered dataset.

• If there is an even number of observations, the


median will be the average of the two middle values.

• The median is less prone to outliers compared to


the mean, where outliers are distinct values in data
Overview of Statistics cont..
Measures of Central Tendency
Mode
• The mode is defined as the most frequent
value.
• There may be more than one mode in cases
where multiple values are equally frequent.

For example:
• A die was rolled 10 times, and we got the following numbers: 4, 5, 4, 3, 4, 2, 1, 1, 2, and 1.
• Both 1 and 4 occur three times, so this dataset has two modes: 1 and 4.
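The three measures of central tendency can be checked on the die rolls from the example. The sketch below uses Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
import statistics

# The ten die rolls from the example above.
rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

mean = statistics.mean(rolls)        # sum of the values divided by their count
median = statistics.median(rolls)    # even count: average of the two middle values
modes = statistics.multimode(rolls)  # every value that shares the highest frequency

print(mean, median, modes)
```

`multimode` is used instead of `mode` because this dataset has more than one most frequent value.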
Overview of Statistics cont..
Measures of Dispersion
• Dispersion, also called variability, is the extent to
which a probability distribution is stretched or
squeezed.

• The different measures of dispersion are as follows:


• Variance
• Standard Deviation
• Range
Overview of Statistics cont..
Variance

• Variance is a measure of how spread out the numbers in a


data set are.

• It tells you how much the numbers differ from the average
(mean) value of the data set. If the variance is small, the
numbers are close to the mean. If the variance is large, the
numbers are spread out over a wider range.

• Variance is calculated as follows:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

• where $\mu$ is the mean,
• $N$ is the number of data points, and
• $x_i$ is each individual data point.
Overview of Statistics cont..
Standard deviation
• This is the square root of the variance.
Range
• This is the difference between the largest and smallest
values in a dataset.
Interquartile range
• The Interquartile Range (IQR) is a measure of the spread of
the middle 50% of a data set.
• Also called the midspread or middle 50%, this is the
difference between the 75th and 25th percentiles, or between
the upper and lower quartiles.
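The measures of dispersion above can be computed with NumPy. The following is a small sketch on made-up numbers; note that `np.var` and `np.std` divide by N by default, which matches a variance defined over N data points.

```python
import numpy as np

# Made-up sample values, chosen so the results are easy to verify by hand.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

variance = np.var(data)                  # mean squared distance from the mean
std_dev = np.std(data)                   # square root of the variance
value_range = data.max() - data.min()    # largest minus smallest value
q1, q3 = np.percentile(data, [25, 75])   # lower and upper quartiles
iqr = q3 - q1                            # spread of the middle 50% of the data

print(variance, std_dev, value_range, iqr)
```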
Overview of Statistics cont..
Correlation
• Correlation describes the statistical relationship
between two variables.

• In a positive correlation, both variables move in the


same direction.

• In a negative correlation, the variables move in


opposite directions.

• In zero correlation, there is no linear relationship between the variables.
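As an illustration, the direction of a correlation can be measured with a correlation coefficient. The sketch below assumes the Pearson coefficient, which is what `np.corrcoef` computes; the slides do not name a specific coefficient, and the data is invented.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = 2 * x + 1      # rises as x rises
y_neg = 10 - 3 * x     # falls as x rises

# np.corrcoef returns a correlation matrix; entry [0, 1] is the
# coefficient between the first and the second input.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print(r_pos, r_neg)
```

A coefficient close to +1 indicates a positive correlation, close to -1 a negative one, and close to 0 no linear relationship.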


Overview of Statistics cont..
Types of Data
Types of data
Nominal Data:

• Nominal data is used to label variables without any


quantitative value.

• It is simply a way of categorizing data.

• Examples:
• Gender: Male, Female

• Marital Status: Single, Married, Divorced

• Types of Fruits: Apple, Banana, Orange


Types of data
Ordinal Data:

• Ordinal data is similar to nominal data, but the categories


have a meaningful order or ranking.

• Examples:
• Education Level: High School, Bachelor's, Master's, PhD

• Customer Satisfaction: Very Unsatisfied, Unsatisfied, Neutral,


Satisfied, Very Satisfied
• Movie Ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
Types of data
Temporal Data:

• Temporal data is related to time.

• It deals with any information that is recorded with reference to specific times or dates.

• Examples:
• Dates and times of events (like a log of user activities over time).

• Time series data (like daily temperatures or stock prices over a year).
Types of data
Spatial Data:

• Spatial data is related to space. It deals with information


about the locations and shapes of objects, and their
relationships in space.

• Examples:
• Geographic locations (like the coordinates of cities on a map).

• Maps and routes (like the path of a delivery truck).


Types of data
• The following table gives an overview of which measure of
central tendency is best suited to a particular type of data.
Numpy
• NumPy is used to apply basic mathematical and statistical operations to data stored in multidimensional arrays.

• It provides support for large n-dimensional


arrays and has built-in support for many
high-level mathematical and statistical
operations.
Note on numpy:

• Before NumPy, there was a library called Numeric.

• However, it's no longer used, because NumPy offers more features and better performance, especially when dealing with large and high-dimensional arrays.
ndarray

• NumPy introduced the ndarray, which is a


powerful data structure for handling large
and multi-dimensional arrays.

• It allows efficient storage and manipulation


of numerical data.
Example
import numpy as np
# Creating arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
• Here, a and b are created using np.array().
• This function converts the given lists [1, 2, 3] and [4, 5, 6] into ndarray objects.
• These ndarray objects allow for efficient and optimized numerical operations, which are performed next.
# Adding arrays
c = a + b
print(c) # Output: [5 7 9]

• The addition operation a + b is performed element-wise.
• This means that the first element of a is added to
the first element of b, the second element of a is
added to the second element of b, and so on.
• The result is stored in c, which is also an ndarray.
Numpy
• NumPy provides implementations of all the
mathematical operations we covered in this section
• Built in methods supported by Numpy are:
Mean:
np.mean(dataset) # mean value for the whole dataset
np.mean(dataset[0]) # mean value of the first row
np.mean(dataset[:, 0]) # mean value of the whole first column
np.mean(dataset[1, 0:10]) # mean value of the first 10 elements of the second row
Numpy
Median
np.median(dataset) # median value for the whole dataset
np.median(dataset[-1]) # median value of the last row, using negative (reverse) indexing
np.median(dataset[5:, 0]) # median value of the first column for rows with index >= 5
• dataset[5:, 0] selects all rows from the 6th row onwards (indexing starts at 0, so index 5 refers to the 6th row) and takes only the first column of these rows.
Numpy
• Variance (Var): The variance describes how far a set of
numbers is spread out from their mean.

np.var(dataset) # variance value for the whole dataset
np.var(dataset, axis=0) # axis=0 gives the variance per column
np.var(dataset, axis=1) # axis=1 gives the variance per row
Numpy

• Standard Deviation (std): One of the advantages of the standard deviation is that it is expressed on the same scale as the data.

• This means that the deviation has the same unit as the data itself.
Numpy
• np.std(dataset) # standard deviation for the whole dataset

• np.std(dataset[:2, :2]) # std value of the first 2 rows and first 2 columns

• np.std(dataset, axis=1) # axis=1 gives the standard deviation per row
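The calls listed above can be tried on a small array. This is a sketch with a made-up 2x3 dataset so that each result is easy to check by hand; it is not the dataset used in the exercises.

```python
import numpy as np

# Small illustrative dataset: 2 rows, 3 columns (values are made up).
dataset = np.array([[1, 2, 3],
                    [4, 5, 6]])

overall_mean = np.mean(dataset)              # mean of all six values
first_row_mean = np.mean(dataset[0])         # mean of the first row
first_col_median = np.median(dataset[:, 0])  # median of the first column
var_per_column = np.var(dataset, axis=0)     # one variance per column
std_per_row = np.std(dataset, axis=1)        # one standard deviation per row

print(overall_mean, first_row_mean, first_col_median)
print(var_per_column, std_per_row)
```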


Exercise 1.01:

• Loading a Sample Dataset and Calculating


the Mean using NumPy
Activity 1.01:

• Using NumPy to Compute the Mean,


Median, Variance, and Standard Deviation
of a Dataset
Numpy
Basic NumPy Operations.
• Indexing

• Slicing

• Splitting

• Iterating
Indexing

• Indexing elements in a NumPy array, at a high level, works the same as with built-in Python lists.
• Elements can also be indexed in multi-dimensional matrices.
Indexing
Example:
dataset[0] # index single element in outermost dimension
dataset[-1] # index in reversed order in outermost dimension
dataset[1, 1] # index single element in two-dimensional data
dataset[-1, -1] # index in reversed order in two-dimensional data
Indexing
• Indexing in NumPy arrays works similarly to Python
lists but extends to multi-dimensional data.

• Using positive and negative indices, you can access


elements in both regular and reversed order.

• This makes NumPy arrays versatile for data manipulation


and retrieval in multi-dimensional datasets.
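The indexing forms above can be traced on a small array. The 3x3 dataset below is invented so that each value reveals its row and column.

```python
import numpy as np

# Tens digit = row number, units digit = column number (made-up values).
dataset = np.array([[10, 11, 12],
                    [20, 21, 22],
                    [30, 31, 32]])

first_row = dataset[0]         # whole first row
last_row = dataset[-1]         # whole last row via a negative index
center = dataset[1, 1]         # second row, second column
last_value = dataset[-1, -1]   # last row, last column

print(first_row, last_row, center, last_value)
```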
Slicing

• Slicing has also been adapted from Python's lists.

• We can easily slice parts of arrays into new ndarrays, which is very helpful when handling large amounts of data.
Example:
dataset[1:3] # rows 1 and 2
dataset[:2, :2] # 2x2 subset of the data
dataset[-1, ::-1] # last row with elements reversed
dataset[-5:-1, :6:2] # the 4 rows before the last one, every other element up to (not including) index 6
Note:

• The indexing operation dataset[:2, :2] actually selects the


first two rows and the first two columns of the dataset.

• In NumPy, slicing follows the pattern [start:stop], where


stop is exclusive.

• So :2 means "up to, but not including, index 2", which


includes indices 0 and 1.
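To see the exclusive stop index in action, the slices above can be applied to an array whose values are easy to trace. The 6x8 dataset here is an assumption made for the example.

```python
import numpy as np

# 6 rows x 8 columns, filled with 0..47 so results are easy to trace.
dataset = np.arange(48).reshape(6, 8)

rows_1_2 = dataset[1:3]             # rows with index 1 and 2 (3 is excluded)
top_left = dataset[:2, :2]          # first two rows, first two columns
last_reversed = dataset[-1, ::-1]   # last row, elements in reverse order
subset = dataset[-5:-1, :6:2]       # rows 1 to 4 here, columns 0, 2 and 4

print(top_left)
print(subset.shape)
```

Because the stop index -1 is exclusive, `dataset[-5:-1]` leaves out the last row.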
Splitting
• Splitting data can be helpful in many situations, from
plotting only half of your time series data to
separating test and training data for machine
learning algorithms.

• There are two ways of splitting your data: horizontally and vertically.

• Horizontal splitting can be done with the hsplit method, and vertical splitting with the vsplit method.


Example:

np.hsplit(dataset, 3) # split horizontally (column-wise) into 3 equal sub-arrays

np.vsplit(dataset, 2) # split vertically (row-wise) into 2 equal sub-arrays
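A short sketch makes the two split directions concrete. The 4x6 dataset is made up; its dimensions are chosen so that both splits divide evenly, which `hsplit` and `vsplit` require when given an integer.

```python
import numpy as np

dataset = np.arange(24).reshape(4, 6)   # 4 rows, 6 columns

# hsplit cuts along the column axis: 3 pieces of 2 columns each.
h_parts = np.hsplit(dataset, 3)
# vsplit cuts along the row axis: 2 pieces of 2 rows each.
v_parts = np.vsplit(dataset, 2)

print([part.shape for part in h_parts])
print([part.shape for part in v_parts])
```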
Iterating
• Iterating the NumPy data structures, ndarrays, is
also possible.
• It steps over the whole list of data one after
another, visiting every single element in the
ndarrays once.
nditer & ndenumerate
• nditer is a multi-dimensional iterator object that iterates over one or more arrays.

• ndenumerate additionally yields the position of each element: it provides both the index (in tuple form, for example (0, 1) for the second value of the first row) and the value at that index.
nditer
# iterating over the whole dataset (each value in each row)
for x in np.nditer(dataset):
    print(x)
ndenumerate

# iterating over the whole dataset with indices matching the position in the dataset
for index, value in np.ndenumerate(dataset):
    print(index, value)
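Both iterators can be compared side by side on a tiny array. The 2x2 dataset is invented for the example, and the results are collected into lists only to make them easy to inspect.

```python
import numpy as np

dataset = np.array([[1, 2],
                    [3, 4]])

# nditer visits every element exactly once.
values = [int(x) for x in np.nditer(dataset)]

# ndenumerate yields (index tuple, value) pairs.
indexed = [(index, int(value)) for index, value in np.ndenumerate(dataset)]

print(values)
print(indexed)
```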
Advanced NumPy Operations
• Filtering
• Sorting
• Combining
• Reshaping
FILTERING
• Filtering in NumPy refers to selecting elements from an array
based on certain conditions.

dataset[dataset > 10] # values bigger than 10
np.extract((dataset < 3), dataset) # alternative: values smaller than 3
dataset[(dataset > 5) & (dataset < 10)] # values bigger than 5 and smaller than 10
np.where(dataset > 5) # indices of values bigger than 5
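The filtering expressions above can be run on a small made-up array to see what each one returns. Note that boolean filtering flattens the result to one dimension, while `np.where` returns one index array per axis.

```python
import numpy as np

dataset = np.array([[1, 7, 12],
                    [4, 9, 15]])

bigger_than_10 = dataset[dataset > 10]               # 1-D array of matching values
smaller_than_3 = np.extract(dataset < 3, dataset)    # same idea via np.extract
between = dataset[(dataset > 5) & (dataset < 10)]    # masks combined with &
rows, cols = np.where(dataset > 5)                   # row and column indices of matches

print(bigger_than_10, smaller_than_3, between)
print(list(zip(rows.tolist(), cols.tolist())))
```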


SORTING
• Sorting each row of a dataset can be really useful.

• Using NumPy, we are also able to sort on other dimensions,


such as columns.

• argsort returns the list of indices that would result in a sorted list.

np.sort(dataset) # values sorted on last axis
np.sort(dataset, axis=0) # values sorted on axis 0
np.argsort(dataset) # indices of values in sorted list
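The difference between sorting along the last axis, sorting along axis 0, and argsort is easiest to see on a small example. The values below are made up.

```python
import numpy as np

dataset = np.array([[9, 1, 5],
                    [3, 7, 2]])

sorted_rows = np.sort(dataset)           # sorts within each row (last axis)
sorted_cols = np.sort(dataset, axis=0)   # sorts within each column
order = np.argsort(dataset)              # per row: indices that would sort it

print(sorted_rows)
print(sorted_cols)
print(order)
```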
Combining

• Stacking rows and columns onto an existing


dataset can be helpful when you have two datasets of
the same dimension saved to different files.
vstack & hstack
• vstack stacks dataset_1 on top of dataset_2.

• It gives us a combined dataset with all the rows from dataset_1, followed by all the rows from dataset_2.

• With hstack, we stack our datasets "next to each other," meaning that the elements of the first row of dataset_1 are followed by the elements of the first row of dataset_2.
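A minimal sketch of both stacking directions, using two made-up 2x2 arrays named dataset_1 and dataset_2 as in the description above:

```python
import numpy as np

dataset_1 = np.array([[1, 2], [3, 4]])
dataset_2 = np.array([[5, 6], [7, 8]])

stacked_rows = np.vstack([dataset_1, dataset_2])   # dataset_1 on top of dataset_2
stacked_cols = np.hstack([dataset_1, dataset_2])   # datasets next to each other

print(stacked_rows)
print(stacked_cols)
```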
Reshaping
• Reshaping in NumPy refers to changing the shape or dimensions of an
array without altering its data.

• This can be crucial for various algorithms that require data to be in a


specific shape.

• For instance, reshaping can help in reducing dimensionality for easier


visualization or preparing data for machine learning models.

dataset.reshape(-1, 2) # reshape dataset to two columns; the number of rows is inferred
np.reshape(dataset, (1, -1)) # reshape dataset to one row; the number of columns is inferred

• -1 stands for an unknown dimension that NumPy infers automatically.
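The effect of -1 can be seen on a small array with six elements (an assumption made for this example): NumPy works out the missing dimension from the total number of elements.

```python
import numpy as np

dataset = np.arange(6)                    # six values: 0 to 5

two_columns = dataset.reshape(-1, 2)      # NumPy infers 3 rows
one_row = np.reshape(dataset, (1, -1))    # NumPy infers 6 columns

print(two_columns.shape, one_row.shape)
```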


Pandas
• The pandas Python library provides data structures and methods for manipulating different types of data, such as numerical and temporal data.

• These operations are easy to use and highly


optimized for performance.
Pandas

• Data formats, such as CSV and JSON, and


databases can be used to create DataFrames.

• DataFrames are the internal representation of data in pandas. They are very similar to tables but more powerful, since they allow you to efficiently apply operations such as multiplications, aggregations, and even joins.
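As a rough sketch of how a DataFrame is created from CSV data and then used for a multiplication and an aggregation: the CSV content and the column names (city, year, sales) are invented, and an in-memory string stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path.
csv_data = io.StringIO(
    "city,year,sales\n"
    "A,2019,100\n"
    "A,2020,110\n"
    "B,2019,200\n"
    "B,2020,190\n"
)
df = pd.read_csv(csv_data)

# Multiplication applied to a whole column at once.
df["sales_doubled"] = df["sales"] * 2

# Aggregation: mean sales per city.
mean_per_city = df.groupby("city")["sales"].mean()

print(df)
print(mean_per_city)
```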
Pandas
• For handling missing data, pandas provides built-in solutions to clean up and augment your data, for example by filling in missing values with reasonable values.

• Integrated indexing and label-based slicing, in combination with fancy indexing (which we already saw with NumPy), make handling data simple.
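A small sketch of both ideas, filling a missing value and slicing by label. The DataFrame, its index labels, and the choice of the column mean as the "reasonable value" are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value, indexed by labels instead of positions.
df = pd.DataFrame(
    {"score": [4.0, np.nan, 5.0, 3.0]},
    index=["a", "b", "c", "d"],
)

# Fill the missing value with the column mean (one of several possible strategies).
filled = df.fillna(df["score"].mean())

# Label-based slicing with .loc includes both end labels.
subset = filled.loc["b":"c", "score"]

print(filled)
print(subset)
```

Unlike positional slicing, a label-based slice such as "b":"c" includes its end label.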
Pandas
• More complex techniques, such as reshaping,
pivoting, and melting data, together with the
possibility of easily joining and merging data,
provide powerful tooling so that you can handle your
data correctly.

• When working with time-series data, pandas supports operations such as date range generation, frequency conversion, and moving window statistics.
Note:

• The installation instructions for pandas can be found here: https://pandas.pydata.org/

• The book uses v0.25.3, the latest version when it was written; any v0.25.x release should be suitable for its exercises.
Note: Only for your understanding
• Reshaping: Refers to the ability to restructure the layout of data. This could involve converting
data between long and wide formats, rearranging rows and columns, or pivoting data from one
shape to another.

• Pivoting: Involves reshaping data from a long format to a wide format. For example, converting data where each row holds a single value of one attribute for an observation into a table where the attributes become columns.

• Melting: Refers to converting data from a wide format to a long format. This is useful when you
want to reshape data so that each row represents a unique observation and variable.

• Joining and merging: Refers to combining data from different sources based on common columns
or indices. Pandas provides efficient methods to merge datasets similar to SQL joins, such as
inner join, outer join, left join, and right join.

• Time-series operations: Pandas provides extensive support for working with time-series data.
This includes generating date ranges, converting frequencies (e.g., from daily data to monthly
data), and calculating statistics over rolling or moving windows of time (e.g., moving averages).
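The operations described in this note can be sketched on a tiny table. All data, column names, and the regions lookup table are invented for illustration.

```python
import pandas as pd

# Long format: one row per (date, city) observation.
long_df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "city": ["A", "B", "A", "B"],
    "temp": [10, 20, 12, 22],
})

# Pivoting: long -> wide, one column per city.
wide_df = long_df.pivot(index="date", columns="city", values="temp")

# Melting: wide -> long again.
melted = wide_df.reset_index().melt(id_vars="date", var_name="city", value_name="temp")

# Merging: SQL-style left join on a shared column.
regions = pd.DataFrame({"city": ["A", "B"], "region": ["north", "south"]})
merged = long_df.merge(regions, on="city", how="left")

# Moving-window statistic: 2-row rolling mean for city A.
rolling_a = wide_df["A"].rolling(window=2).mean()

print(wide_df, melted, merged, rolling_a, sep="\n\n")
```

The first rolling value is NaN because a window of two rows is not yet full at the first date.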
Advantages of pandas over
NumPy
• High level of abstraction: Pandas provides a simpler
and more user-friendly interface for working with data.

• Less intuition needed: Methods such as joining, selecting, and loading files can be used without much knowledge of how they work internally.

• This means you can perform complex operations without needing to understand every technical detail, while still leveraging the full power of pandas.
Advantages of pandas over
NumPy
• Faster processing: The internal representation
of DataFrames allows faster processing for
some operations. Of course, this always
depends on the data and its structure

• Easy DataFrame design: DataFrames are designed for operations with and on large datasets.
Disadvantages of pandas

• Less applicable: Due to its higher abstraction, it's generally less applicable than NumPy. Especially when used outside of its scope, operations can get complex.

• Higher memory usage: Due to the internal representation of DataFrames and the way pandas trades memory for faster execution, the memory usage of complex operations can spike.
Disadvantages of pandas
• Performance problems: Especially when doing heavy
joins, which is not recommended, memory usage can
get critical and might lead to performance problems.
• Hidden complexity: Less experienced users often
tend to overuse methods and execute them several
times instead of reusing what they've already
calculated. The Hidden complexity makes users think
that the operations themselves are simple, which is not
the case.
Exercise 1.04

• Loading a Sample Dataset and Calculating


the Mean using Pandas.
Exercise 1.06:

• Indexing, Slicing, and Iterating Using


pandas
Activity 1.02:
• Forest Fire Size and Temperature Analysis
Visualization
• Comparison Plots: Used for comparing multiple variables or variables over time.
Ex: line charts, bar charts, and radar charts.
• Relation Plots: Used for showing relationships
among variables.
• Scatter Plots: Used to show relationship
between two variables
• Bubble plots: Used to show relationships for
three variables
Visualization
• Correlograms: Mainly used to show relationship
for variable pairs
• Heat maps: It is mainly used for visualizing
multivariate data.
• Composition plots: mainly used to visualize
variables that are part of a whole.
• Pie charts
• Stacked bar charts
• Stacked area charts and Venn diagrams.
Visualization

• Distribution plots: Used to visualize the distribution of variables (histograms, density plots, box plots, and violin plots).

• Geoplots: Mainly used for visualizing geospatial data (dot maps, connection maps, and choropleth maps).
Comparison Plots
• Comparison plots include charts that are ideal for
comparing multiple variables or variables over time.

• Line charts are great for visualizing variables over


time.

• For comparison among items, bar charts (also called


column charts) are the best way to go.

• Radar charts or spider plots are great for visualizing


multiple variables for multiple groups.
Line Plots
• Line Chart : Line charts are used to display
quantitative values over a continuous time period
and show information as a series.

• A line chart is ideal for a time series that is connected


by straight-line segments.

• The value being measured is placed on the y-axis,


while the x-axis is the timescale.
The following diagram shows a trend of real estate prices (in millions of US dollars) across two decades. Line charts are ideal for showing data trends:
Line Plots
• Uses: Line charts are great for comparing multiple
variables and visualizing trends for both single as well as
multiple variables, especially if your dataset has many time
periods (more than 10).

• For smaller time periods, vertical bar charts might be the better
choice.

Design Practices:

• Avoid too many lines per chart.
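A line chart along these lines could be drawn with Matplotlib roughly as follows. The two series and their values are made up, and the non-interactive "Agg" backend and the output file name are choices made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up yearly values for two series (12 time periods each).
years = list(range(2010, 2022))
series_a = [10, 11, 13, 12, 15, 17, 18, 21, 20, 24, 27, 30]
series_b = [12, 12, 13, 14, 14, 15, 17, 16, 18, 19, 19, 21]

fig, ax = plt.subplots()
ax.plot(years, series_a, label="Series A")
ax.plot(years, series_b, label="Series B")
ax.set_xlabel("Year")      # the timescale goes on the x-axis
ax.set_ylabel("Value")     # the measured value goes on the y-axis
ax.set_title("Two variables over time")
ax.legend()
fig.savefig("line_chart.png")
```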


The following figure is a multiple-variable line chart that compares the stock-closing
prices for Google, Facebook, Apple, Amazon, and Microsoft. A line chart is great for
comparing values and visualizing the trend of the stock. As we can see, Amazon shows the
highest growth:
Bar Charts
• Bar Chart: In this, the bar length encodes
the value.
• There are two variants of bar charts
• Vertical bar charts
• Horizontal bar charts.
• Both are used to compare numerical values across categories.
• Vertical bar charts are sometimes used to
show a single variable over time.
Don'ts of Bar Charts:
• Don't confuse vertical bar charts with histograms.

• Bar charts compare different variables or categories,


while histograms show the distribution for a single
variable.

• Another common mistake is to use bar charts to show


central tendencies among groups or categories.

• Use box plots or violin plots to show statistical


measures or distributions in these cases.
Bar Charts
BAR CHARTS: Design Practices
1.Start at Zero: Always start the axis with numerical
values at zero to avoid misleading representations.

2.Use Horizontal Labels: Keep labels horizontal if


there are few bars and the chart isn’t too crowded.

3.Rotate Labels if Needed: If there’s not enough


space for horizontal labels, rotate them to fit better.
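The design practices above might be applied in Matplotlib as in this sketch. The categories and scores are hypothetical, and the "Agg" backend and file name are choices made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

categories = ["Sales", "IT", "HR"]   # hypothetical categories
values = [4.1, 3.6, 4.4]             # hypothetical average scores

fig, ax = plt.subplots()
bars = ax.bar(categories, values)            # bar length encodes the value
ax.set_ylim(bottom=0)                        # start the value axis at zero
ax.tick_params(axis="x", labelrotation=0)    # few bars: keep labels horizontal
ax.set_ylabel("Average score")
fig.savefig("bar_chart.png")
```

With many or long labels, a larger `labelrotation` value (for example 45) rotates them to fit.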
Radar Chart

• Radar charts (also known as spider or web charts)


visualize multiple variables with each variable
plotted on its own axis, resulting in a polygon.

• All axes are arranged radially, starting at the


center with equal distances between one
another, and have the same scale
Uses

• Radar charts are great for comparing multiple


quantitative variables for a single group or
multiple groups.

• They are also useful for showing which variables


score high or low within a dataset, making them
ideal for visualizing performance
Example:
The following diagram shows a radar chart for a single
variable. This chart displays data about a student
scoring marks in different subjects
The following diagram shows a radar chart for two
variables/groups. Here, the chart explains the marks
that were scored by two students in different
subjects:
Design Practices

• Try to display 10 factors or fewer on a single


radar chart to make it easier to read.

• Use faceting (displaying each variable in a


separate plot) for multiple variables/ groups, as
shown in the preceding diagram, in order to
maintain clarity
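Matplotlib has no dedicated radar chart function, so one common approach is to draw a closed polygon on polar axes. The sketch below assumes that approach; the subjects and marks are invented.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

subjects = ["Math", "Physics", "Chemistry", "Biology", "English"]
marks = [80, 65, 90, 70, 85]   # invented marks for a single student

# One angle per variable, spaced equally around the circle.
angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False).tolist()
# Repeat the first point at the end so the polygon closes.
closed_angles = angles + angles[:1]
closed_marks = marks + marks[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(closed_angles, closed_marks)
ax.fill(closed_angles, closed_marks, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(subjects)
ax.set_ylim(0, 100)            # the same scale applies to every axis
fig.savefig("radar_chart.png")
```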
Activity 2.01: Employee Skill
Comparison
• You are given scores of four employees (Alex, Alice,
Chris, and Jennifer) for five attributes: efficiency,
quality, commitment, responsible conduct, and
cooperation.

• Your task is to compare the employees and their


skills.
Activity 2.01: Employee Skill Comparison

1. Which charts are suitable for this task?

2. You are given the following bar and radar


charts. List the advantages and disadvantages of
both charts. Which is the better chart for this task
in your opinion, and why?

3. What could be improved in the respective


visualizations?
The following diagram shows a bar chart for the employee skills
The following diagram shows a radar chart
for the employee skills:
Relation Plot
• Relation plots are perfectly suited to showing relationships among
variables.

• A scatter plot visualizes the correlation between two variables for one
or multiple groups.

• Bubble plots can be used to show relationships between three


variables. The additional third variable is represented by the dot size.

• Heatmaps are great for revealing patterns or correlations between two


qualitative variables.

• A correlogram is a perfect visualization for showing the correlation between pairs of variables.


Scatter Plot
• Scatter plots show data points for two numerical
variables, displaying a variable on both axes.

Uses :

• You can detect whether a correlation (relationship) exists


between two variables.

• They allow you to plot the relationship between multiple


groups or categories using different colors.
Example
The following diagram shows a scatter plot of height
and weight of persons belonging to a single group:
The following diagram shows the same data as in the
previous plot but differentiates between groups. In this
case, we have different groups: A, B, and C:
The following diagram shows the correlation between body mass
and the maximum longevity for various animals grouped by
their classes. There is a positive correlation between body mass
and maximum longevity:
Design Practices

• Start both axes at zero to represent data


accurately.
• Use contrasting colors for data points
and avoid using symbols for scatter
plots with multiple groups or categories.
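A scatter plot with two groups distinguished by color could look like this in Matplotlib. The heights and weights are synthetic (generated with a fixed seed), and the group names and colors are arbitrary choices.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)   # fixed seed for reproducibility
fig, ax = plt.subplots()

# Two hypothetical groups, each drawn in its own contrasting color.
for name, color, offset in [("A", "tab:blue", 0), ("B", "tab:orange", 10)]:
    height = rng.normal(170 + offset, 5, size=30)            # synthetic heights (cm)
    weight = 0.9 * height - 90 + rng.normal(0, 4, size=30)   # synthetic weights (kg)
    ax.scatter(height, weight, color=color, label=f"Group {name}")

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.legend()
fig.savefig("scatter_plot.png")
```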
Variants: Scatter Plots with Marginal
Histograms

• In addition to the scatter plot, which visualizes the


correlation between two numerical variables, you can
plot the marginal distribution for each variable in
the form of histograms to give better insight into
how each variable is distributed
Examples
The following diagram shows the correlation between body mass
and the maximum longevity for animals in the Aves class.
Bubble Plot
• A bubble plot extends a scatter plot by introducing
a third numerical variable.

• The value of the variable is represented by the


size of the dots.

• The area of the dots is proportional to the value.

• A legend is used to link the size of the dot to an actual


numerical value
Use

• Bubble plots help to show a correlation


between three variables.
Example
The following diagram shows a bubble plot that relates the height and age of humans, with the weight of each person represented by the size of the bubble:
Design Practices

• The design practices for the scatter plot are


also applicable to the bubble plot.

• Don't use bubble plots for very large


amounts of data, since too many
bubbles make the chart difficult to read
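In Matplotlib, a bubble plot is a scatter plot with a size argument. The sketch below uses invented ages, heights, and weights; the scaling factor of 5 is an arbitrary choice to make the bubbles visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

age = [20, 30, 40, 50]          # invented data
height = [175, 168, 180, 172]
weight = [60, 72, 85, 78]       # third variable, shown as bubble size

fig, ax = plt.subplots()
# The `s` argument is the marker area in points squared, so sizes that are
# proportional to weight make the bubble area proportional to weight.
sizes = [w * 5 for w in weight]
ax.scatter(age, height, s=sizes, alpha=0.5)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Height (cm)")
fig.savefig("bubble_plot.png")
```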
Correlogram
• A correlogram is a combination of scatter plots and
histograms.

• Histograms will be discussed in detail later in this chapter.

• A correlogram or correlation matrix visualizes the


relationship between each pair of numerical variables
using a scatter plot.
Correlogram

• The diagonals of the correlation matrix represent the


distribution of each variable in the form of a histogram.

• You can also plot the relationship between multiple groups


or categories using different colors.

• A correlogram is a great chart for exploratory data


analysis to get a feel for your data, especially the
correlation between variable pairs.
Examples
The following diagram shows a correlogram
for the height, weight, and age of
humans. The diagonal plots show a
histogram for each variable. The
off-diagonal elements show scatter plots
between variable pairs.
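The grid structure of a correlogram (histograms on the diagonal, scatter plots elsewhere) can be sketched in Python. This is a minimal sketch under stated assumptions: matplotlib is assumed for drawing, and `correlogram_layout` and `plot_correlogram` are hypothetical helper names, not part of any library.

```python
def correlogram_layout(names):
    """Map each (row, col) panel to 'hist' on the diagonal, else 'scatter'."""
    n = len(names)
    return {
        (i, j): "hist" if i == j else "scatter"
        for i in range(n)
        for j in range(n)
    }


def plot_correlogram(data):
    """data: dict of variable name -> list of numbers (equal lengths)."""
    import matplotlib.pyplot as plt

    names = list(data)
    n = len(names)
    fig, axes = plt.subplots(n, n, figsize=(2.5 * n, 2.5 * n), squeeze=False)
    for (i, j), kind in correlogram_layout(names).items():
        ax = axes[i][j]
        if kind == "hist":
            ax.hist(data[names[i]], bins=10)
        else:
            # Column variable on the x-axis, row variable on the y-axis.
            ax.scatter(data[names[j]], data[names[i]], s=10)
        if i == n - 1:
            ax.set_xlabel(names[j])
        if j == 0:
            ax.set_ylabel(names[i])
    return fig


layout = correlogram_layout(["height", "weight", "age"])
print(layout[(0, 0)], layout[(0, 1)])
```

For three variables the layout has nine panels, three of which are histograms.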
Design Practices

• Start both axes at zero to represent data
accurately.

• Use contrasting colors for data points
and avoid using symbols for scatter plots
with multiple groups or categories.
Note:
• Multivariate data refers to datasets with
multiple variables or attributes measured on
each individual or observation.

• Example: A dataset where each row
represents a person with attributes like age,
income, education level, and health status.
Heatmap
• A heatmap is a visualization where values contained
in a matrix are represented as colors or color
saturation.

• Heatmaps are great for visualizing multivariate data
(data in which analysis is based on more than two
variables per observation), where categorical
variables are placed in the rows and columns and
a numerical or categorical variable is represented
as colors or color saturation.
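A heatmap of this kind can be sketched in Python as follows. This is a minimal sketch, assuming matplotlib for the drawing step; the helper names `normalize` and `plot_heatmap`, the `annotate` flag, and the `"Blues"` colormap choice are illustrative assumptions. The `normalize` helper shows the idea behind the color encoding: each value is mapped to a position between the smallest and largest value in the matrix.

```python
def normalize(matrix):
    """Map every value to the range 0..1 (0 = smallest, 1 = largest)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero for a constant matrix
    return [[(v - lo) / span for v in row] for row in matrix]


def plot_heatmap(matrix, row_labels, col_labels, annotate=False):
    """Draw the matrix as colors; matplotlib is imported when called."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="Blues")  # darker = higher value
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels)
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    fig.colorbar(image, ax=ax, label="value")  # the key for the colors
    if annotate:
        # Annotated variant: write each value into its cell.
        for i, row in enumerate(matrix):
            for j, value in enumerate(row):
                ax.text(j, i, str(value), ha="center", va="center")
    return fig


print(normalize([[0, 5], [10, 20]]))
```

Passing `annotate=True` gives the annotated variant, in which each cell also shows its number.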
Use

• The visualization of multivariate data can
be done using heatmaps, as they are great
for finding patterns in your data.
Examples

• The following diagram shows a heatmap for the most
popular products on the electronics category
page across various e-commerce websites, where the
color shows the number of units sold. Darker colors
represent more units sold, as shown in the key:
Variants:

Annotated Heatmaps

• The following diagram shows the previous example
as an annotated heatmap, where
the color shows the number of units sold:
Design Practice

• Select colors and contrasts that will be easily visible
to individuals with vision problems so that your plots are
more inclusive.
Activity 2.02: Road Accidents
Occurring over Two Decades

Page 85
Composition Plots

• Composition plots are a type of data
visualization used to show how different
parts make up a whole.

• For static data, you can use pie charts,
stacked bar charts, or Venn diagrams.
Pie charts or donut charts
• Pie charts illustrate numerical proportions by dividing a
circle into slices.

• Each arc length represents a proportion of a
category.

• The full circle equates to 100%.

• For humans, it is easier to compare bars than arc
lengths; therefore, it is recommended to use bar
charts or stacked bar charts the majority of the time.
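The proportion encoding of a pie chart can be sketched in Python. This is a minimal sketch, assuming matplotlib for drawing; `slice_angles` and `plot_pie` are hypothetical helper names. Each category receives a share of the full 360 degrees equal to its share of the total.

```python
def slice_angles(values):
    """Return the angle in degrees of each slice; the angles sum to 360."""
    total = sum(values)
    return [360.0 * v / total for v in values]


def plot_pie(values, labels):
    """Draw a pie chart with slices ordered by size, largest first."""
    import matplotlib.pyplot as plt

    pairs = sorted(zip(values, labels), reverse=True)
    fig, ax = plt.subplots()
    ax.pie(
        [v for v, _ in pairs],
        labels=[name for _, name in pairs],
        autopct="%1.0f%%",   # print the percentage on each slice
        startangle=90,       # start at 12 o'clock
        counterclock=False,  # proceed clockwise
    )
    return fig


print(slice_angles([1, 1, 2]))
```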
Use

• To compare items that are part
of a whole.
Examples: The following diagram shows
household water usage around the world:
Design Practices

• Arrange the slices according to their size
in increasing/decreasing order, either in
a clockwise or counterclockwise manner.

• Make sure that every slice has a different
color.
Variants: Donut Chart
• An alternative to a pie chart is a donut chart.

• Compared with pie charts, it is easier to compare the
size of slices in a donut chart, since the reader focuses
more on the length of the arcs than on the area.

• Donut charts are also more space-efficient because
the center is cut out, so it can be used to display
information or further divide groups into subgroups.
The following diagram shows a basic donut
chart:
The following diagram shows a donut
chart with subgroups:
Design Practice

• Use the same color that's used for the
category for the subcategories.

• Use varying brightness levels for the
different subcategories.
Stacked Bar Chart
• Stacked bar charts are used to show how a category is
divided into subcategories and the proportion of the
subcategory in comparison to the overall category.

• You can either compare total amounts across each bar or
show a percentage of each group.

• The latter is also referred to as a 100% stacked bar chart
and makes it easier to see relative differences between
quantities in each group.
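The difference between a regular and a 100% stacked bar chart is a single normalization step, which can be sketched in Python. This is a minimal sketch, assuming matplotlib for drawing; `to_percent` and `plot_stacked_bars` are hypothetical helper names, and the sample numbers are made up for illustration.

```python
def to_percent(bars):
    """bars: one list of subcategory values per bar -> percentages per bar."""
    return [[100.0 * v / sum(bar) for v in bar] for bar in bars]


def plot_stacked_bars(bars, bar_labels, sub_labels):
    """Stack the subcategories of each bar on top of one another."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    bottoms = [0.0] * len(bars)
    for k, sub in enumerate(sub_labels):
        heights = [bar[k] for bar in bars]
        ax.bar(bar_labels, heights, bottom=bottoms, label=sub)
        # The next subcategory starts where this one ends.
        bottoms = [b + h for b, h in zip(bottoms, heights)]
    ax.legend()
    return fig


totals = [[1, 3], [2, 2]]   # two bars with two subcategories each
print(to_percent(totals))   # every bar now sums to 100
```

Passing `to_percent(totals)` instead of `totals` to `plot_stacked_bars` turns the same data into a 100% stacked bar chart.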
Use

• To compare variables that can
be divided into sub-variables.
Examples: The following diagram shows a generic
stacked bar chart with five groups:
The following diagram shows a 100% stacked bar
chart with the same data that was used in the
preceding diagram:
The following diagram illustrates the daily total
sales of a restaurant over several days. The daily
total sales of non-smokers are stacked on top of
the daily total sales of smokers:
Design Practices
• Use contrasting colors for stacked bars.

• Ensure that the bars are adequately spaced to
eliminate visual clutter. A good guideline is to leave
a gap of half a bar's width between bars.

• Order the data alphabetically, sequentially, or
by value, so that the ordering is uniform and
easier for your audience to follow.
Stacked Area Chart
• Stacked area charts show trends for part-of-a-
whole relations.

• The values of several groups are illustrated by
stacking individual area charts on top of one
another.

• It helps to analyze both individual and overall
trend information.
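The stacking itself is a running sum across the groups, which can be sketched in Python. This is a minimal sketch, assuming matplotlib for drawing; `stacked_tops` and `plot_stacked_area` are hypothetical helper names. The upper edge of the last layer is the overall trend, while the thickness of each layer is the individual trend.

```python
def stacked_tops(series):
    """Return the upper boundary of each layer after stacking."""
    tops = []
    running = [0] * len(series[0])
    for layer in series:
        running = [r + v for r, v in zip(running, layer)]
        tops.append(running)
    return tops


def plot_stacked_area(x, series, labels):
    """Draw a stacked area chart with slightly transparent layers."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.stackplot(x, *series, labels=labels, alpha=0.7)
    ax.legend(loc="upper left")
    return fig


# Two groups measured at two points in time.
print(stacked_tops([[1, 2], [3, 4]]))
```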
Use

• To show trends for time series
that are part of a whole.
Example: The following diagram shows a stacked
area chart with the net profits of Google,
Facebook, Twitter, and Snapchat over a decade:
Design Practice

• Use transparent colors to improve
information visibility.

• This will help you to analyze the
overlapping data and you will also be able
to see the grid lines.
Activity 2.03: Smartphone Sales
Units

Refer to page 94
Venn Diagram

• Venn diagrams, also known as set diagrams, show
all possible logical relations between a finite
collection of different sets.

• Each set is represented by a circle.

• The circle size illustrates the importance of a group.

• The size of overlap represents the intersection
between multiple groups.
Use

• To show overlaps for different
sets.
Example: The following diagram shows a Venn
diagram for students in two groups taking
the same class in a semester:
Design Practice

• It is not recommended to use Venn
diagrams if you have more than three
groups, since they become difficult
to understand.


Distribution Plots

• Distribution plots give a deep insight into
how your data is distributed.

• For a single variable, a histogram is
effective.

• For multiple variables, you can either use a
box plot or a violin plot.
Histogram
• A histogram is used because it visually shows how
data is distributed.

• It helps to see where most values are
concentrated, detect patterns, and identify any
outliers.

• It's a simple and effective way to understand the
overall shape and spread of the data.
Histogram
• A histogram visualizes the distribution of a single numerical
variable.

• A histogram shows how often different values occur by
using bars.

• Each bar represents how many times values fall into a
specific range.

• It helps you see where most values are and spot any
unusual ones.

• You can use different colors to compare multiple sets of values.


Use

• Get insights into the underlying
distribution for a dataset.
The following diagram shows the
distribution of the Intelligence Quotient
(IQ) for a test group. The dashed lines
mark one standard deviation on each
side of the mean (the solid line).
A histogram is used here to show how IQ scores are spread out in a test group.
The bars help you see how many people have scores in different ranges. The
solid line shows the average IQ, and the dashed lines show how much scores
typically vary from the average. This helps you understand the overall pattern
and variation in IQ scores.
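The positions of the solid and dashed lines can be computed as in the following Python sketch. It is a minimal sketch under stated assumptions: the sample standard deviation (`statistics.stdev`) is used, although the slides do not say whether the sample or population version was plotted; `reference_lines` and `plot_hist_with_lines` are hypothetical helper names; and matplotlib is assumed for drawing.

```python
import statistics


def reference_lines(values):
    """Return (mean - std, mean, mean + std) for the given values."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return mean - std, mean, mean + std


def plot_hist_with_lines(values, bins=20):
    """Histogram with a solid line at the mean and dashed lines at +/- 1 std."""
    import matplotlib.pyplot as plt

    low, mean, high = reference_lines(values)
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.axvline(mean, linestyle="-", color="black")
    for x in (low, high):
        ax.axvline(x, linestyle="--", color="black")
    return fig


print(reference_lines([90, 100, 110]))
```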
Design Practice

• Try different numbers of bins (data
intervals), since the shape of the histogram
can vary significantly.
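How the bin count changes the shape can be seen in a small Python sketch. This is a minimal sketch using equal-width bins between the smallest and largest value; `bin_counts` is a hypothetical helper written for illustration, and the data is made up.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against identical values
    counts = [0] * bins
    for v in values:
        # The largest value belongs to the last bin, not to a bin of its own.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
for bins in (2, 4):
    print(bins, bin_counts(data, bins))
```

With two bins the data looks skewed toward the upper half, while four bins show a gradual rise, which is why it is worth trying several settings.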
Density Plot
• A density plot shows the distribution of a numerical
variable.

• It is a variation of a histogram that uses kernel
smoothing, allowing for smoother distributions.

• One advantage over histograms is that density
plots are better at revealing the distribution
shape, since the shape of a histogram depends heavily
on the number of bins (data intervals).
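Kernel smoothing can be sketched in a few lines of Python. This is a minimal sketch, assuming a Gaussian kernel and a hand-picked bandwidth; the slides do not specify either, and `make_kde` is a hypothetical helper. Each observation contributes a small bell curve, and the density is the average of those curves.

```python
import math


def make_kde(values, bandwidth):
    """Return a function estimating the density with a Gaussian kernel."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        ) / norm

    return density


density = make_kde([1.0, 2.0, 2.5, 4.0], bandwidth=0.5)

# Evaluate on a grid; the area under a density curve should be close to 1.
step = 0.01
grid = [i * step for i in range(-500, 1001)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and a larger one smooths more, so the bandwidth plays a role similar to the number of bins in a histogram.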
Use

• To compare the distribution of
several variables by plotting the
density on the same axis and using
different colors.
Example: The following diagram shows a
basic density plot:
The following diagram shows a basic
multi-density plot:
Design Practice

• Use contrasting colors to plot
the density of multiple variables.
Box Plot
• Box: The main part of the plot. It shows where the middle
50% of the data lies.

• Median Line: A line inside the box that shows the middle
value of the data.

• Whiskers: Lines that extend from the box to show the range
of the data outside the middle 50%.

• Outliers: Points beyond the whiskers, shown as circles or
diamonds, representing unusually high or low values.
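The parts of a box plot can be computed as in the following Python sketch. It is a minimal sketch under stated assumptions: quartiles use linear interpolation, and the whiskers reach to the most extreme data points within 1.5 times the interquartile range of the box, a common convention that the slides do not spell out. `quantile` and `box_stats` are hypothetical helpers.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighboring values."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def box_stats(values, whisker=1.5):
    """Return the box, median, whisker ends, and outliers of the data."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1  # height of the box: the middle 50% of the data
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

In this made-up sample the value 100 lies far beyond the upper whisker, so it is reported as an outlier rather than stretching the whisker.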
Use
• Compare statistical measures for
multiple variables or groups.
Examples
The following diagram shows a basic box plot that
shows the height of a group of people:
The following diagram shows a basic box plot for multiple
variables. In this case, it shows heights for two different groups –
adults and non-adults:
Violin Plot
• Violin plots are a combination of box plots and density
plots.

• Both the statistical measures and the distribution are visualized.

• The thick black bar in the center represents the
interquartile range, while the thin black line corresponds
to the whiskers in a box plot.

• The white dot indicates the median.

• On both sides of the centerline, the density is visualized.


Examples
The following diagram shows a violin plot for a
single variable and shows how students have
performed in Math:
From the preceding diagram, we can
see that most of the students
scored around 40-60 in the
Math test.
The following diagram shows a violin plot for two
variables and shows the performance of students
in English and Math:
From the preceding diagram, we can
say that on average, the students
scored higher in English than in
Math, but the highest score was
achieved in Math.
The following diagram shows a violin plot for a single
variable divided into three groups, and shows the
performance of three divisions of students in English
based on their score:
From the preceding diagram, we can note
that on average, division C scored the
highest, division B scored the lowest,
and division A is in between
divisions B and C.
Design Practice

• Scale the axes accordingly so that
the distribution is clearly visible and
not flat.
Activity 2.04: Frequency of Trains
during Different Time Intervals

Page 104
Geoplots
• Geographical plots are a great way to visualize
geospatial data.

• Choropleth maps can be used to compare
quantitative values for different countries,
states, and so on.

• If you want to show connections between
different locations, connection maps are the way
to go.
Dot Map

• In a dot map, each dot represents a certain number
of observations.

• Each dot has the same size and value (the number of
observations each dot represents).

• The dots are not meant to be counted; they are only
intended to give an impression of magnitude.
Use

• To visualize geospatial data.


The following diagram shows a dot map where
each dot represents a certain number of bus stops
throughout the world:
Design Practices
• Do not show too many locations.
• You should still be able to see the map to get a
feel for the actual location.
• Choose a dot size and value so that in dense
areas, the dots start to blend.
• The dot map should give a good impression of
the underlying spatial distribution.
Choropleth Map

• In a choropleth map, each tile is
colored to encode a variable.
• For example, a tile can represent a
geographic region such as a county or
a country.
Use

• To visualize geospatial data grouped into
geographical regions, for example, states or
countries.
Example: The following diagram shows a
choropleth map of a weather forecast in the USA
Design Practices

• Use darker colors for higher values, as they are
perceived as being higher in magnitude.

• Limit the color gradation, since the human eye is limited
in how many colors it can easily distinguish between.
Seven color gradations should be enough.
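Limiting the gradation amounts to sorting each region's value into a small number of classes before coloring, which can be sketched in Python. This is a minimal sketch, assuming equal-width classes between the smallest and largest value; `color_class` is a hypothetical helper and the region names and numbers are made up. Class 0 would get the lightest color and class 6 the darkest.

```python
def color_class(value, lo, hi, classes=7):
    """Return the index (0 = lightest) of the color class for a value."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * classes)
    # Clamp so that the maximum value falls into the darkest class.
    return min(max(index, 0), classes - 1)


regions = {"Region A": 0, "Region B": 35, "Region C": 70}  # hypothetical data
lo, hi = min(regions.values()), max(regions.values())
classes = {name: color_class(v, lo, hi) for name, v in regions.items()}
print(classes)
```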
Connection Map
• In a connection map, each line represents a certain number of
connections between two locations.

• The link between the locations can be drawn with a straight or
rounded line, representing the shortest distance between them.

• Each line has the same thickness and value (the number of
connections each line represents).

• The lines are not meant to be counted; they are only intended
to give an impression of magnitude.
Use

• To visualize connections.

Example: The following diagram shows a
connection map of flight connections around the
world:
Design Practices

• Do not show too many connections, as this makes
the data difficult to analyze.

• Choose a line thickness and value so that the lines
start to blend in dense areas. The connection map
should give a good impression of the underlying spatial
distribution.
What Makes a Good Visualization?

There are multiple aspects to what makes a good
visualization:

• Most importantly, the visualization should be
self-explanatory and visually appealing.

• To make it self-explanatory, use a legend, descriptive
labels for your x-axis and y-axis, and titles.
What Makes a Good Visualization?
• A visualization should tell a story and be designed for
your audience.
• Before creating your visualization, think about your
target audience.
• Create simple visualizations for a non-specialist
audience and more technical detailed visualizations for
a specialist audience.
• Think about a story to tell with your visualization so that
your visualization leaves an impression on the audience.
Thank you
