Fundamentals of Data Handling and Visualization
(ISBN: 978-93-48620-52-1)
Editors
Er. Simran
April 2025
Copyright Editors
Editors: Dr. Sumit Chopra, Er. Navjot Kaur Basra, Er. Simran, Er. Debjit Mohapatra
ISBN: 978-93-48620-52-1
All rights reserved. No part of this publication may be reproduced or transmitted, in any
form or by any means, without permission. Any person who does any unauthorized act in
relation to this publication may be liable to criminal prosecution and civil claims for
damages.
Published by:
BHUMI PUBLISHING
Nigave Khalasa, Tal – Karveer, Dist – Kolhapur, Maharashtra, INDIA 416 207
E-mail: [email protected]
Disclaimer: The views expressed in the book are of the authors and not necessarily of the
publisher and editors. Authors themselves are responsible for any kind of plagiarism found
in their chapters and any related issues found with the book.
PREFACE
In today’s data-driven world, the ability to interpret and communicate insights
from data effectively is more vital than ever. As vast amounts of information are
generated every moment, the challenge lies not just in data collection but in making
sense of it and presenting it in a meaningful and accessible manner. Fundamentals of
Data Handling and Visualization is designed to provide readers with a foundational
understanding of the principles and practices that underpin effective data
visualization.
This book brings together essential concepts ranging from the basics of data
visualization to the use of advanced tools like Tableau, offering a comprehensive
roadmap for students, researchers, and professionals across disciplines. Starting with
an Introduction to Data Visualization, the text progresses to explore how data is
mapped onto various aesthetic dimensions to uncover patterns and relationships.
Readers will gain insights into how to represent trends and uncertainty, and how to
apply the principle of proportional ink, ensuring clarity and honesty in representation.
Further, the book highlights the significance of color usage, a critical
component in enhancing both readability and visual appeal. The latter half of the book
delves into Tableau, a leading data visualization tool, guiding the reader from basic
chart creation to advanced features and dashboard development. It also covers the art
of storytelling through data, emphasizing how narrative techniques can transform
data into compelling and persuasive stories. The final chapter reflects on emerging
trends in the field, preparing readers for the evolving future of data visualization.
Each chapter is structured to blend theoretical concepts with practical
applications, encouraging hands-on exploration and critical thinking. Whether you are
a beginner stepping into the world of data or a practitioner looking to enhance your
visualization skills, this book serves as a valuable resource to help you visualize data
not just beautifully, but meaningfully.
We hope that this book not only equips readers with technical know-how but
also inspires them to approach data visualization as both a science and an art.
- Editors
TABLE OF CONTENTS
Chapter 1
INTRODUCTION TO DATA VISUALIZATION
Navjot Kaur Basra1, Davinder Singh2, Kiranjit Kaur3
1GNA University, Phagwara
2LKCTC, Jalandhar
3IKG PTU Main Campus, Kapurthala
1.1 Definition:
Data visualization is the graphical representation of information and data. It uses visual components such as charts, graphs, and maps to help individual users and organizations understand trends, identify patterns, and detect outliers in their datasets. Ultimately, it aims to make complex data more digestible, comprehensible, and useful for decision-making [1].
Data visualization is not so much about making the data pretty as it is about making sure the data conveys information. It helps one understand distributions, identify correlations, and detect underlying trends that may not be obvious from raw numeric data. For example, hospitals in the healthcare sector rely on data visualization to maintain patient health records and predict disease outbreaks. A dashboard (as shown in fig. 1.1) offers real-time patient statistics and assists medical staff in detecting patterns of disease progression to facilitate resource allocation [1].
Categorical Data (examples):
• Gender
• Religion
• Method of treatment
• Type of teaching approach
• Marital status
• Qualifications
• Native language
• Type of instruction
• Problem-solving strategy used
• Social classes
3. Time-Series Data:
Time-series data sets are used to observe changes over time and may include sequential data such as stock prices, weather patterns, and web traffic. They are arranged to record progress in chronological order, with a timestamp or date on each observation (as shown in fig. 1.5). Such a structure makes it possible to identify trends and patterns over time.
Example: Consider the heights of seven students in a class recorded in the following table. The only variable here is height; no relationship or cause is involved (as shown in Table 1).
Table 1: Univariate Data
Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186
For example, the click rate can be assessed for men and women, and relationships among the variables can then be examined. Multivariate data resembles bivariate data except that it has more than one dependent variable.
Applications:
• It is used in financial analysis (e.g., stock price correlations).
• It helps in feature selection for machine learning.
2. Covariance Analysis: Covariance is a measure of how two variables vary together. Covariance is not standardized: its value depends on the scales of the variables, so it can change dramatically when the units of measurement change.
• Positive Covariance: Variables move in the same direction.
• Negative Covariance: Variables move in opposite directions.
• Zero Covariance: No relationship.
• Formula:
cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / n
Where:
• Xi, Yi = Data points
• X̄, Ȳ = Means of X and Y
• n = Number of observations
Example:
Covariance between revenue and advertising spend.
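A minimal Python sketch of this computation, using hypothetical revenue and advertising-spend figures (numpy is assumed to be available, as elsewhere in this book):
import numpy as np
# Hypothetical illustrative figures (in $1000s); not taken from the text
revenue = np.array([120, 150, 170, 200, 230])
ad_spend = np.array([10, 12, 15, 18, 22])
# Population covariance, matching the formula above: sum((X - X̄)(Y - Ȳ)) / n
cov_manual = np.mean((revenue - revenue.mean()) * (ad_spend - ad_spend.mean()))
# np.cov returns the covariance matrix; ddof=0 uses n in the denominator
cov_numpy = np.cov(revenue, ad_spend, ddof=0)[0, 1]
print(cov_manual, cov_numpy)  # the two values agree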
3. Regression Analysis: Regression analysis establishes the relationship between dependent and independent variables for predictive and inferential analyses.
Types of Regression:
Linear Regression: Linear regression constitutes a statistical method used to model the
relationship between a dependent variable and one or more independent variables. It provides
insights that are valuable from a prediction and data analysis standpoint.
Formula:
𝐘 = 𝛃𝟎 + 𝛃𝟏𝐗 + 𝛜
Example: Predicting house prices based on square footage.
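A minimal sketch of fitting such a model by least squares with numpy, using hypothetical square-footage and price values:
import numpy as np
# Hypothetical illustrative data: square footage (X) and house price (Y)
sqft = np.array([800, 1000, 1200, 1500, 1800])
price = np.array([150000, 180000, 210000, 260000, 300000])
# Fit Y = beta0 + beta1*X; polyfit returns the highest-degree coefficient first
beta1, beta0 = np.polyfit(sqft, price, 1)
print(beta0, beta1)
# Predicted price for a 1600 sq ft house
print(beta0 + beta1 * 1600)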
4. Logistic Regression: Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. For binary classification, the model applies a sigmoid function to a linear combination of the independent variables, producing outputs between 0 and 1 that are interpreted as probabilities.
For example, suppose there are two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the input is assigned to Class 1; otherwise, it is assigned to Class 0. It is referred to as regression because it extends linear regression, but it is mainly used for classification problems [7].
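A short sketch of the sigmoid transformation and the 0.5 threshold rule, using hypothetical coefficients and inputs:
import numpy as np
def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))
# Hypothetical fitted coefficients (intercept and slope) and inputs
beta0, beta1 = -4.0, 0.08
hours_studied = np.array([20, 45, 70])
probabilities = sigmoid(beta0 + beta1 * hours_studied)
predicted_class = (probabilities > 0.5).astype(int)  # Class 1 if probability > 0.5, else Class 0
print(probabilities, predicted_class)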
1.2.6 Visualization Methods for Exploring Data Relationships:
Scatter Plots: A scatter plot is a graph used in statistics and mathematics to plot data. The scatter plot, or
scatter diagram or scatter chart, uses dots to represent the values of two different numerical variables. The
position of each dot on the horizontal and vertical axes represents values for an individual data point (as
shown in fig. 1.8).
Applications of heatmaps: Heat maps are versatile and can be used in different scenarios:
i. Website optimization to study user behaviours and optimize the design.
ii. Financial analysis to visualize performance metrics and point out growing areas in need of
improvement.
iii. Marketing to observe campaign performance and customer engagement.
iv. Scientific inquiry to analyze genetic data and other complex datasets.
v. Geographical analysis through graphical representations of spatial data, including population
density, crime rate, or welfare patterns.
vi. Sports analytics to analyze players' movement, game strategy, or performance metrics.
Line graph: A line graph visually compares two variables, represented on the x-axis and y-axis. It illustrates the information by joining the plotted coordinates on a grid with a continuous line.
There are three different types of line graphs [9]. They include:
i. The simple line graph – a single line is presented on the graph.
ii. The multiple line graph – more than one line on the same set of axes. It is used for the effective comparison of like items over the same period.
iii. The compound line graph – used when the data can be divided into two or more types. A compound line graph has lines drawn from the total to show each component part of the total. The top line represents the total, and below it is a line representing a part of the total. The distance between two lines represents the size of each part (as shown in fig. 1.10).
Using bar graphs, it is possible to compare students' scores in different subjects as well as to represent the scores of a single pupil in each subject. It is also easy to produce a bar graph for every student across all subjects.
Key Techniques:
a) Scaling and Normalization: Rescale numerical values to a uniform range (e.g., 0 to 1) to avoid
skewed charts. Example: Standardizing income values in thousands instead of full dollar
amounts.
b) Aggregation: Summarize large datasets to make visualizations more readable. Example:
Showing monthly sales instead of daily sales for better trend analysis.
c) Binning (Grouping Data into Ranges): Converts continuous data into categorical ranges.
Example: Grouping ages into "18-25", "26-35", "36-45", etc., for bar charts.
d) Feature Engineering: Creating new relevant variables to enhance visualization. Example: Converting timestamps into "weekday vs. weekend" to analyze traffic patterns.
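A minimal pandas sketch of techniques (a), (c), and (d) above, using hypothetical values:
import pandas as pd
# Hypothetical illustrative data (not from the text)
df = pd.DataFrame({
    "age": [19, 27, 34, 42, 58],
    "income": [28000, 52000, 61000, 75000, 90000],
    "date": pd.to_datetime(["2024-01-06", "2024-01-07", "2024-01-08", "2024-01-13", "2024-01-14"]),
})
# (a) Min-max scaling of income to the 0-1 range
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
# (c) Binning ages into categorical ranges
df["age_group"] = pd.cut(df["age"], bins=[18, 25, 35, 45, 60], labels=["18-25", "26-35", "36-45", "46-60"])
# (d) Feature engineering: weekday vs. weekend derived from the timestamp
df["day_type"] = df["date"].dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
print(df)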
3. Formatting Data for Visualization Tools: Different visualization tools require data to be structured in
a specific way.
a) Tidy Data Format
• Each row represents a single observation.
• Each column represents a variable.
• Each cell contains a single value.
Example:
Date Product Sales Category
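A minimal sketch of a tidy dataset with these columns, using hypothetical rows (not the book's example values):
import pandas as pd
# Each row is one observation, each column one variable, each cell one value
tidy = pd.DataFrame({
    "Date": ["2024-03-01", "2024-03-01", "2024-03-02"],
    "Product": ["Laptop", "Phone", "Laptop"],
    "Sales": [1200, 800, 950],
    "Category": ["Electronics", "Electronics", "Electronics"],
})
print(tidy)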
insights to be presented: bar charts for comparisons, line charts for trends, scatter plots for correlations, and maps for geographic data. The elements of good design are clear labeling, consistent colors, and avoidance of clutter to aid legibility. Interactivity, such as filters and tooltips, engages the user in tools like Tableau and Power BI. The completed visuals are then validated for accuracy and effectiveness to ensure that they convey the intended insight. Finally, to support data-driven decision-making, the visuals are incorporated into reports, dashboards, or presentations [10].
1.3 Seven Stages of Data Visualization
• Uses hierarchical or graph-based storage: the data is often structured as JSON, XML, or NoSQL documents.
• Requires special querying methods: it cannot be queried in the conventional SQL manner; instead, operations are performed using tools such as XPath, XQuery, or NoSQL query languages.
iii. Unstructured Data:
Unstructured data refers to information that has no defined model, format, or organization.
Unstructured data is raw information and thus has to use advanced techniques for analysis
including machine learning, NLP, and AI. In contrast to structured data, which typically sits in
relational databases, or semi-structured data, like those bearing some organization (for instance,
XML or JSON), analysis of unstructured data takes a more advanced form.
Features of Unstructured Data:
• No fixed format: it lacks a predefined schema, which makes it unsuitable for storage in traditional relational databases.
• Diverse and complex: information may take the form of text, pictures, videos, or audio.
• High volume and growth: it accounts for over 80% of the world's stored data and is growing constantly.
• Difficult to query and analyze: it requires sophisticated tools such as artificial intelligence, natural language processing, and deep learning to retrieve meaningful information.
• Dense with information: such content contains important signals concealed within raw text, speech, or multimedia.
b. Ordinal Data:
Ordinal data is a type of qualitative (categorical) data that consists of categories ordered in a meaningful
way, but it does not guarantee that the interval between these categories is constant or can be measured.
This means ordinal data allows for comparison to the extent that one category can be said to be higher or lower than another, but the exact distance between the categories cannot be measured (as shown in fig 1.21). Example:
a) Customer Satisfaction Ratings: A survey asks customers to rate their satisfaction with a
product:
• Very Dissatisfied (1)
• Dissatisfied (2)
• Neutral (3)
• Satisfied (4)
• Very Satisfied (5)
The responses have a clear ranking, but the difference between "Neutral" and "Satisfied" may not
be equal to the difference between "Dissatisfied" and "Neutral".
b) Education Levels: Levels of education ranked in order:
• High School
• Bachelor’s Degree
• Master’s Degree
• PhD
A PhD is ranked higher than a Master’s, but the difference in knowledge or skill between each
level is not exactly measurable.
c) Movie Ratings: When rating movies, people often use stars:
(1-star)
(2-stars)
(3-stars)
(4-stars)
(5-stars).
A 5-star movie is better than a 3-star movie, but the difference between 2-star and 3-star might not
be the same as between 4-star and 5-star.
d) Economic Class
• Low Income
• Middle Income
• High Income
v. Quantitative Data:
Quantitative data is numerical data that can be precisely measured, counted, and expressed in numbers. It represents quantities, which makes it possible to perform mathematical operations such as addition, subtraction, and averaging.
Quantitative data answers questions like:
• How much?
• How many?
• How long?
• What is the numerical value of...?
For example:
• Student test scores: 85, 90, 78, 92
• Company revenue: $50,000, $75,000, $100,000
• Daily temperature: 25°C, 30°C, 28°C
Characteristics of Quantitative Data:
a) Numerical representation: values are expressed in numbers rather than words.
b) Quantifiable: measurable attributes such as height, weight, income, and temperature.
c) Mathematical operations: supports arithmetic operations such as addition, subtraction, and averaging.
d) Statistical analysis: mean, median, standard deviation, correlation, etc.
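A quick pandas sketch of such operations on the test-score example above:
import pandas as pd
scores = pd.Series([85, 90, 78, 92])   # student test scores from the example above
print(scores.mean())     # 86.25
print(scores.median())   # 87.5
print(scores.std())      # sample standard deviation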
Types of Quantitative Data
Quantitative data is categorized into two main types (as shown in fig 1.25):
Representation of Discrete Data:
a) Primary Source:
A primary source refers to data originally collected directly from the source. It provides the researcher with fresh or raw information connected with the statistical study. In other words, primary sources give the researcher direct access to the topic of investigation, for example statistical data, art pieces, and transcripts of interviews.
Manual Data Collection:
Manual data collection refers to collecting data by hand, without mechanical tools or software. It involves any human effort to record, enter, validate, or organize information extracted from sources ranging from surveys to observations, interviews, and written records. Manual data collection is prone to human error and is time-consuming, so it is applied only when automated procedures are impossible or when precision calls for human scrutiny.
Methods of Manual Data Collection:
1. Surveys & Questionnaires
• Data is collected by asking individuals specific questions through paper-based or digital forms.
• Examples: Customer satisfaction surveys, and employee feedback forms.
2. Observations
• Researchers manually record data by observing people, events, or processes.
• Examples: Tracking customer behaviour in a store, monitoring weather conditions.
3. Interviews & Focus Groups
• Information is collected by conducting in-person or virtual conversations.
• Examples: Job interviews, and market research discussions.
4. Data Entry from Documents
• Manually transcribing data from physical or digital sources into spreadsheets or databases.
• Examples: Entering patient records in hospitals, and compiling sales reports.
5. Direct Measurements
• Individuals record data from physical instruments or experiments.
• Examples: Taking temperature readings and manually counting foot traffic in a store.
b) Secondary Source:
The information presented in secondary sources is mostly a reprocessing of data obtained from primary sources by institutions or agencies that have already collected it through primary research. This means that the researcher does not have first-hand, raw quantitative information related to their study; secondary sources instead interpret, describe, or synthesize information from primary sources. Examples include reviews, government websites that publish surveys and data, academic books, published journals, and articles. While primary sources lend credibility that is largely based on evidence, well-rounded research requires data collection from both primary and secondary sources.
Automated Data Collection:
Automated data collection describes the gathering and recording of data by software, scripts, sensors, or algorithms, without human intervention. It is faster and more efficient and involves far fewer human errors than manual data collection. Automated methods are frequently applied in industries such as finance, healthcare, marketing, and IoT (Internet of Things).
Methods of Automated Data Collection
a) Web Scraping: Extracting data from websites using automated tools or scripts.
Examples:
• Gathering product prices from e-commerce websites.
• Extracting news headlines for sentiment analysis.
• Tools Used: BeautifulSoup, Scrapy, Selenium.
b) APIs (Application Programming Interfaces): Fetching data from online services in real time (see the sketch after this list).
Examples:
• Google Analytics API for website traffic insights.
• Twitter API for collecting social media trends.
• Tools Used: REST APIs, GraphQL, Postman.
c) IoT & Sensor Data Collection: Devices automatically capture and transmit data.
Examples:
• Smart home sensors measure temperature and humidity.
• GPS trackers recording vehicle movements.
• Tools Used: Arduino, Raspberry Pi, AWS IoT.
d) Database & Log File Extraction: Collecting structured data from databases without manual
input.
Examples:
• Automatic data retrieval from SQL databases.
• Extracting user activity logs from servers.
• Tools Used: SQL queries, NoSQL databases, Logstash.
e) Optical Character Recognition (OCR): Converting printed or handwritten text into digital
format.
Examples:
• Scanning and digitizing invoices or receipts.
• Converting paper-based medical records into digital files.
• Tools Used: Tesseract OCR, Google Vision API.
f) AI & Machine Learning-Based Data Collection: Using AI to analyze and collect unstructured
data.
Examples:
• AI chatbots collecting customer feedback.
• Machine learning models extracting insights from images or videos.
• Tools Used: TensorFlow, OpenAI APIs, Natural Language Processing (NLP).
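As a minimal, hedged sketch of method (b) above, the snippet below fetches JSON from a hypothetical REST endpoint with the requests library; the URL, parameters, and fields are placeholders, not a real service:
import requests
import pandas as pd
# Hypothetical endpoint; replace with a real API and add authentication as required
url = "https://api.example.com/v1/metrics"
response = requests.get(url, params={"metric": "daily_visits"}, timeout=10)
response.raise_for_status()      # fail loudly on HTTP errors
records = response.json()        # assumes the API returns a JSON list of records
df = pd.DataFrame(records)       # tabulate for later cleaning and visualization
print(df.head())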
Example of Acquiring data:
A copy of the zip code listing can be found on the U.S. Census Bureau website, as it is frequently
used for geographic coding of statistical data. The listing is a freely available file with
approximately 42,000 lines, one for each of the codes, a tiny portion of which is shown in Figure
1.29.
Fig. 1.29: Zip codes in the format provided by the U.S. Census Bureau.
b) Parse: Once you have the data, it must be parsed and transformed into a form that assigns each data component a purpose. Each line of the file must be split into its components; in this case, it must be delimited at every tab character. Then, each component of the data must be converted into a useful format. The composition of each line of the census listing is illustrated in Figure 1.30 and needs to be understood before we can parse it and extract what we require [11].
The operations performed in the field of data transformation include data cleaning, data filtering,
data aggregating, data restructuring, and data normalization.
Steps in Data parsing:
1. Data Cleaning
• Purpose: Remove inconsistencies, duplicates, and errors.
• Methods:
• Handle missing values (fill, interpolate, or remove).
• Standardize data formats (date, currency, text).
• Remove duplicate entries.
2. Data Integration
• Purpose: Combine multiple data sources into a unified dataset.
• Methods:
• Merge databases (SQL JOIN, Pandas merge()).
• Consolidate multiple Excel sheets or CSV files.
• Integrate data from APIs or external sources.
3. Data Aggregation
• Purpose: Summarize detailed data for easier analysis.
• Methods:
• Calculate totals, averages, or counts.
• Group data by time periods (e.g., daily → monthly sales).
• Aggregate data at different levels (e.g., city → country).
4. Data Normalization & Scaling
• Purpose: Standardize values across different scales for accurate visualization.
• Methods:
• Min-Max Scaling: Normalize values between 0 and 1.
• Z-Score Normalization: Convert values to standard deviations from the mean.
• Log Transformation: Reduce the impact of large numerical differences.
5. Data Encoding & Formatting
• Purpose: Convert categorical data into numerical formats for analysis.
• Methods:
• One-Hot Encoding: Convert categorical values into binary columns.
• Label Encoding: Assign numerical values to categories (e.g., Male = 0, Female = 1).
• Reformat Dates & Times: Convert time zones and standardize formats (YYYY-MM-
DD).
6. Data Filtering & Selection
• Purpose: Remove unnecessary data to focus on relevant insights.
• Methods:
• Remove outliers that skew visualizations.
• Filter data based on conditions (e.g., exclude users with incomplete profiles).
• Select key columns needed for visualization.
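A minimal pandas sketch touching several of these steps (cleaning, aggregation, normalization, encoding, and filtering) on hypothetical sales records:
import pandas as pd
import numpy as np
# Hypothetical raw records (illustrative only)
raw = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-03", "2024-02-10", "2024-02-11", None],
    "region": ["North", "North", "South", "South", "East"],
    "sales": [1200.0, 1200.0, np.nan, 900.0, 400.0],
})
# Step 1 (cleaning): drop duplicates, fill missing values, standardize the date format
df = raw.drop_duplicates().copy()
df["sales"] = df["sales"].fillna(df["sales"].median())
df["date"] = pd.to_datetime(df["date"])
df = df.dropna(subset=["date"])
# Step 3 (aggregation): monthly totals per region
monthly = df.groupby([df["date"].dt.to_period("M"), "region"])["sales"].sum()
# Step 4 (normalization): min-max scaling of sales to the 0-1 range
df["sales_scaled"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())
# Step 5 (encoding): one-hot encode the region column
df = pd.get_dummies(df, columns=["region"])
# Step 6 (filtering): keep only rows above a threshold
df = df[df["sales"] > 500]
print(monthly)
print(df)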
Tools for Data parsing:
ii. Cross-check data sources for correctness.
Fig. 1.31: Mining the data: Just compare values to find the minimum and maximum.
Objectives of Data Mining
1. Identify hidden patterns and trends in the data.
2. Find correlations and relationships between variables.
3. Detect anomalies or outliers that may indicate unusual events.
g) Interact: The subsequent phase of the process introduces interaction, allowing the user to
manipulate or investigate the data. Interaction may include such things as the selection of a subset
of the data or viewpoint change. As a second example of a phase influencing an earlier
component of the process, this phase can influence the refinement step as well, since a change in
viewpoint may necessitate that the data be designed differently.
In the Zip decode project, entering a number selects all zip codes that start with that number. Fig
1.34 and Fig 1.35 illustrate all the zip codes starting with zero and nine, respectively [11].
Fig. 1.34: The user can alter the display through choices (zip codes starting with 0).
Fig. 1.35: The user can alter the display through choices (zip codes starting with 9).
Another addition to user interaction enables users to slide laterally through the display and cycle through the various prefixes. By holding down the Shift key after entering part or all of a zip code, users can overwrite the last digit typed without pressing the Delete key to go backwards.
Typing is a very rudimentary type of interaction, yet it enables the user to quickly develop an
impression of the structure of the zip code system. Just compare this sample application to the
effort required to infer the same information from a table of city names and zip codes.
You can still type digits and observe what the area looks like when covered by the next group of
prefixes. Fig 1.36 depicts the area under the two digits 02, Fig 1.37 depicts the three digits 021,
and Fig 1.38 depicts the four digits 0213. At last, Fig 1.39 depicts what happens when you input a
full zip code, 02139—a city name appears on the display [11].
Fig. 1.36: Refining Zip Code Representation with Two Digits (02).
1.4.1 Tableau:
Tableau is an excellent data visualization tool, a favourite of data analysts, scientists, statisticians, and others who use it to gain clearer insights from their computations. Tableau is widely known because it can ingest data and deliver the required visualization output in the least amount of time possible. In other words, it can turn your data into insights that can be used to drive future actions. It achieves all of this while providing a high level of security, and security issues are addressed as soon as they arise or are reported by users.
Tableau is also capable of data preparation, cleansing, formatting, visualization, sharing insights, and publishing to other end users. Through data queries, you can gather insights from your visualizations, and these data are managed at an organizational scale using Tableau. Many regard it as a lifesaver for Business Intelligence, since one can handle data without highly technical knowledge. Hence, Tableau may be used by individuals or at scale for the benefit of business teams and their organizations. Several organizations such as Amazon, Lenovo, Walmart, and Accenture already utilize Tableau. There are different Tableau products for different types of users, whether individual or enterprise-oriented; these are depicted in fig 1.41 [12].
# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# A random time series of 1000 daily values starting on 1/1/2000
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

# A DataFrame of four random series sharing the same date index
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))

# Two cumulative random walks plus a simple counter column
df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))

# Plot a single row (row 5) of df3 as a bar chart, with a reference line at zero
df3.iloc[5].plot.bar()
plt.axhline(0, color='k')
plt.show()
Output of the code is shown in Fig 1.43
Output:
In other words, Microsoft Excel is software created for organizing numbers and data with the help of spreadsheets. It is commonly used all over the world by companies of different sizes to carry out financial analysis (as shown in fig 1.46) [17].
Chapter 2
MAPPING DATA ONTO AESTHETICS
Simran1, Satyam Sharma2 and Gurpreet Kaur3
1,2GNA University, Phagwara
3Lovely Professional University, Jalandhar
Fig. 2.1: Commonly employed aesthetics in data visualization include position, shape, size, color,
line width, and line type. Some aspects can depict both continuous and discrete data (position, size,
line width, and color), while others can, for the most part, represent only discrete data (shape and
line type).
The position of a graphical element is very important, i.e., where the element is. In standard 2D graphics, we refer to positions through an x and y value; however, other coordinate systems, including one- or three-dimensional visualizations, exist. Next, all graphical elements have a certain shape, size, and color. Even in a purely black-and-white drawing, graphical elements must have a certain color to, at the very least, be visible: for instance, black if the background is white and white if it is black. Finally, as far as we visualize the data using lines, they can display different widths or dash-dot patterns. New aesthetics may further emerge in diverse situations beyond the examples shown in fig. 2.1: for instance, for textual representation, we specify the font family, font face, and font size, and if graphical objects overlap, we might have to specify whether they are somewhat transparent.
2.1.1 Position: Positions refer to the arrangement of data points along one or more axes in a chart or
graph. This is known to be one of the most fundamental ways to encode data visually, as the human mind
effortlessly interprets spatial variances.
• X-Axis (Horizontal): It represents independent variables, such as time, categories, or sequences.
• Y-Axis (Vertical): It represents dependent variables, such as quantities, percentages, or concrete numbers.
• Z-Axis (Depth in 3D graphs): One of the dimensions of information used in 3D visualization.
Types of Position:
a. Scatter Plot: Scatter plots are graphs that show the relationship between two variables in a
dataset. They present data points on a two-dimensional plane or a Cartesian coordinate system.
The independent variable or attribute is plotted on the X-axis, while the dependent variable is
plotted on the Y-axis. These are often referred to as scatter graphs or scatter diagrams.
Scatter plots provide an immediate overview of large amounts of data. They are most useful when the dataset contains a very large number of points, each point is a pair of values, and the data under consideration is numeric (as shown in fig 2.2).
• Bars: Each bar represents a value and is drawn either horizontally or vertically. The longest bar corresponds to the largest value.
• Axis Titles: A bar graph contains two axis titles, one vertical and one horizontal. Both axes are connected, and we can name each axis for better understanding. Suppose the vertical axis shows expenses; we can then title the vertical axis "Expenses (in rupees)". The expenses could be of various types, so the types of expenses are named on the horizontal axis.
• Labels: We can also label the categories along the horizontal axis. For instance, the categories of expenses can be labeled as medical, transport, office, etc.
• Legends: A legend indicates what each bar is showing. It is also referred to as the key of a chart. In the following graph, if we put 2019 in place of "Series 1", the blue bars of the graph indicate the year 2019 data.
• Scale: The scale is used to represent the values on the vertical axis. It can be in rupees, population, size, etc.
d. Bubble Chart (Position for Three Variables): A bubble chart (also referred to as a bubble plot)
is an expansion of the scatter plot that one employs to examine associations between three
numerical variables. In a bubble chart, one dot represents a single point of data, and for each
point, the values of the variables are represented by dot size, horizontal position, and vertical
position.
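A minimal matplotlib sketch of a bubble chart, using hypothetical values for the three variables:
import matplotlib.pyplot as plt
# Hypothetical data: x position, y position, and a third variable mapped to dot size
ad_spend = [10, 20, 30, 40, 50]            # x-axis
revenue = [120, 180, 260, 300, 400]        # y-axis
market_share = [5, 12, 20, 28, 35]         # third variable (percent)
# Scale the third variable so the bubbles are clearly visible
sizes = [s * 30 for s in market_share]
plt.scatter(ad_spend, revenue, s=sizes, alpha=0.6, color="teal")
plt.xlabel("Advertising Spend")
plt.ylabel("Revenue")
plt.title("Bubble Chart: Dot Size Encodes Market Share")
plt.show()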
Key Components of Color Mapping:
1. Hue → Different colors for categorical data.
2. Saturation → The intensity of a color (bold vs. faded).
3. Brightness → The lightness or darkness of a color (used in gradients).
Types of Color Mappings in Data Visualization:
1. Categorical Color Mapping (Distinct Colors for Groups):
Categorical colors assist users in projecting non-numeric meaning onto objects in a
visualization. They are created to be visually different from each other. The Spectrum
categorical 6-color palette has been made distinguishable for color vision-deficient users (as
shown in fig 2.6).
3. Diverging:
Diverging colors also carry numeric significance. They come in handy when working with
negative values or ranges that have two extremes and a baseline in the middle. Diverging
palettes are a set of 2 gradations of colors that converge in the middle.
Spectrum contains 3 palettes that are specifically meant to be used with diverging data (as
shown in fig 2.9):
• Orange-yellow-seafoam
• Red-yellow-blue
• Red-blue
• Colors: Red, Blue, Green, Yellow
• Countries: USA, Canada, Germany, Australia
• Car Brands: Toyota, Ford, BMW, Honda
• Blood Types: A, B, AB, O
Techniques used for nominal data visualization:
a. Bar Charts: A bar chart, also known as a bar graph, gives a pictorial representation of numeric values for the levels of a categorical feature, encoded as bar lengths. The category levels and their values are plotted on separate axes. Each level of the feature occupies its own rectangular bar, as illustrated in fig. 2.13. The height of a bar reflects the magnitude of the value for that category. Bars are placed on the same baseline, allowing easy and accurate visual comparison of values.
histograms. The color of each cell indicates the value of the main variable in the range to which it
corresponds.
2. Education Levels
Levels of education ranked in order:
• High School
• Bachelor’s Degree
• Master’s Degree
• PhD
A PhD is ranked higher than a Master’s, but the difference in knowledge or skill between each
level is not exactly measurable.
Techniques used in ordinal data visualization:
a. Line Charts: Line graphs are suitable for representing quantitative data, and they are also suitable for ordinal qualitative data. They display changes in the data across the ordered values of a criterion.
Constructing a line graph generally involves plotting a point for each ordered value of the criterion and then connecting these points with lines, as shown in fig 2.16:
Example: Customer Satisfaction Ratings Over Months
We have collected customer satisfaction ratings (ordinal data) over six months. The ratings are
categorized as:
• Very Unsatisfied (1)
• Unsatisfied (2)
• Neutral (3)
• Satisfied (4)
• Very Satisfied (5)
The data shows the average rating per month:
Month Average Satisfaction Rating (1-5)
January 3.2
February 3.5
March 3.8
April 4.0
May 4.2
June 4.5
1. X-Axis (Months) – Displays the months from January to June, showing the progression of time
(as shown in fig 2.16).
2. Y-Axis (Satisfaction Ratings) – Represents ordinal data (customer satisfaction levels) on a scale
from 1 (Very Unsatisfied) to 5 (Very Satisfied).
3. Line Trend – The line gradually increases, indicating an improvement in customer satisfaction
over time.
4. Markers (o) – Each point on the graph represents the average rating for that month, making it
easy to track changes.
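A minimal matplotlib sketch of this line chart, using the monthly averages from the table above:
import matplotlib.pyplot as plt
months = ["January", "February", "March", "April", "May", "June"]
avg_rating = [3.2, 3.5, 3.8, 4.0, 4.2, 4.5]   # ordinal scale from 1 (Very Unsatisfied) to 5 (Very Satisfied)
plt.figure(figsize=(8, 5))
plt.plot(months, avg_rating, marker="o", color="purple")
plt.ylim(1, 5)                                 # keep the full ordinal scale visible
plt.xlabel("Month")
plt.ylabel("Average Satisfaction Rating (1-5)")
plt.title("Customer Satisfaction Ratings Over Months")
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.show()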
b. Diverging Stacked Bar Charts: This slightly more complex chart type displays ordinal data, especially Likert survey results, where responses range from negative to positive (strongly disagree to strongly agree).
Example: Employee Job Satisfaction Survey
A company conducted a survey where employees rated their job satisfaction on a 5-point Likert scale as
shown in fig 2.17:
1. Strongly Disagree
2. Disagree
3. Neutral
4. Agree
5. Strongly Agree
Below is the survey data for different departments:
Department    Strongly Disagree    Disagree    Neutral    Agree    Strongly Agree
HR            5%                   10%         15%        40%      30%
IT            10%                  15%         20%        30%      25%
Marketing     8%                   12%         25%        35%      20%
Finance       6%                   9%          20%        40%      25%
1. Negative Responses (Left Side in Red): Strongly Disagree and Disagree responses are displayed on the left, creating a diverging effect. This helps visually separate negative feedback from positive feedback.
2. Neutral Responses (Center in Gray): The neutral responses are placed at the center, acting as a reference point.
3. Positive Responses (Right Side in Blue): Agree and Strongly Agree responses are displayed on the right, showing positive sentiment.
4. Vertical Reference Line at Zero (Neutral Point): The vertical black line at zero divides negative and positive opinions.
5. Comparison Across Departments: HR and Finance have higher positive satisfaction (longer blue bars). IT and Marketing show a more balanced mix of satisfaction and dissatisfaction.
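A hedged matplotlib sketch of such a diverging stacked bar chart for this survey; this is one reasonable construction, and the book's own fig 2.17 may differ in detail:
import numpy as np
import matplotlib.pyplot as plt
departments = ["HR", "IT", "Marketing", "Finance"]
strongly_disagree = np.array([5, 10, 8, 6])
disagree = np.array([10, 15, 12, 9])
neutral = np.array([15, 20, 25, 20])
agree = np.array([40, 30, 35, 40])
strongly_agree = np.array([30, 25, 20, 25])
y = np.arange(len(departments))
plt.figure(figsize=(9, 5))
# Neutral is split evenly around zero so it sits at the center
plt.barh(y, neutral / 2, color="lightgray", label="Neutral")
plt.barh(y, -neutral / 2, color="lightgray")
# Negative responses extend to the left of the neutral band
plt.barh(y, -disagree, left=-neutral / 2, color="salmon", label="Disagree")
plt.barh(y, -strongly_disagree, left=-neutral / 2 - disagree, color="red", label="Strongly Disagree")
# Positive responses extend to the right of the neutral band
plt.barh(y, agree, left=neutral / 2, color="lightblue", label="Agree")
plt.barh(y, strongly_agree, left=neutral / 2 + agree, color="blue", label="Strongly Agree")
plt.axvline(0, color="black")              # vertical reference line at the neutral point
plt.yticks(y, departments)
plt.xlabel("Percentage of Responses")
plt.title("Employee Job Satisfaction by Department (Diverging Stacked Bar)")
plt.legend(loc="lower right")
plt.show()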
3. Discrete Numerical Data Visualization: Discrete numerical data take a countable set of distinct values (e.g., number of students in a class, total customer complaints per month). Visualization methods for such data are particularly useful for showing trends, comparisons, and distributions, because there are gaps between the discrete values.
Techniques used for visualizing discrete data:
a) Column Chart (Vertical Version of Bar Chart): A vertical variant of a bar chart used to
compare discrete numerical data across varying categories.
Example: Number of Products Sold Per Month:
A store tracks the number of electronic gadgets sold over six months as shown in fig 2.18:
Month Units Sold
January 120
February 150
March 180
April 140
May 200
June 170
Techniques used for visualizing continuous numerical data:
a. Histogram: A histogram refers to a type of bar chart that provides a graphical representation
of the distribution of continuous numerical data by grouping the data into intervals or bins. It
gives an overall idea of the shape, dispersion, and central tendency of the data being
displayed.
Example: Distribution of Student Test Scores
A teacher records test scores of 50 students ranging from 40 to 100(as shown in fig 2.20)
1. The peak (mode) of the curve shows the most common score range.
2. The spread of the curve indicates the variability in scores.
3. The presence of multiple peaks could indicate different groups in the data (e.g., high and low
performers).
2.1 Visualizing Amounts
Bar charts, pie charts, and stacked area charts are three common forms of representing amounts in
data visualizations. They help compare quantities, proportions, and trends over time.
a. Bar Charts: A bar chart is one of the best methods to visualize amounts. It uses rectangular bars to
represent categories, whose height (for a vertical bar chart) or length (for a horizontal bar chart)
relates to the amount or frequency of that data.
Example: Monthly Sales Performance of a Retail Store
A retail store tracks its monthly sales revenue for the past six months to analyze trends and
performance. The store's sales data is:
Month Sales Revenue (in $)
January 25,000
February 30,000
March 28,000
April 35,000
May 40,000
June 38,000
Code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate a bar graph for Monthly Sales Performance of a Retail Store
# Sample data
months = ["January", "February", "March", "April", "May", "June"]
sales_revenue = [25000, 30000, 28000, 35000, 40000, 38000] # Sales revenue in dollars
# Create the bar chart
plt.figure(figsize=(8, 5))
plt.bar(months, sales_revenue, color='blue')
# Labels and title
plt.xlabel("Months")
plt.ylabel("Sales Revenue ($)")
plt.title("Monthly Sales Performance of Retail Store")
# Show values on top of bars
for i, value in enumerate(sales_revenue):
plt.text(i, value + 500, str(value), ha='center', fontsize=10)
# Show grid and display plot
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
Output of the code is shown in fig 2.22
Output:
colors=["blue", "green", "red", "purple", "orange"], alpha=0.7)
# Labels and title
plt.xlabel("Months")
plt.ylabel("Website Traffic (Visitors)")
plt.title("Website Traffic Sources Over 6 Months")
plt.legend(loc="upper left")
# Display the stacked area chart
plt.show()
Output of the code is shown in fig 2.24
Output:
Code:
# Data for exam scores distribution
score_ranges = ["40-50", "50-60", "60-70", "70-80", "80-90", "90-100"]
student_counts = [3, 8, 12, 15, 10, 5]
# Define bin edges to match the score ranges
bins = [40, 50, 60, 70, 80, 90, 100]
# Generate sample data points within each range
exam_scores=np.concatenate([np.random.randint(bins[i], bins[i+1], student_counts[i]) for i in
range(len(bins)-1)])
# Create histogram
plt.figure(figsize=(8, 6))
plt.hist(exam_scores, bins=bins, color='blue', edgecolor='black', alpha=0.7)
# Labels and title
plt.xlabel("Exam Score Ranges")
plt.ylabel("Number of Students")
plt.title("Distribution of Exam Scores")
# Show grid
plt.grid(axis='y', linestyle='--', alpha=0.6)
# Display the histogram
plt.show()
Output of the code is shown in fig 2.25
Output:
b. Density Plots: Density plots are smoothed alternatives to histograms. They display the probability density function of the data, assigning higher densities to regions with a higher frequency of observations, and use kernel density estimation to better visualize the trend in the data.
Example: A company measures the time each visitor spends on its website every day. A density plot gives a clear picture of the most commonly occurring time ranges (e.g., most users spend between 2 and 5 minutes).
Code:
# Generate sample data: user session durations in minutes
np.random.seed(42) # For reproducibility
session_durations = np.concatenate([
    np.random.normal(3, 1, 300),    # Most users stay around 3 minutes
    np.random.normal(7, 2, 150),    # Some users stay longer, around 7 minutes
    np.random.normal(12, 3, 50)     # A few users stay much longer, around 12 minutes
])
# Ensure no negative session durations
session_durations = session_durations[session_durations > 0]
# Create density plot
plt.figure(figsize=(8, 6))
sns.kdeplot(session_durations, fill=True, color="blue", alpha=0.6)
# Labels and title
plt.xlabel("Session Duration (Minutes)")
plt.ylabel("Density")
plt.title("Density Plot of Website Session Durations")
# Show grid
plt.grid(axis='y', linestyle='--', alpha=0.6)
# Display the density plot
plt.show()
Output of the code is shown in fig 2.26
Output:
a. The highest peak is around 3 minutes, meaning most users stay on the site for this duration.
b. A secondary peak of around 7 minutes suggests another group of users spends a bit more time.
c. A smaller peak of around 12 minutes indicates that a few users stay significantly longer.
d. The density gradually decreases after 15 minutes, showing that very few users stay beyond that.
2.4 Visualizing Propositions:
Visualizing propositions refers to the use of visual elements such as graphs, charts, or diagrams to present and convey propositions, that is, statements that can be true or false. Instead of relying on complex computations and raw text, visualization techniques make it easier to understand logical structure, the flow of the data, and decision-making processes. They are used in several fields, such as data science, artificial intelligence, philosophy, mathematics, and business intelligence, for an easier representation of logical rules, conditions, and dependencies [20].
1. Propositions and Logical Statements
To ensure data integrity, improve interpretability, and structure visual insights, propositions and logical
statements are essential components of data visualisation. These components support the development of
rules, the production of conclusions from graphical representations, and the validation of those findings.
a. Propositions: A proposition is a statement that is either true (T) or false (F) but not both. In data
visualization, propositions are used to describe data patterns, relationships, and trends.
Propositions are declarative statements that have a particular truth value, which means that they are
either true (T) or false (F), but not both at the same time. In fields like data visualisation, artificial
intelligence, machine learning, and programming, propositions are essential components of logical
reasoning, mathematical proofs, and computational logic.
The fundamental logical relations and norms can be represented by manipulating and combining
proposition operators such as conjunction (∧), disjunction (∨), negation ¬, implication →, and
biconditional ↔.
Examples of propositions in Data Visualization:
1. Proposition 1: "Product A's sales grew in Q4." (Verifiable with a line or bar chart.)
2. Proposition 2: "Country X has a larger population than country Y." (A choropleth map can be used for testing.)
3. Proposition 3: "Missing values are present in the dataset." (Data profiling tools can be used for validation.)
A statement is not a proposition if it cannot be given a clear truth value.
Logical Connectives in Data Visualization
Logical connectives are used to combine multiple propositions and create complex statements that
influence decision-making in data visualization.
Table 2.1: Logical Connectives
Logical Operator    Symbol    Meaning           Example in Data Visualization
Negation            ¬P        NOT               "It is NOT true that sales increased in Q4."
Conjunction         P ∧ Q     AND               "Sales increased AND profits rose."
Disjunction         P ∨ Q     OR                "Sales increased OR marketing expenses decreased."
Implication         P → Q     if…then           "If website traffic increases, then sales increase."
Biconditional       P ↔ Q     if and only if    "Sales increase if and only if customer engagement increases."
Propositional logic's logical operators have a number of significant properties:
1. Commutativity: P ∧ Q ≡ Q ∧ P; P ∨ Q ≡ Q ∨ P
2. Associativity: (P ∧ Q) ∧ R ≡ P ∧ (Q ∧ R); (P ∨ Q) ∨ R ≡ P ∨ (Q ∨ R)
3. Distributivity: P ∧ (Q ∨ R) ≡ (P ∧ Q) ∨ (P ∧ R)
4. Identity: P ∧ true ≡ P; P ∨ false ≡ P
5. Domination: P ∧ false ≡ false; P ∨ true ≡ true
6. Double Negation: ¬(¬P) ≡ P
7. Idempotence: P ∧ P ≡ P; P ∨ P ≡ P
Examples of Propositional Logic in Data Visualizations
1. Conditional Data Filtering & Highlighting
Use Case: Highlighting data points based on logical conditions.
● Example Proposition:
"If sales are above $10,000, then mark the region as profitable."
Logical Form: Sales > 10000 → Profitable
1. Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
2. Create Sample Data
data = {"Region": ["North", "South", "East", "West"],
"Sales": [12000, 8000, 15000, 5000]}
df = pd.DataFrame(data)
Region Sales
North 12000
South 8000
East 15000
West 5000
Output:
b. Logical Statements
Logical statements are conditional expressions that affect the processing, classification, filtering, and display of data in data visualisation. By highlighting connections, patterns, and anomalies, these statements make visualisations more meaningful.
Logical statements are commonly based on:
• Thresholds (e.g., indicating high vs. low numbers)
• Comparisons (e.g., condition-based data grouping)
• Boolean logic (e.g., colour coding using if-else conditions)
• Filtering (displaying only important information)
Examples of Logical Statements in Data Visualizations
Logical statements in data visualization are used to filter, categorize, highlight, or modify data
presentation based on conditions. Here are some practical examples using Python and libraries like
Matplotlib, Seaborn, and Pandas.
1. Conditional Formatting (Color Coding a Bar Chart)
Example: Highlighting sales performance based on a threshold.
Use Case: Differentiate between high and low-performing regions
import pandas as pd
import matplotlib.pyplot as plt
# Sample Data
data = {"Region": ["North", "South", "East", "West"], "Sales": [12000, 8000, 15000, 5000]}
df = pd.DataFrame(data)
# Apply logical condition for coloring
colors = ['green' if sales > 10000 else 'red' for sales in df["Sales"]]
# Create Bar Chart
plt.bar(df["Region"], df["Sales"], color=colors)
plt.xlabel("Region")
plt.ylabel("Sales")
plt.title("Sales Performance by Region")
plt.show()
Output of the code is shown in fig 2.29
Output:
2. Defining Sets of Features:
python_features = {"OOP", "Dynamic Typing", "Interpreted", "Libraries"}
java_features = {"OOP", "Strong Typing", "JVM", "Enterprise"}
cpp_features = {"OOP", "Strong Typing", "Compiled", "Performance"}
• We define sets of characteristics unique to Python, Java, and C++.
• These sets contain elements that represent key properties of each language.
3. Creating the Venn Diagram:
plt.figure(figsize=(6,6))
venn = venn3([python_features, java_features, cpp_features],('Python', 'Java', 'C++'))
• Using the venn3() function, we plot the intersection and differences of the three languages.
• Overlapping areas indicate shared features, while non-overlapping sections represent unique
features.
4. Displaying the Diagram:
# Add a title
plt.title("Feature Comparison of Python, Java, and C++")
plt.show()
• plt.show() is used to render the diagram.
1. Importing Required Libraries
import matplotlib.pyplot as plt
from matplotlib_venn import venn3
● matplotlib.pyplot is used for visualization and rendering the diagram.
● venn3 from matplotlib_venn helps us create a Venn diagram with three sets.
2. Defining the Sets and Their Intersections
The venn3 function takes a dictionary that specifies the sizes of the sets and their overlaps.
venn = venn3(subsets={'100': 10, '010': 8, '001': 6, '110': 4, '101': 3, '011': 2, '111': 1},
set_labels=('A', 'B', 'C'))
Each key in the dictionary represents a combination of set membership, where:
● '100' means only Set A (10 elements).
● '010' means only Set B (8 elements).
● '001' means only Set C (6 elements).
● '110' represents the intersection of Set A and Set B (4 elements).
● '101' represents the intersection of Set A and Set C (3 elements).
● '011' represents the intersection of Set B and Set C (2 elements).
● '111' represents the intersection of all three sets (1 element).
3. Adding a Title to the Diagram
plt.title("Euler Diagram Approximation with 3 Sets")
4. Displaying the Diagram
plt.show()
Output of the code is shown in fig 2.34
Output:
Venn Diagram vs. Euler Diagram:
• Overlap: a Venn diagram shows all possible overlaps between sets with shared regions; an Euler diagram may not show all overlaps, focusing only on relevant ones.
• Completeness: a Venn diagram displays all possible relationships between sets, even if some are empty; an Euler diagram can be partial, showing only relevant relationships without all intersections.
• Expressiveness: a Venn diagram is limited to basic set operations like union, intersection, and difference; an Euler diagram is more expressive, able to represent complex relationships and dependencies.
• Complexity: a Venn diagram becomes cluttered and harder to interpret with more than three sets; an Euler diagram handles more sets and complex relationships in a clearer, often simpler way.
2.5 Truth Table
A truth table is a table that shows all the possible combinations of truth values (true or false) for a set of
statements or propositions. Each row of the table represents a different scenario, and each column
represents a different statement or proposition. The table also shows the truth value of a compound
statement or proposition that is formed by combining the statements or propositions with logical
operators, such as and, or, not, if-then, and if and only if. For example, the following table shows the truth
values of the statements p, q, and p and q [22].
p q p and q
T T T
T F F
F T F
F F F
2.5.1 Truth Tables in Data Visualization:
A truth table is a mathematical table used to work out the truth values of logical propositions depending on the values of their variables. Truth tables can appear in data visualization in the following ways:
Binary visualization: true/false values for all combinations of logical propositions are shown in a table or grid format; the propositions may involve AND, OR, or NOT.
Interactive Diagrams: frameworks that represent truth values dynamically, letting the user set inputs on screen and see the resulting logical outcome.
Venn Diagrams: truth values can also be shown with Venn diagrams, where intersections between sets represent logical relationships such as conjunction (AND), disjunction (OR), or negation (NOT).
In this notebook, we will examine Boolean logic operations through a truth table and represent it visually
with a heatmap.
1. Importing Required Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
• pandas is used for creating and managing tabular data (DataFrames).
• seaborn is a visualization library that helps in creating aesthetically pleasing plots.
• matplotlib.pyplot is used for rendering the plots.
2. Defining Boolean Values for A and B
We define the possible truth values for two Boolean variables, A and B:
A = [True, True, False, False]
B = [True, False, True, False]
Since Boolean values can be either True (1) or False (0), we enumerate all possible combinations for two
variables.
3. Constructing the Truth Table
# Create a DataFrame for the truth table
truth_table = pd.DataFrame({
'A': A,
'B': B,
'A AND B': [a and b for a, b in zip(A, B)],
'A OR B': [a or b for a, b in zip(A, B)],
'NOT A': [not a for a in A],
'NOT B': [not b for b in B]
})
● A AND B: Logical AND operation (True only if both A and B are True).
● A OR B: Logical OR operation (True if either A or B is True).
● NOT A: Negation of A (inverts the value of A).
● NOT B: Negation of B (inverts the value of B).
4. Displaying the Truth Table
print(truth_table)
5. Visualizing the Truth Table with a Heatmap:
plt.figure(figsize=(8, 4))
sns.heatmap(truth_table[['A AND B', 'A OR B', 'NOT A', 'NOT B']].astype(int),
annot=True, cmap='Blues', cbar=False)
plt.title('Truth Table Visualized as a Heatmap')
plt.show()
● plt.figure(figsize=(8, 4)): Sets the figure size for better readability.
● truth_table[['A AND B', 'A OR B', 'NOT A', 'NOT B']].astype(int): Converts Boolean values
(True/False) into integers (1/0) for visualization.
● sns.heatmap(..., annot=True, cmap='Blues', cbar=False):
○ annot=True displays values inside the heatmap.
○ cmap='Blues' assigns a blue color gradient.
○ cbar=False removes the color bar for simplicity.
Output of the code is shown in fig 2.35
Output:
3. Adding Nodes (Claims and Supporting Arguments)
G.add_node("We should eat healthy") # Main claim
G.add_node("Healthy food improves energy") # Supporting reason 1
G.add_node("Healthy food prevents diseases") # Supporting reason 2
4. Adding Edges (Logical Connections)
G.add_edge("We should eat healthy", "Healthy food improves energy")
G.add_edge("We should eat healthy", "Healthy food prevents diseases")
5. Visualizing the Argument Map
plt.figure(figsize=(6, 4))
pos = nx.spring_layout(G, seed=42) # Layout for better visualization
nx.draw(G, pos, with_labels=True, node_size=2000, node_color="lightgreen",
edge_color="black", font_size=10, font_weight="bold", arrows=True)
plt.title("Simple Argument Map")
plt.show()
Output of the code is shown in fig 2.36
Output:
G.add_node("Weight Control")
4. Adding Edges (Causal Relationships)
G.add_edge("Exercise", "Better Health")
G.add_edge("Healthy Diet", "Better Health")
G.add_edge("Exercise", "Weight Control")
G.add_edge("Weight Control", "Better Health")
● These edges represent cause-and-effect relationships:
• "Exercise" → "Better Health" → Exercise directly improves health.
• "Healthy Diet" → "Better Health" → A good diet contributes to better health.
• "Exercise" → "Weight Control" → Exercise helps maintain weight.
• "Weight Control" → "Better Health" → Weight control further improves health.
5. Visualizing the Causal Diagram
plt.figure(figsize=(8, 6))
pos = nx.spring_layout(G, seed=42) # Layout for better visualization
nx.draw(G, pos, with_labels=True, node_size=3000, node_color="lightblue", edge_color="black",
font_size=12, font_weight="bold", arrows=True, arrowsize=10)
plt.title("Causal Diagram: Factors Affecting Health")
plt.show()
Output of the code is shown in fig 2.37
Output:
• Spotting outliers in data.
• Identifying clusters or patterns in datasets.
Example: The following code uses a scatter plot to show the relationship between sepal length and sepal width in the Iris dataset. The dataset is first loaded from a CSV file, and then Seaborn and Matplotlib are used to build the scatter plot in a Kaggle notebook.
1. Importing Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Loading the Dataset
file_path = '/kaggle/input/practica/iris.csv' # Ensure the file path is correct
df = pd.read_csv(file_path)
3. Creating a Scatter Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species", style="species",
palette="deep")
● plt.figure(figsize=(8, 5)) → Sets the figure size to 8x5 inches for better visibility.
● sns.scatterplot(...) → Creates a scatter plot using Seaborn:
• data=df → Uses the Iris dataset.
• x="sepal_length", y="sepal_width" → Plots sepal length (x-axis) against sepal
width (y-axis).
• hue="species" → Colors each point based on the species category.
• style="species" → Uses different marker styles for each species.
• palette="deep" → Uses a distinct color palette for better differentiation.
4. Adding Labels and Title
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")
plt.title("Scatter Plot of Sepal Length vs Sepal Width (Iris Dataset)")
5. Displaying the Plot
plt.show()
Output of the code is shown in fig 2.38
Output:
The scatter plot visualizes the relationship between sepal length and sepal width in the Iris
dataset, differentiating species using colors and markers.
b. Regression Line
A regression line is a line used to describe the overall behavior of a set of data; in other words, it gives the best trend of the given data. Regression lines are useful in forecasting procedures. Their purpose is to describe the relationship of a dependent variable with one or more independent variables such as X.
The equation derived from the regression line is a guiding tool an analyst can use to forecast the future behavior of the dependent variable for different values of the independent variable(s).
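The data-preparation steps of this listing (importing libraries and defining X and Y) fall outside this excerpt. A minimal sketch, with purely illustrative values, of the kind of setup the steps below assume:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative data only; the original listing defines its own X and Y values.
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # independent variable as a column vector
Y = np.array([2, 4, 5, 4, 6])                 # dependent variable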
3. Fitting the Linear Regression Model
model = LinearRegression()
model.fit(X, Y)
Y_pred = model.predict(X)
• LinearRegression() initializes a linear regression model.
• fit(X, Y) trains the model to find the best-fitting line for the data.
• predict(X) generates predicted Y values (i.e., regression line values).
4. Plotting the Data and Regression Line
plt.scatter(X, Y, color='blue', label="Actual Data")
plt.plot(X, Y_pred, color='red', label="Regression Line")
• plt.scatter(X, Y, color='blue') → Plots the actual data points in blue.
• plt.plot(X, Y_pred, color='red') → Draws the best-fit regression line in red.
5. Adding Labels and Title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Regression Line Example")
plt.legend()
plt.show()
• legend() → Adds a legend to differentiate the actual data points and the regression line.
Output of the code is shown in fig 2.40
Output:
○ Values range from -1 to 1.
○ 1 means perfect positive correlation.
○ 0 means no correlation.
○ -1 means perfect negative correlation.
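The earlier steps of this listing, which build the correlation matrix, fall outside this excerpt. A minimal sketch, assuming a small illustrative DataFrame, of how such a matrix might be produced with pandas:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data; the original listing computes the matrix from its own dataset.
df = pd.DataFrame({'sales': [10, 12, 15, 14, 18],
                   'ads':   [1, 2, 2, 3, 4],
                   'price': [5, 5, 4, 4, 3]})
autocorrelation_matrix = df.corr()  # pairwise correlations in [-1, 1]; named to match the step below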
4. Creating a Heatmap for Visualization
sns.heatmap(autocorrelation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
● sns.heatmap(...) generates a heatmap of the correlation matrix.
○ annot=True → Displays correlation values inside the heatmap.
○ cmap='coolwarm' → Uses the Cool-Warm color scheme (blue for negative, red for
positive correlations).
○ vmin=-1, vmax=1 → Defines the color scale from -1 to 1 for consistency.
5. Adding a Title and Displaying the Plot
plt.title('Autocorrelation Matrix Heatmap')
plt.show()
Output of the code is shown in fig 2.41
Output:
● Animation: Shows temporal changes, illustrating how data evolves over time (e.g.,
decades).
Example Use Cases of Bubble Charts
● Market Analysis: One can represent different companies according to revenue (X-axis), profit (Y-
axis), and market share (bubble size).
● Population Studies: Showing the countries based on GDP (X-axis), literacy rate (Y-axis),
population size (bubble size).
● Health Data: Showing the average BMI (X-axis), life expectancy (Y-axis), and pollution levels
(bubble size) in different cities.
● Financial Analysis: Comparing stocks by price (X-axis), growth rate (Y-axis), and trading volume (bubble size).
Example: Let's create a bubble chart using matplotlib.pyplot in a Kaggle notebook using Python.
1. Import the Required Library
import matplotlib.pyplot as plt
2. Define the Data
products = ['A', 'B', 'C', 'D', 'E']
revenue = [50, 70, 90, 30, 60] # X-axis (Revenue in $1000s)
profit = [10, 25, 30, 8, 15] # Y-axis (Profit in $1000s)
sales_volume = [200, 450, 300, 150, 400] # Bubble size (Units sold)
3. Scale the Bubble Size
bubble_size = [s * 2 for s in sales_volume]
● Sales volume is scaled by multiplying each value by 2.
● This ensures bubbles are proportionate and visible on the chart.
4. Create the Bubble Chart
plt.figure(figsize=(8, 5))
plt.scatter(revenue, profit, s=bubble_size, alpha=0.5, c=['red', 'blue', 'green', 'purple', 'orange'],
edgecolors="black")
● revenue → Plots revenue values on the X-axis
● profit → Plots profit values on the Y-axis.
● s=bubble_size → Sets bubble sizes based on sales volume.
● alpha=0.5 → Makes bubbles slightly transparent (so overlapping is visible).
5. Add Labels to Each Bubble
for i, product in enumerate(products):
    plt.text(revenue[i], profit[i], product, fontsize=12)
6. Add Axis Labels and Title
plt.xlabel("Revenue ($1000s)")
plt.ylabel("Profit ($1000s)")
plt.title("Bubble Chart: Revenue vs Profit vs Sales Volume")
7. Add Grid Lines
plt.grid(True, linestyle="--", alpha=0.5)
● Adds a grid with dashed ("--") lines and 50% transparency (alpha=0.5) for readability.
8. Display the Chart
plt.show()
Output of the code is shown in fig 2.42
Output:
Comparison of a bubble chart and a scatter plot matrix:
• Definition: A bubble chart builds on the standard scatter plot, representing the relationship between two variables through the position of points on the x and y axes, with bubble size encoding a third variable. A scatter plot matrix is a grid of scatter plots showing pairwise relationships between all numerical variables in a dataset.
• Used for: Bubble chart – comparing three variables (X, Y, and bubble size) in a dataset. Scatter plot matrix – exploring relationships between multiple variables in a dataset.
• Number of variables: Bubble chart – 3 (X-axis, Y-axis, and bubble size). Scatter plot matrix – more than 2 (plots relationships between all numerical variables).
• Best for: Bubble chart – business analytics, sales comparisons, financial data, and economic trends. Scatter plot matrix – data exploration, correlation analysis, and feature selection in machine learning.
• Libraries used: Bubble chart – Matplotlib. Scatter plot matrix – Seaborn.
• Visualization type: Bubble chart – a single chart with bubbles of different sizes and colors. Scatter plot matrix – multiple scatter plots arranged in a grid.
• Example: Bubble chart – showing revenue, profit, and market share of different companies. Scatter plot matrix – exploring relationships between GDP, literacy rate, life expectancy, etc.
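The scatter plot matrix described above has no accompanying listing in this excerpt. A minimal sketch using seaborn's pairplot, reusing the Iris CSV path from the earlier scatter plot example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/practica/iris.csv')  # same file used in the scatter plot example
sns.pairplot(df, hue='species')  # grid of pairwise scatter plots for all numeric columns
plt.show()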
2.9 Time-Series Analysis (Trends & Seasonality):
A time series is simply a set of data recorded over time, typically at uniform intervals. The best-known examples of time-series data are financial, such as stock prices or foreign exchange rates. Time-series data can also capture meteorological measurements or business information such as sales figures [24].
Time series can thus be either univariate or multivariate:
• Univariate time series data concern a single variable observed over time. Examples include the price of a particular stock or the number of new cases of a disease seen each day.
• Multivariate time series refer to time data on several (independent) variables; a good example would be weather data, which may include temperature, humidity, and precipitation.
• Time series are plotted over time, and their analysis involves the application of statistical methods. Time-series analysis also forms the basis for forecasting, giving a lot of insight into understanding complex data.
There are two principal types of time-series data:
• Continuous data: Observations recorded at equal intervals that can be represented in graphs as a continuous line, for instance thermometer readings.
• Discrete data: Individual data points collected at specified instants and represented in graphs as separate points, for instance survey responses.
Creating Time Series analysis using line chart:
Example: 1. Import Libraries
import matplotlib.pyplot as plt
import pandas as pd
2. Create Time Series Data
dates = pd.date_range(start="2023-01-01", periods=10, freq='D')
sales = [100, 120, 130, 125, 140, 145, 160, 180, 190, 200]
● pd.date_range() generates dates starting from January 1, 2023, for 10 days.
3. Plot the Line Chart
plt.plot(dates, sales, marker='o', linestyle='-', color='b', label="Sales Trend")
● plt.plot(x, y, options) → Creates the line chart.
● marker='o' → Marks each data point with a circle.
● linestyle='-' → Draws a solid line connecting points.
● color='b' → Sets the line color to blue.
● label="Sales Trend" → Adds a legend label.
4. Add Labels, Title, and Grid
plt.xlabel("Date") # X-axis label
plt.ylabel("Sales ($)") # Y-axis label
plt.title("Daily Sales Trend") # Chart title
plt.legend() # Show legend
plt.grid(True) # Add a grid
5. Display the Chart
plt.show()
Output of the code is shown in fig 2.44
Output:
2.10 Advanced Data Visualizations
Advanced data visualization is a branch of data visualization that involves representations of complex
datasets in visual formats using sophisticated tools and techniques. Apart from basic graphs and
charts, advanced visualization methods include interactive dashboards, heat maps, 3D plots, and real-
time data feeds. This level of interaction allows users to explore data in more detail and with greater
intuitiveness, facilitating the identification of trends, patterns, and anomalies. These techniques
provide businesses with sophisticated insights, leading to informed decision-making [25].
Advanced Data Visualization Techniques
1. Interactive Dashboards
An interactive dashboard is a dynamic tool that permits the user to manipulate data and see
different views in real time. It often combines multiple data visualizations into a single interface,
allowing users to drill in for further detail, filter data sets, and possibly monitor KPIs. This
interactivity allows for greater insight into the data and more informed business decisions.
2. Geospatial Visualization
Geospatial visualization is a process in which certain data points are placed on a geographical
location, allowing the reader to observe the relationship and patterns of data in space. It is
pertinent to organizations that operate in multiple regions or those that want to get a deeper
understanding of location-based data such as sales territories, customer demographics, or supply
chain visualization. Companies can see trends that would otherwise not be seen, thus making
informed location-based strategic decisions.
3. Heat Maps & Density Maps
Heat maps and density maps use color gradients to represent data values across a two-dimensional space. They visualize the intensity of data points within an area or along some measurable scale: website clicks, customer activity, resource usage, and so on. A heat map shows where data points are concentrated, allowing a business to identify hotspots and optimize resources accordingly.
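A minimal sketch of this idea, using purely illustrative activity coordinates and Matplotlib's 2D histogram as the density heat map:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=1000)  # illustrative activity coordinates (e.g., click positions)
y = rng.normal(size=1000)
plt.hist2d(x, y, bins=30, cmap='hot')  # color intensity encodes how many points fall in each cell
plt.colorbar(label='Count per cell')
plt.title('Density heat map of activity hotspots')
plt.show()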
2.11 Visualizing Geospatial Data
Geospatial Visualization is the process of displaying geographic data on maps to analyze patterns,
relationships, and trends. It is commonly used in GIS (Geographic Information Systems), urban
planning, climate studies, and business analytics.
Analysing geospatial data enables us to explore and find commonalities and relationships among the
items in our geographically modelled world. Information on the distance between two items, the
shortest path between them, the state of the area we monitor, and the terrain's height and land relief
are all provided by the components of geospatial analysis. A 2D or 3D model of a specific area can
then be made using this information. Business and public infrastructure decision-making are aided by
geospatial analysis. For instance, it can be used to determine if ambulances can reach any location
within a specified emergency response time or how they travel across a metropolis [25].
Through the use of maps, charts, and spatial analysis methods, geospatial visualisations turn location-
based data into insightful understandings. Additional geospatial visualisation types beyond the
fundamental ones are listed below:
1. Choropleth Maps: A map in which regions (states, districts, and countries) are shaded in various
colours according to a data value (such as GDP, COVID-19 cases, or population density).
They are ideal for comparing information across different geographical areas. Example: A global map with a color gradient displaying GDP by nation.
Fig. 2.46: Choropleth map representing the number of COVID-19 cases worldwide, on
2020/06/01.
2. Point Map: A point map is the simplest way of seeing geospatial data. It involves plotting a point on the map at each location that corresponds to the variable you are measuring (for example, a landmark such as a hospital).
This technique is effective for revealing the distribution and density patterns of the objects of interest, but it needs accurate collection or geocoding of location data to pinpoint each location directly on the map, as illustrated in fig 2.47. The point technique can become unwieldy on a large-scale map, because points may overlap at some zoom levels.
3. Proportional symbol map
Like the point map, this shows a feature at a particular location represented as a circle or some
other shape. At the same time, it can convey several other variables from one point using its size
and/or color (e.g., population and/or average age) as depicted in fig 2.48.
The proportional symbol maps allow integrating several variables at the same time. However,
they share the same limitation that point maps have: the effort to capture too many data points on
a large-scale map, particularly across relatively small geographic regions, can result in an overlap.
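A minimal sketch of a proportional symbol map in GeoPandas, reusing the bundled naturalearth_lowres dataset loaded elsewhere in this book (newer GeoPandas releases may require downloading it separately); the scaling factor is illustrative:
import geopandas as gpd
import matplotlib.pyplot as plt

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')).to_crs('EPSG:3857')
ax = world.plot(color='lightgrey', edgecolor='white', figsize=(10, 5))
symbols = world.copy()
symbols['geometry'] = symbols.geometry.centroid      # one symbol per country
symbols.plot(ax=ax, color='crimson', alpha=0.6,
             markersize=symbols['pop_est'] / 5e6)    # symbol size scaled by population
ax.set_axis_off()
ax.set_title('Proportional symbol map: population by country')
plt.show()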
Chapter 3
VISUALIZING TRENDS AND UNCERTAINTY
Ramandeep Kaur1, Ritu Rani2, Sahib Singh3 and Navdeep Kaur4
1,3,4GNA University, Phagwara
2Rayat Bahra Group of Institutions and Nanotechnology, Hoshiarpur
For example, in stock market analysis, smoothing techniques help investors distinguish between daily
price variations and genuine long-term growth trends. Similarly, in climate studies, smoothing is applied
to temperature records to observe global warming patterns over decades.
By using appropriate smoothing techniques, data visualization becomes more insightful, guiding better
decision-making across multiple domains.
2. Exponential smoothing
Time-series forecasting methods treat a forecast as a weighted combination of past observations (lags). Exponential smoothing gives more weight to the most recent observations, with weights that decrease exponentially the further back in time an observation lies, under the assumption that the future will resemble the recent past. The term "exponential smoothing" reflects the fact that each demand observation is assigned an exponentially decreasing weight.
• This captures the general pattern and can be extended to include trends and seasonality to make precise time-series forecasts from past data.
• Introduces some error in long-term forecasts.
• Works well when the time-series parameters change slowly over time.
Types of Exponential Smoothing
1. Simple or Single Exponential Smoothing
Simple smoothing is a method of forecasting a time series from univariate data that has no trend or seasonality. It requires only one parameter, the smoothing factor alpha (α), which controls how quickly the influence of past observations decays. The weight given to the current observation versus the past smoothed estimate depends on α: a smaller value of α puts more weight on past predictions, and vice versa. The parameter typically ranges between 0 and 1.
The formula for simple exponential smoothing is:
s_t = α·x_t + (1 − α)·s_{t−1} = s_{t−1} + α·(x_t − s_{t−1})
where,
• s_t = smoothed statistic at time t (a weighted average of the current observation x_t and the previous smoothed value)
• s_{t−1} = previous smoothed statistic
• α = smoothing factor; 0 < α < 1
• t = time period
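A minimal sketch of this recursion in plain Python, using illustrative demand values:
def simple_exponential_smoothing(x, alpha):
    # s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 initialized to the first observation
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1 - alpha) * s[t - 1])
    return s

demand = [10, 12, 13, 12, 15, 16, 18]  # illustrative observations
print(simple_exponential_smoothing(demand, alpha=0.3))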
2. Double Exponential Smoothing
Double exponential smoothing, also known as Holt's trend model, second-order smoothing, or Holt's linear smoothing, is a method used to forecast a time series when the data have a trend but no seasonal pattern. The fundamental idea behind double exponential smoothing is to add a term that accounts for the possibility that the series shows a trend.
Double exponential smoothing requires more than just the alpha parameter. It also requires a beta (β) factor to control the decay of the effect of changes in the trend. The method supports both additive and multiplicative trends.
The formulas for double exponential smoothing are:
s_t = α·x_t + (1 − α)·(s_{t−1} + b_{t−1})
b_t = β·(s_t − s_{t−1}) + (1 − β)·b_{t−1}
where,
• b_t = trend estimate at time t
• β = trend smoothing factor; 0 < β < 1
The following Python code demonstrates how to apply exponential smoothing to time-series data using pandas' exponentially weighted moving average (ewm).
1. Generating Time-Series Data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
dates = pd.date_range(start="2023-01-01", periods=100, freq='D')
values = np.cumsum(np.random.randn(100)) + 50  # Creating a random walk trend
data = pd.DataFrame({'Date': dates, 'Value': values})
• Generates 100 time-series data points from January 1, 2023.
• Uses a random walk model to simulate a real-world dataset (e.g., stock prices, sales trends).
2. Applying Exponential Smoothing
data['EWMA_0.2'] = data['Value'].ewm(alpha=0.2).mean() # Smoothing factor 0.2
data['EWMA_0.5'] = data['Value'].ewm(alpha=0.5).mean() # Smoothing factor 0.5
● ewm(alpha=α).mean() applies Exponentially Weighted Moving Average (EWMA).
● α (smoothing factor) determines the weight of recent values:
○ α = 0.2 (blue line): More smoothing, long-term trends.
○ α = 0.5 (red line): Less smoothing, reacts faster to changes.
3. Visualizing the Results
plt.figure(figsize=(12,6))
sns.lineplot(x='Date', y='Value', data=data, label='Original Data', color='black', alpha=0.5)
sns.lineplot(x='Date', y='EWMA_0.2', data=data, label='EWMA (α=0.2)', color='blue')
sns.lineplot(x='Date', y='EWMA_0.5', data=data, label='EWMA (α=0.5)', color='red')
• Original Data (black line): Represents the raw time-series data.
• EWMA (α=0.2, blue line): A smoother trend capturing long-term movements.
• EWMA (α=0.5, red line): A more responsive line that adjusts quickly to new changes.
Output of the code is shown in fig 3.3
Output:
Real-World Applications
● Stock Market Analysis: Capturing long-term growth trends in financial data.
● Climate Change Studies: Modeling temperature variations over decades.
● Sales Forecasting: Identifying seasonal patterns in business performance.
3.3 Detrending & Time-Series Decomposition
Many datasets, especially time series, contain long-term movements that mask the insights a given analysis is after. Detrending refers to the removal of such long-term developments, taking away unwanted trends that obscure the short-term variations of interest, whereas time-series decomposition breaks a dataset down into a small set of core components [25]:
Trend Component - The movement of data, such as population growth or increasing global
temperatures, occurring over a long time
Seasonal Component - Repeated patterns, for example, higher ice cream sales in summers, increased
online shopping before holidays, etc.
Residual or Noise Component - Random fluctuations that are devoid of any apparent pattern.
Moving averages: This technique smooths the dataset, revealing general trends, by taking averages over sliding windows.
Differencing: This removes trends from a time series by subtracting each value from the one that follows it, which makes the dataset stationary and more suitable for various analytical methods.
STL Decomposition (Seasonal-Trend Decomposition Using Loess): A decomposition technique well known for separating time-series data into trend, seasonal, and residual components to produce clearer insights.
For example, economic data often tend to be detrended prior to analysis. GDP growth, inflation rates, and employment levels vary over time, and by removing long-term trends, an analyst can see short-term cycles or seasonal effects more clearly.
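A minimal sketch of the moving-average and differencing ideas on a pandas Series, using illustrative values:
import pandas as pd

s = pd.Series([10, 12, 15, 14, 18, 21, 20, 24],
              index=pd.date_range('2023-01-01', periods=8, freq='D'))
rolling_mean = s.rolling(window=3).mean()  # moving average over a 3-day sliding window
differenced = s.diff()                     # difference between consecutive observations removes the trend
print(rolling_mean)
print(differenced)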
1. Generating Synthetic Time-Series Data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
time = pd.date_range(start='2020-01-01', periods=100, freq='D')
trend = np.linspace(10, 50, 100)  # Linear increasing trend
seasonality = 5 * np.sin(2 * np.pi * time.dayofyear / 30)  # Monthly seasonality
noise = np.random.normal(scale=3, size=100)  # Random fluctuations
data = trend + seasonality + noise  # Combine components
df = pd.DataFrame({'Value': data}, index=time)  # DataFrame used by the decomposition step below
• Trend Component: A linear increase in values over time.
• Seasonal Component: A repeating sinusoidal wave simulating monthly patterns.
• Noise Component: Random fluctuations mimicking real-world uncertainty.
2. Performing Time-Series Decomposition:
decomposition = seasonal_decompose(df['Value'], model='additive', period=30)
• Uses additive decomposition, which assumes that the time-series is composed of:
• Trend: Long-term growth or decline.
• Seasonality: Recurring patterns.
• Residual (Noise): Unexplained variations.
3. Visualizing the Decomposed Components:
plt.plot(decomposition.trend, label="Trend Component", color='red')
plt.plot(decomposition.seasonal, label="Seasonal Component", color='blue')
There are different types of projections in mapping:
1. Mercator Projection: This type preserves angles and directions but distorts area, making regions near
the poles appear larger. It is commonly used in navigation maps due to its accurate representation of
direction.
2. Robinson Projection: A compromise projection that balances size and shape distortions. It is used in
world maps to offer a visually pleasing global representation.
3. Equal-Area Projection (e.g., Mollweide, Albers): These projections maintain area proportions but
distort shape and distance. They are often utilized in population density maps or environmental studies.
4. Azimuthal Projection: This projection maintains accurate distances from a central point. It is
commonly used in aeronautical charts to show great-circle distances accurately.
By selecting the appropriate projection, cartographers and data analysts ensure that spatial relationships
are depicted in a way that optimally supports the analysis and communication of insights.
1. Installing Required Libraries
!pip install geopandas matplotlib cartopy
• GeoPandas: Used to handle and visualize geospatial data.
• Matplotlib: Helps in plotting the transformed maps.
• Cartopy: Provides additional support for handling map projections (useful for more advanced
cartography).
2. Loading the World Map Dataset
import geopandas as gpd
import matplotlib.pyplot as plt
# Load the world map dataset
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
3. Defining Projections to Apply
projections = {
"Mercator Projection": "EPSG:3395",
"Robinson Projection": "ESRI:54030",
"Mollweide Projection": "ESRI:54009",
"Azimuthal Projection": "ESRI:54032"
}
• A dictionary is created to store different projection systems (CRS - Coordinate Reference Systems).
• The projections included are:
• Mercator (EPSG:3395) – Preserves angles but distorts area (widely used in web mapping).
• Robinson (ESRI:54030) – A compromise projection balancing shape and size (used in world maps).
• Mollweide (ESRI:54009) – Equal-area projection, best for population or climate maps.
• Azimuthal (ESRI:54032) – Shows distances accurately from a central point.
4. Creating Subplots to Compare Projections
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
• A 2x2 grid of subplots is created to display all four projections side by side.
• The figsize=(12,8) ensures the maps are displayed clearly.
5. Applying Projections and Plotting the Maps
for ax, (title, proj) in zip(axes.flat, projections.items()):
Feature Layers:
Feature layers are additional informational overlays on primary data. These layers might represent
features on a map, such as geographic, political, or specific markers.
Example:
• A weather map might include a feature layer with city names, country borders, or rivers.
• On a scatter plot, feature layers might be trend lines or cluster boundaries.
In Kaggle, you can use GeoPandas, Folium, and Matplotlib to create a map with multiple layers.
# Install necessary libraries
!pip install folium

import folium

# Create a base map centered at a specific location
m = folium.Map(location=[20, 0], zoom_start=2)

# Add a polygon layer (example coordinates roughly near Delhi, San Francisco, and London)
folium.Polygon(
    locations=[[28, 77], [37, -122], [51, -0.1]],
    color="blue",
    fill=True,
    fill_color="lightblue",
    fill_opacity=0.5
).add_to(m)

# Add a marker layer for key cities
folium.Marker([28.6139, 77.2090], popup="New Delhi").add_to(m)        # India
folium.Marker([37.7749, -122.4194], popup="San Francisco").add_to(m)  # USA
folium.Marker([51.5074, -0.1278], popup="London").add_to(m)           # UK

# Add a circle layer (example: highlight a population zone)
folium.Circle(
    location=[34.0522, -118.2437],  # Los Angeles
    radius=500000,                  # 500 km radius
    color="red",
    fill=True,
    fill_color="pink"
).add_to(m)

m  # Display the interactive map in the notebook
Output of the code is shown in fig 3.7
Output:
color_continuous_scale="Viridis", # color scale
title="Choropleth Map of Countries by GDP")
Step 5: Displaying the Map
fig.show()
This code generates an interactive choropleth map in which each country is colored according to its GDP, with a color scale covering the range of values. Hovering over a particular country displays its name and GDP value next to the cursor, and the map can be zoomed and panned. Output of the code is shown in fig 3.8.
Output:
countries would shrink significantly, visually emphasizing global population distribution, less apparent in
traditional maps.
By effectively using cartograms, data visualization practitioners can present complex spatial data.
# Install necessary libraries
!pip install geopandas matplotlib cartopy
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
# Load a world map shapefile
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Example: Adding a hypothetical population dataset
world['Population'] = world['pop_est']
world['Cartogram_Size'] = np.log1p(world['Population']) # Apply transformation for better visualization
# Plot the cartogram
fig, ax = plt.subplots(figsize=(12, 6))
world.plot(ax=ax, column='Cartogram_Size', cmap='Blues', legend=True, edgecolor='black')
# Add titles and labels
plt.title("World Population Cartogram", fontsize=14)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
# Show the plot
plt.show()
Output of the code is shown in fig 3.9.
Output:
The visual representation of uncertainty can help you and your audience understand the quality,
reliability, and variability of your data and make informed decisions based upon them. Uncertainty may
come from various places, including errors in the measurements, bias of sampling, missing data, model
assumptions, or future scenarios. Representing uncertainty visually lets you express the range of possible outcomes together with associated confidence levels or probabilities. Such a visual mapping reveals the overall state of agreement or disagreement, and the risks and opportunities, in your data.
The roles of uncertainty and trends are vital in data analysis as they affect decision-making and
predictions. When examining economic indicators, climate trends, or stock market values, grasping the
effects of uncertainty on trends is critical. In the following sections, we delve into important techniques
for managing and illustrating uncertainty in data.
Techniques for Visualizing Uncertainty:
Uncertainties with respect to the data can arise out of several origins. These include measurement
inaccuracies, limitations associated with the data, and underlying assumptions made with regard to the
model. While deterministic models impose fixed values regarding the data, the actuality of data is usually
probabilistic.
To visualize uncertainty, we can use several techniques:
1. Confidence Interval: This is the probable range of values within which a data point is likely to fall
and is usually expressed numerically, e.g., 95% confidence. Shaded regions around trend lines in
plots represent this concept.
2. Error Bars: A graphical representation commonly used in scatter plots or bar charts to indicate the variation in data points, providing a direct indication of uncertainty (a minimal sketch follows this list).
3. Violin and box plot: Used to display distributions of data along with median, quartiles and potential
outliers; this enables a useful comparison between datasets containing inherent variability.
4. Predictive uncertainty visualization: Overlaying multiple model projections on top of one another shows the range within which a future trend is plausible in machine learning or forecasting. This gives decision-makers a reasonable understanding of risk.
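A minimal sketch of the error-bar idea (item 2 above), with illustrative group means and standard errors:
import numpy as np
import matplotlib.pyplot as plt

groups = ['A', 'B', 'C']
means = np.array([4.2, 5.1, 3.8])  # illustrative group means
sems = np.array([0.3, 0.5, 0.4])   # illustrative standard errors

plt.bar(groups, means, yerr=sems, capsize=5, color='skyblue', edgecolor='black')
plt.ylabel('Measured value')
plt.title('Bar chart with error bars indicating uncertainty')
plt.show()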
How to Choose a Technique:
When deciding on a technique for visualizing uncertainty, one should think about the type and level of
uncertainty you intend to convey: Is it in a single variable, a relationship, a distribution, a prediction, or a
comparison? The type and scale of your data must also be borne in mind: Is it categorical, numerical,
temporal, spatial, or multidimensional? Is it continuous, discrete, or ordinal? The goal and context of your
visualization should be considered for any visual: Will it be exploratory, explanatory, or persuasive? Who
is your audience, what is their background, and what do they care about? How will your visualization be
displayed and how will users access it? Other considerations include the number of data points. In climate
science, for example, projections of temperature in future decades typically have banded uncertainty
attached, indicating the range of different scenarios for modeling. The same principle is also applied in
stock market forecasting, where projected stock prices will be presented to include potential upper and
lower bounds[26].
Visualizing uncertainty and trends is a fundamental part of data analysis that greatly reduces the risk of misleading insights or over-confident conclusions. Analysts can effectively utilize confidence
intervals, error bars, and predictive uncertainty methods for communicating the range or spread of
possible outcomes.
Limitations and challenges: The visualization of uncertainty itself has limitations. A dataset may not come with error estimates; some uncertainty measures have no universally standard definition or methodology; and some uncertainty comparisons lack any valid or meaningful baseline. Some concepts of uncertainty do not accord with common sense or with your audience's expectations, and some uncertainty visualizations fail to convey the intended meaning or impression of the data. Uncertainty can also leave unclear what the proper decision is, and policymakers who expect different outcomes may never reach an agreed standpoint on the issue. Finally, some uncertainty representations may not correspond to, or may be irrelevant to, your purpose or context; some depictions of uncertainty may not be welcome or useful to your audience or stakeholders; and some manipulations of uncertainty may not be an honest or necessary treatment of your data and its impact.
3.5.1 Framing Probabilities as Frequencies:
Framing probabilities in terms of frequencies is a highly effective means of facilitating the understanding of uncertainty, especially for non-experts. Rather than saying there is a 20% chance of rain, it can be easier to frame it as a frequency, such as, "Out of 10 days like this, it rained on 2."
This lends itself to improved decision-making by contextualizing uncertainty in terms of real-world experience. Researchers have found that people understand probabilistic information much better when it is framed in terms of actual occurrences rather than abstractions.
For example:
Before talking about uncertainty visualization, we must first define what uncertainty should exactly mean.
We probably find it easy to intuitively grasp the concept of uncertainty in the context of future events. I
flip the coin and wait for the result. The outcome is uncertain. But there is also what may be called
uncertainty in terms of the past. If yesterday, I looked from my kitchen window exactly two times,
namely at 8 am and again at 4 pm, and at 8 am saw a red car parked across the street but not at 4 pm, I
know it must have left sometime in the eight-hour gap, but I don't know exactly at what time: 8:01, 9:30
am, 2:00 pm, or anytime during that eight-hour delay.
Mathematically, we embrace the notion of chances to deal with uncertainty. Definitions of probability are
rather complex, and this discourse extends far beyond this book. However, we can successfully reason on
probabilities without indulging in all mathematical complexities. In many practically relevant problems,
discussing probabilities as relative frequencies suffices. Suppose you conduct a fair random trial, say flipping a coin or rolling a die, and you are interested in a particular outcome, such as heads or rolling a six. Treat that outcome as a success and every other outcome as a failure. Then the probability of success is, approximately, the fraction of times you would expect to see that outcome if you repeated the random trial many times over. For example, if an outcome has a probability of 10%, you could expect to see that result in about one out of every ten trials. Framing probabilities in terms of
frequencies helps with risk communication by giving audiences a handle on weighing various possible
outcomes. It is particularly useful in places of application such as health care, finance, and weather
forecasting where uncertainty lingers in decision-making [26].
Visualizing individual probabilities can be difficult. How does one visualize the chance of winning a lottery, or of rolling a six with a fair die? In both cases, the probability is a single number. One could treat that number as an amount and show it using any of the techniques discussed earlier, such as a bar graph or a dot plot. But this would not be very helpful. Most people do not have an
intuitive understanding of how a probability value translates into experienced reality and showing that
value as either a bar or a dot placed on a line won't assist that endeavour.
To make the concept of probability more concrete, we can design a figure that combines the frequency framing with the uncertainty of the random trial itself, for example by randomly arranging squares of different colors. This approach is used in fig 3.10 to visualize three separate probabilities of success: 1%, 10%, and 40%. To read the figure, imagine that you must randomly choose a dark square without knowing beforehand which squares are dark and which are light (if you like, imagine selecting a square with your eyes closed). With a 1% chance, as shown in fig 3.10, it is very unlikely that you would pick the one dark square. It is still unlikely with a 10% chance, but much less so with a 40% chance.
Discrete outcome visualization is the mode of visualization in which we show potential specific
outcomes, whereas frequency framing is the visualization of a probability in the form of a frequency. We
address the probabilistic nature of an outcome in such a way that perceived frequencies of outcomes are
easily understood.
import numpy as np
import matplotlib.pyplot as plt

def plot_icon_arrays(probabilities, size=10):
    fig, axes = plt.subplots(1, len(probabilities), figsize=(len(probabilities) * 3, 3))
    for ax, p in zip(axes, probabilities):
        grid = np.random.rand(size, size) < p  # each cell is True with probability p
        ax.imshow(grid, cmap='gray', aspect='equal')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title(f'{p*100:.0f}%')
    plt.show()

plot_icon_arrays([0.01, 0.10, 0.40])  # the 1%, 10%, and 40% examples from the text
Output of the code is shown in fig 3.12
Output:
def plot_histogram(data, bins=5):
    plt.hist(data, bins=bins, edgecolor='black')
    plt.show()

plot_histogram(np.random.randn(100), bins=10)
Output of the code is shown in fig 3.13
Output:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

def plot_confidence_interval(data, confidence=0.95):  # header reconstructed for this excerpt; the name is illustrative
    mean, sem = np.mean(data), stats.sem(data)
    margin = sem * stats.t.ppf((1 + confidence) / 2., len(data) - 1)
    plt.hist(data, bins=10, edgecolor='black', alpha=0.7)
    plt.axvline(mean, color='red', linestyle='dashed', label='Mean')
    plt.axvline(mean - margin, color='blue', linestyle='dashed', label=f'{confidence*100:.0f}% CI')
    plt.axvline(mean + margin, color='blue', linestyle='dashed')
    plt.legend()
    plt.show()
Output of the code is shown in fig 3.14
Output:
plt.show()
Output of the code is shown in fig 3.15
Output:
Output:
Fig. 3.17: The concepts of confidence intervals and prediction intervals are explained in the context
of a linear regression model.
Here is a breakdown:
1. Regression Line (Black Line):
This is the best-fit line obtained by linear regression using the equation:
y = 10.406x + 60.552
It predicts the relationship between the independent variable (X-axis) and the dependent variable (Y-axis,
"Blood Glucose Level").
2. Confidence Interval (Blue Lines):
The 95% confidence interval refers to the range that is expected to contain the true regression line 95% of
the time.
Put differently, if the same experiment were repeated many times, then, 95% of the estimated best-fit
lines would fall within the blue lines.
The confidence interval is smaller than the prediction interval because, for the former, there is only
uncertainty in estimating the regression line and no variability in individual data points.
3. Prediction Interval (Purple Lines):
The prediction interval gives a region in which we can expect new individual data points to fall in 95% of cases.
This interval is wider because it carries two sources of uncertainty:
• The uncertainty in estimating the regression line.
• The natural variability in the data (i.e., individual data points will be more spread out than just the
mean trend).
4. Data Points (Red Dots):
These are the observed data points. Some fall within the confidence interval and some extend into the
wider prediction interval.
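The figure above comes from the authors' own data. A minimal sketch, with illustrative data, of how such confidence and prediction intervals around a fitted line can be computed with statsmodels:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 10 * x + 60 + rng.normal(0, 15, size=x.size)  # noisy linear relationship (illustrative)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
pred = fit.get_prediction(X).summary_frame(alpha=0.05)  # 95% intervals

plt.scatter(x, y, color='red', s=15, label='Data')
plt.plot(x, pred['mean'], color='black', label='Regression line')
plt.plot(x, pred['mean_ci_lower'], 'b--', label='95% confidence interval')
plt.plot(x, pred['mean_ci_upper'], 'b--')
plt.plot(x, pred['obs_ci_lower'], color='purple', linestyle=':', label='95% prediction interval')
plt.plot(x, pred['obs_ci_upper'], color='purple', linestyle=':')
plt.legend()
plt.show()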
3.6 Hypothetical Outcome Plots: Hypothetical Outcome Plots (HOPs) convey uncertainty visually by showing a set of possible outcomes, instead of summarizing it numerically with static properties such as confidence intervals. This allows viewers to build an intuition about the variability of the results.
Rather than showing uncertainty via error bars or confidence bands, HOPs use animation or a sequence of frames, where each frame presents one plausible simulated outcome drawn from the underlying probability distribution. Watching many such simulations gives the viewer a feel for the possible fluctuations and, importantly, for which direction things are likely to go.
3.6.1 Advantage of HOPs:
Better Understanding of Uncertainty: By illustrating multiple plausible outcomes, they provide users
with a better grip on the range of variation than a static representation.
More Engaging Visualization: Active animations capture attention and stimulate interaction with the
data.
Useful for Decision-Making: Stakeholders can gauge how robust predictions are by how often extreme
cases happen.
3.6.2 Applications of HOPs:
Forecasting and Risk Analysis: HOPs can illustrate uncertain scenarios in, for instance, financial market forecasting, weather prediction, and medical prognosis.
Experimental Results Interpretation: Scientists or engineers can use HOPs to illustrate uncertainty in
experimental data.
Machine Learning Model Evaluation: HOPs visualize the variability in the modeled predictions that
can be attributed to randomness in the training data or tuning of parameters.
Generate Hypothetical Outcome Plots:
Simulate Multiple Outcomes: Take a sample from the probability distribution of interest.
Generate Consecutive Frames: Each frame represents a single plausible outcome.
Animate the Results: The sequential display of frames illustrates the range of uncertainty.
Interpret the Patterns: Observe variability and frequency of extreme outcomes.
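A minimal sketch of these steps, assuming a normally distributed outcome; each animation frame shows one plausible draw from the distribution (all names and parameters here are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(0)
x = np.arange(5)                                   # five hypothetical categories
true_means = np.array([3.0, 4.5, 2.5, 5.0, 3.5])   # assumed underlying means
sd = 1.0                                           # assumed outcome variability

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(x, true_means, color='steelblue')
ax.set_ylim(0, 8)
ax.set_title('Hypothetical Outcome Plot: one simulated outcome per frame')

def update(frame):
    outcome = rng.normal(true_means, sd)  # one plausible outcome drawn from the distribution
    for bar, h in zip(bars, outcome):
        bar.set_height(h)
    return bars

anim = FuncAnimation(fig, update, frames=30, interval=300)
plt.show()  # in a notebook, the animation could instead be saved or rendered as HTML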
Chapter 4
THE PRINCIPLE OF PROPORTIONAL INK IN DATA VISUALIZATION
Arshdeep Singh1, Vikramjit Parmar2 and Ramandeep Kaur3
1,2,3GNA University, Phagwara
2. Enhances data readability and clarity: Insights are easier to comprehend when a visualization is well designed and minimizes unnecessary ink. Example: Removing non-data ink such as extra lines, shading, and unnecessary labels makes the data clearer, as shown in fig 4.2.
To avoid such pitfalls, data visualization designers should aim for clarity and accuracy rather than
aesthetic appeal. By observing best practices and sticking to the Principle of Proportional Ink, they can
ensure that visual data representations provide their intended value without deceiving the audience [28].
4.4 Visualization along linear axes:
Linear axes are most commonly used to represent data when dealing with quantities that have a proportional relationship. When the Principle of Proportional Ink is applied to linear-axis visualizations:
i. The height of bars in a bar chart should be directly proportional to the values they represent.
ii. Proportional distances should be maintained between points in line charts to ensure accuracy.
iii. Uniform spacing should be maintained in scatter plots to accurately represent relationships.
iv. Proper scaling should be ensured.
v. Avoiding unnecessary embellishments helps preserve the integrity of the data visualization.
i. Bar Charts:
Bar charts are often used to compare categories or groups. The length or height of each bar represents the value of its category, as shown in fig 4.6, and bar lengths must be proportional to the values to accurately reflect the differences.
Example:
Imagine a bar diagram showing the populations of different cities:
City A: 500,000
City B: 1,000,000
City C: 1,500,000
The City C bar is three times as long as the City A bar, and the City B bar is twice as long as the City A bar, correctly representing the populations of the cities.
Common pitfall: If the y-axis starts at 400,000, the visual difference between A and B appears larger than it really is. Always start the y-axis at 0 and make sure bar length is proportional to the data value.
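A minimal sketch contrasting a zero-based y-axis with a truncated one, using the illustrative populations from the example above:
import matplotlib.pyplot as plt

cities = ['City A', 'City B', 'City C']
population = [500_000, 1_000_000, 1_500_000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(cities, population, color='steelblue')
ax1.set_title('Y-axis starts at 0 (proportional ink)')
ax2.bar(cities, population, color='indianred')
ax2.set_ylim(400_000, 1_600_000)  # truncated axis exaggerates the differences
ax2.set_title('Truncated y-axis (misleading)')
plt.tight_layout()
plt.show()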
(ISBN: 978-93-48620-52-1)
• Year 1: $200,000
• Year 2: $400,000
• Year 3: $600,000
• Year 4: $800,000
• Year 5: $1,000,000
The vertical positions of the data points must be proportional to the revenue figures, so that equal increases in revenue produce equal vertical distances. To maintain proportionality and avoid misleading representations, it is advisable to start the y-axis at zero; otherwise the chart can exaggerate the perceived growth or decline.
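A minimal sketch of such a line chart with a zero-based y-axis, using the revenue figures listed above:
import matplotlib.pyplot as plt

years = [1, 2, 3, 4, 5]
revenue = [200_000, 400_000, 600_000, 800_000, 1_000_000]

plt.plot(years, revenue, marker='o')
plt.ylim(0, 1_100_000)  # start the y-axis at zero so growth stays in proportion
plt.xlabel('Year')
plt.ylabel('Revenue ($)')
plt.title('Revenue Growth with a Zero-Based Y-Axis')
plt.xticks(years)
plt.show()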
• Example: If a company wants to show how its revenue, expenses, and profit have changed over
five years, a single line chart can display all three trends clearly, whereas using a bar chart would
require a lot more space.
5. Can Show Small Changes Clearly:
• Because line charts use a continuous line rather than separate bars, even small fluctuations in
data are visible.
• Example: If a scientist tracks the daily growth of a plant, a line chart will show even the
slightest changes, while a bar chart might not represent small variations as effectively.
Disadvantages of Line Charts
1. Not Ideal for Large Data Sets with Too Many Data Points:
• If a line chart has too many data points, it can become cluttered and difficult to read. When
there are too many overlapping lines, the chart loses its effectiveness.
• Example: If you try to plot the daily stock prices of 50 companies over a year on a single chart, the lines will overlap, making it impossible to distinguish individual trends.
2. Can Be Misleading if Scales Are Not Properly Set:
• If the vertical axis (y-axis) scale is adjusted incorrectly, a line chart can exaggerate or minimize
changes in data, leading to misinterpretation.
• Example: If a company shows sales growth using a line chart but starts the y-axis at a high
number instead of zero, it might make a small increase in sales look much larger than it
actually is.
3. Not Suitable for Showing Exact Values:
• Since line charts focus on overall trends rather than individual data points, they are not the best
choice when you need to highlight exact values.
• Example: If a teacher wants to compare the test scores of students in different subjects, a bar chart would be better than a line chart, as it clearly displays individual scores.
4. Difficult to Interpret When Data Fluctuates Too Much:
• If data points have frequent sharp rises and drops, the line may look chaotic and hard to
analyze. This can make it difficult to determine if there is a meaningful trend.
• Example: If you track the number of website visitors every minute throughout a day, the line may go up and down rapidly, making it hard to identify useful trends.
5. Limited Use for Categorical Data:
• Line charts are best for continuous data and are not suitable for categories that do not follow a
natural order.
• Example: If a company wants to compare customer satisfaction scores for five different
products, a bar chart would be better than a line chart because customer ratings do not follow a
continuous progression like time or temperature [29].
iii. Scatter plots:
Scatter plots represent the relationship between two continuous variables by plotting data points on a Cartesian plane, where each point's position along the x and y axes represents its values for the two variables.
Example: A scatter plot (fig 4.9) illustrating the relationship between exam scores and hours studied:
X-axis: Represents hours studied
Y-axis: Represents exam Score
Each point represents hours of study and the corresponding exam score of a student. The pattern and the
distribution of points can reveal correlations, trends, or outliers.
Common Pitfall: Overplotting can occur when data points overlap one another, which makes it difficult to understand the data distribution. In such cases, techniques such as jittering (adding small random variations) can be employed to improve clarity.
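A minimal sketch of the jittering idea, with illustrative study-hours data in which many points share the same x value:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = rng.integers(1, 6, size=200)               # whole hours studied, so many points overlap
scores = hours * 10 + rng.normal(0, 8, size=200)   # illustrative exam scores

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.scatter(hours, scores, alpha=0.6)
ax1.set_title('Raw points (overplotting)')
jitter = rng.uniform(-0.15, 0.15, size=hours.size)  # small random horizontal offsets
ax2.scatter(hours + jitter, scores, alpha=0.6)
ax2.set_title('With jitter')
for ax in (ax1, ax2):
    ax.set_xlabel('Hours studied')
ax1.set_ylabel('Exam score')
plt.tight_layout()
plt.show()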
Advantages of Scatter Plots:
1. Excellent to Identify:
• Example: A scientist tracking global temperatures over the last 100 years can plot thousands of
data points to identify long-term climate trends.
5. Can Show Non-Linear Relationships:
• Scatter plots are not limited to straight-line trends; they also reveal curved, exponential, or other
complex relationships between variables.
• Example: In economics, the relationship between tax rates and revenue might not be a simple
straight line but a curve showing diminishing returns.
Disadvantages of Scatter Plots
1. Cannot Display Exact Data Values Conveniently:
• Because every point only displays one value, scatter plots cannot display exact numbers like bar
charts or tables.
• Example: If a company wishes to compare sales per month between various years, a bar chart
may be more suitable because it displays exact values more conveniently.
2. Does Not Work Well with Categorical Data:
• Scatter plots need numerical data for both axes, so they are not a good choice for categories such
as "Male vs. Female" or "Product Type A vs. Product Type B."
• Example: If a store wishes to compare customer satisfaction ratings for five brands, a bar chart
would be preferable to a scatter plot.
3. Hard to Interpret When Data Points Overlap Too Much:
• If too many points are on top of each other, it is difficult to discern separate data points and
trends.
• Example: A hospital plotting patient weight against cholesterol might have many data points
piled on top of each other, making it difficult to interpret. Jittering (shifting points slightly) or
transparency can remedy this.
4. Does Not Always Prove Causation:
• Scatter plots indicate correlation but not causation—just because two variables vary together
doesn't mean one causes the other.
• Example: A study could indicate that ice cream sales and drowning rates go up at the same time,
but it does not imply that eating ice cream leads to drowning. The actual reason is that both occur
more during the summer.
5. Can Be Misleading if Scales Are Not Chosen Properly:
If the scales on the axes are adjusted, the correlation between variables can seem stronger or weaker
than it is.
Example: A business might modify the scale on a scatter graph of ad spending to sales such that the
graph could appear to suggest that spending a bit more results in a dramatic increase in sales even
though the actual effect is small.
4.5 Visualization along logarithmic axes:
Logarithmic axes are practical for showcasing data that spans multiple orders of magnitude. When a
logarithmic scale is used, the spacing is determined by a logarithmic function rather than a linear one
between the values. On a logarithmic scale, each unit increase on the axis corresponds to multiplication by a constant factor (commonly 10) of the previous value. For example, the intervals can be 1, 10, 100, 1,000, and so on, as illustrated in fig 4.9. This scaling is advantageous when processing data covering a wide
range, as it compresses larger values and expands smaller ones, which allows for more comprehensive
visualization [29].
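A minimal sketch comparing a linear axis with a logarithmic one for illustrative values spanning several orders of magnitude:
import matplotlib.pyplot as plt

labels = ['A', 'B', 'C', 'D', 'E', 'F']
values = [1, 10, 100, 1_000, 10_000, 100_000]  # spans five orders of magnitude

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(labels, values)
ax1.set_title('Linear axis (small values vanish)')
ax2.bar(labels, values)
ax2.set_yscale('log')  # equal spacing for each factor of 10
ax2.set_title('Logarithmic axis')
plt.tight_layout()
plt.show()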
4. Bubble Charts for Comparing Multiple Variables:
Bubble charts present three dimensions of data: X-axis, Y-axis, and bubble size. They are frequently
employed in economics, healthcare, and environmental science to illustrate relationships between various
variables.
Example: Global health bubble chart with each nation appearing as a bubble; GDP per capita on the x-
axis, life expectancy on the y-axis, and bubble size for population.
5. Epidemiology and Disease Outbreak Visualization:
Area-based visualization is used directly in monitoring the outbreak of diseases such as COVID-19, Ebola, or the flu. Circular or shaded map areas indicate regions of infection, helping policymakers plan interventions.
Example:
A COVID-19 heatmap (shown in fig 4.26) in which red-colored areas have higher infection rates, and
blue or green-colored areas represent lower case numbers.
2. Scientific Research and Astronomy:
2D histograms are employed to graph galaxy distributions, space temperature gradients, and black
hole radiation patterns in astronomy.
Example: A 2D histogram of star brightness vs. temperature is used to determine the type of stars (as
shown in fig 4.43)
3. Scatter Plot Density Representation:
When numerous points are plotted, contour lines group the overlapping data and reveal the major trends.
Example: An education level vs. income scatter plot with contour lines illustrates the most
frequent income ranges, as shown in fig 4.50.
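One way to sketch this, using synthetic data and a kernel density estimate from SciPy, is shown below; the variable names and distributions are assumptions made for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
education_years = rng.normal(14, 3, 2000)                # synthetic years of education
income = education_years * 3 + rng.normal(0, 6, 2000)    # synthetic income (thousands)

# Estimate point density and draw contour lines over the raw scatter
kde = gaussian_kde(np.vstack([education_years, income]))
xs, ys = np.mgrid[education_years.min():education_years.max():100j,
                  income.min():income.max():100j]
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

plt.scatter(education_years, income, s=3, alpha=0.2)
plt.contour(xs, ys, density, levels=6)
plt.xlabel("Education (years)")
plt.ylabel("Income")
plt.show()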
Contour and heatmap visualization is also applied in sports to study player movement, ball paths, and scoring patterns.
Example: A player-movement heatmap in a soccer game, as shown in fig 4.42.
Chapter 5
COLOR USAGE IN DATA VISUALIZATION
Shruti1, Arshdeep Singh2, Neharika Sharma3 and Sumit Chopra4
1,2,4GNA University, Phagwara
3Rayat Bahra Group of Institutions and Nanotechnology, Hoshiarpur
Representing neutral information with an alarming color such as red is likely to cause panic and alarm
when no emergency exists. Similarly, using subtle pastels to report alarming data, as in risk
assessment reports and emergency response dashboards, fails to convey the seriousness the information
demands.
Hence, without aligning one's color choices with conventional meaning and expectations of the audience,
clarity and effectiveness in data visualization will not be achieved. Following common color conventions
should therefore enable designers and analysts to not only aid comprehension and minimize cognitive
overload but also allow viewers to grasp the insights with minimal confusion.
Fig. 5.3: Earnings shown in red while losses are shown in green.
If a design or interface does not conform to established conventions, users will be confused about how
to use it. Design consistency allows users to anticipate functionality, and when conventions are broken it can
create frustration and confusion, as shown in fig 5.3. For instance, if commonly used symbols or colors are
applied in an unfamiliar way, users will misinterpret their meaning. This leads to
misinterpretation of information: inadequate design decisions, such as deceptive visualizations, confusing
labels, or inconsistent data representation, may cause users to misread major insights. This is
particularly harmful in graphs and charts, where incorrect scaling, deceptive colors, or perplexing
legends can lead to inaccurate conclusions. Intuitive and precise presentation is needed to
communicate accurate information effectively. When a chart or graphic is not clearly and simply designed,
users must take additional time to learn how to read and understand it. Confusing or cluttered graphics,
ambiguous legends, or irregular formatting add cognitive load, making it more difficult for users to grasp
the critical points quickly. A neat chart ought to convey insights without demanding much
effort from the reader [34].
Some people utilize color on things that do not need it. For example, if all of the text on a page is in
different colors without a real reason, then it is distracting. When design elements, like too many colors or
ornaments, are added without purpose, they don't help make the message more understandable. Rather
than helping people understand better, they become distractions that do not make the content clearer or
more effective. Each design decision must serve a functional purpose to enhance communication. A
messy or disorganized design may give a website, presentation, or document an unprofessional look.
Inconsistent application of colors, fonts, or graphics can provide an amateurish look, decreasing
credibility. Well-organized, well-balanced design reflects attention to detail and strengthens the overall
user experience. Poor color choices can also cause readability issues, especially on certain backgrounds: insufficient contrast
between text and background colors makes reading difficult, especially for visually impaired users.
For instance, light-colored text on a white background or dark-colored text on a very dark background
will cause eye strain. Providing adequate contrast and accessible color schemes enhances readability and
user experience for all users.
Irrelevant and meaningless colors in design and data visualization are best avoided by following
best-practice measures for clarity and meaning. One of the most effective of these is complying with
color conventions. Some colors are widely accepted as having specific meanings: for example, red for
warnings or errors, green for successful operations, and blue for hyperlinks. By adhering to the
expected use of colors, information can be found and understood as intuitively as possible. Another
principle is that color should be used only where it is really needed. Too many unnecessary colors
introduce clutter and make it harder for users to see the more important elements. Color should be
reserved for emphasizing significant data points, differentiating categories, or guiding user attention;
any color that serves no function is better left out. Finally, consistency in color application matters.
Similar items must always be coded in the same way so the design stays coherent: if blue is used for
hyperlinks on one page, the same blue should not be reused for error messages on another. When color is
applied consistently, users can navigate and interpret documents with little effort.
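One lightweight way to enforce such consistency in code, sketched here with matplotlib and invented category names and values, is to define the color mapping once and reuse it in every chart.

import matplotlib.pyplot as plt

# Define the palette once so "Furniture" is always the same colour everywhere
CATEGORY_COLORS = {
    "Furniture": "#1f77b4",
    "Technology": "#ff7f0e",
    "Office Supplies": "#2ca02c",
}

sales = {"Furniture": 120, "Technology": 210, "Office Supplies": 95}

plt.bar(list(sales.keys()), list(sales.values()),
        color=[CATEGORY_COLORS[c] for c in sales])
plt.ylabel("Sales")
plt.show()

Reusing the same dictionary across every worksheet or figure keeps each category's colour stable, which is exactly the kind of consistency described above.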
5.2.3 Pitfall 3: Employing Nonmonotonic Color Scales to Represent Data Values
Color is a critical data visualization tool that allows the audience to rapidly understand complicated
information. However, misuse of color scales can result in misinterpretation, visual disorientation, and
data distortion. One of the most frequent errors is the use of nonmonotonic color scales, that is, scales
that are not perceptually smooth and therefore make it hard to interpret data values correctly. There are
two types of color scales: the monotonic color scale and the nonmonotonic color scale. A monotonic color
scale is a logical, perceptually smooth progression that allows viewers to easily relate increasing or
decreasing color intensity to a corresponding change in data values. A nonmonotonic color scale,
by contrast, does not exhibit a smooth or consistent progression, which breaks the link between color and value.
Using many colors without applying them systematically can create a chaotic and nonsensical visual
experience. When colors are placed randomly rather than logically, users have difficulty
discerning patterns and meaning, and the lack of consistency makes information harder to read quickly and
efficiently. Sudden changes in brightness or saturation can be visually jarring and distracting. High-
contrast color transitions, unless used intentionally, can make text harder to read and comprehend. In
charts or graphs, such changes might mislead people by creating spurious points of emphasis
that do not reflect the significance of the data. When colors are distributed unevenly
across a design or visualization, some information inevitably ends up emphasized more
than the rest. This distorts the perceived meaning of data points and can lead to erroneous interpretation. A
balanced color scheme ensures emphasis is obtained consciously and consistently, guiding users to the
appropriate conclusions.
Diverging color scales are appropriate only when it is justifiable to declare a true and meaningful midpoint
in the data. In other words, whenever the information depicts a positive-negative contrast (e.g., profits
versus losses, temperature differences, or election outcomes), a diverging scheme such as red-white-blue
can indicate the differences. Careless use of diverging colors can, however, give rise to the wrong
impression, especially when the midpoint bears no operational weight.
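A minimal matplotlib sketch of this rule, using invented profit-and-loss figures and a diverging colormap centred on zero, might look like this:

import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

regions = ["North", "South", "East", "West"]
profit = [120, -45, 30, -10]   # invented profit/loss values

# Centre the diverging colormap on the meaningful midpoint (zero)
norm = TwoSlopeNorm(vmin=min(profit), vcenter=0, vmax=max(profit))
colors = plt.cm.RdBu(norm(profit))

plt.bar(regions, profit, color=colors)
plt.axhline(0, color="black", linewidth=0.8)
plt.ylabel("Profit / loss")
plt.show()

If the data had no natural midpoint, a sequential colormap would be the safer choice.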
5.2.4 Pitfall 4: Failing to Design for Color-Vision Deficiency:
Color plays an essential role in design, data visualization, and user experience. It enhances readability,
boosts engagement, and conveys meaning quickly. However, color perception varies among individuals.
Those with color vision deficiency (CVD), commonly known as color blindness, may struggle to
differentiate between certain colors. In spite of the widespread nature of color blindness, numerous
designers, developers, and researchers do not consider it, which leads to inaccessible designs, deceptive
data visualizations, and poor user experience.
Color vision deficiency (CVD) is a condition that causes individuals to have difficulty differentiating
between specific colors as a result of how their eyes perceive light. About 300 million people globally are
affected, which includes 1 in every 12 men and 1 in every 200 women. It affects designs as most
designers take it for granted that color is perceived by all people in the same manner, resulting in visual
elements that are impossible—or at least very hard—for colorblind people to understand. Overlooking
color accessibility can lead to unclear warnings, alerts, and UI elements [35].
Fig. 5.5: Color-vision deficiency (CVD) simulation of the sequential color scale Heat
Designing for color blindness matters because using only color to convey crucial information may
create accessibility problems, since some users will not be able to comprehend the message. Though color
is a strong visual device, it should always be paired with additional design elements to maintain clarity
and inclusiveness. For example, take an online form where errors are merely highlighted by a red
outline or red text. A user who is completing the form may come across a field that becomes red when the
user enters incorrect data. But the problem is that not everyone views colors in the same manner.
Individuals with red-green color deficiency, the most prevalent of the color vision deficiencies, might not
be able to tell the red error indicator apart from other items on the page. This would lead to confusion, as
they would either not know an error has been made or have difficulty finding it on the form. Also, in low-
light environments or on low-contrast screens, red marks may not stand out clearly enough to get the
point across. The solution is to ensure that error messages are understood by everyone: designers need to
include more than one visual cue in addition to color. Placing an exclamation point or warning symbol
next to the error field makes the problem stand out even for users who are
unable to see red. Showing a message like "Invalid email format" gives a concise description of the issue.
Applying a dotted or bold underline rather than simply altering the color makes the problem noticeable in
a non-color-based manner.
To define categories in graphs and charts through color alone creates a problem for accessibility,
especially for people who have color vision deficiency. The chart should be designed in a way that
everybody, irrespective of their capacity for color perception or not, has no difficulty at all interpreting
data. For example: A pie chart with red, green, and blue used for different categories and no labels,
patterns, or other distinguishing markings. Users must distinguish among the sections by color alone.
The problem is that individuals who are red-green color blind (the most prevalent type) cannot tell the
red and green sections apart and therefore cannot interpret the chart effectively. Even those with typical
color vision might struggle in dim lighting or on displays with low color contrast. If the chart is printed
in grayscale, all the colors collapse into similar shades of gray, rendering the data virtually impossible to
interpret. Users also have to refer to an external legend, switching their attention back and forth, which
raises cognitive load and slows down interpretation. Rather than
depending on color alone, fill areas of the chart with striped, dotted, or crosshatched patterns. This
enables users to distinguish areas even in grayscale or color-blind viewing environments. Simply labeling
every section of a chart, as opposed to utilizing a distinct legend, allows the user to perceive the data
without color-matching. In line graphs or bar charts, employing a variety of different shapes (i.e.,
triangles, circles, squares) for data points allows them to be distinguished without reliance on color. For
digital charts, making accessibility settings available to enable users to alter colors or enhance contrast
can improve readability for a variety of audiences.
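A hedged example of combining patterns and direct labels so a chart survives grayscale printing is sketched below; the categories and percentages are invented.

import matplotlib.pyplot as plt

categories = ["Furniture", "Technology", "Office Supplies"]
values = [40, 35, 25]          # invented percentages
hatches = ["//", "..", "xx"]   # patterns distinguish bars without relying on colour

fig, ax = plt.subplots()
bars = ax.bar(categories, values, color="lightgray", edgecolor="black")
for bar, hatch, value in zip(bars, hatches, values):
    bar.set_hatch(hatch)
    # Direct labels remove the need for a separate colour-matched legend
    ax.annotate(f"{value}%", (bar.get_x() + bar.get_width() / 2, value),
                ha="center", va="bottom")
ax.set_ylabel("Share of sales (%)")
plt.show()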
Best Practices for Color-Blind-Friendly Design:
Color palettes that are accessible to the general public, including people with color vision deficiency
(CVD, simulated in fig 5.5), should be used when developing data visualizations. Data is often misread
because many traditional color schemes are difficult for colorblind people to distinguish. Visualization
tools such as Tableau, Matplotlib, and D3.js offer color palettes created especially for accessibility,
promoting inclusivity.
Scientific data visualization becomes an important aspect for making complex data more understandable
so that researchers and professionals can detect patterns, trends, and outliers. Visualization should ensure
that the data is not only good-looking but also correct, readable, and available to everyone using it. A
major aspect of data visualization is the selection of color schemes because incorrect choice of color
schemes can cause misinterpretation and inaccessibility. It is here that Viridis, a perceptually uniform
colormap, has found itself an indispensable resource for scientific visualization. Scientific visualizations
have historically used the rainbow color scale, based on a sequence of colors from violet through to red.
Although it is perhaps the most intuitive-looking choice, the rainbow scale has severe limitations. It is not
perceptually uniform, i.e., certain transitions of color look more extreme than others, even though the data
behind them is changing uniformly. This can skew the view of data relationships and result in false
conclusions. The rainbow scale is also a problem for color vision deficient people, especially those with
red-green color blindness, since some colors become impossible to distinguish. Viridis, created as a better
alternative, is a colorblind-friendly, perceptually uniform, and high-contrast colormap. It is a smooth
gradient from dark blue to light yellow, so all transitions are perceived evenly. In contrast to the rainbow
colormap, which produces artificial visual discontinuities, Viridis has uniform luminance progression,
making it simpler to perceive subtle data variations. This makes it particularly valuable in scientific
applications like heatmaps, geospatial maps, and medical imaging, where precise color reproduction is
critical to making accurate conclusions [35].
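The difference is easy to see side by side; the sketch below renders the same synthetic surface with a rainbow-like colormap (jet) and with Viridis. The surface itself is an arbitrary smooth function chosen only for illustration.

import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2))     # smooth synthetic surface

fig, (ax_jet, ax_vir) = plt.subplots(1, 2, figsize=(8, 3))

# Rainbow-style map: uneven perceptual steps can create false boundaries
ax_jet.imshow(z, cmap="jet")
ax_jet.set_title("jet (rainbow-like)")

# Viridis: perceptually uniform and colourblind-friendly
ax_vir.imshow(z, cmap="viridis")
ax_vir.set_title("viridis")

plt.tight_layout()
plt.show()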
One of the most important benefits of Viridis is that it is accessible. It is created to be readable by people
with different types of color blindness, so that scientific information is not excluded. In addition, it is very
readable when printed in black and white, which makes it a useful option for research articles, where
reproduction of color is not always guaranteed. Its simplicity, perceptual similarity, and easiness to
produce make it particularly suitable for visualizing data with the assurance of presenting accurate
meaning and being read easily by as wide an audience as possible. Choosing the correct color scheme,
though, is not always simple. Inaccurate color use can result in misinterpretation, unavailability for
colorblind users, and poor communication of important findings. To counter these problems,
ColorBrewer, a cartography tool created by cartographer Cynthia Brewer, offers optimized color palettes
specially made for mapping and data visualization. It is a collection of well-designed palettes of colors
that improve readability, accessibility, and perceptual correctness. The palettes find extensive use in
geographic information systems (GIS), thematic maps, and statistical graphics [36].
These qualities make ColorBrewer an effective instrument in cartography, demography, environmental mapping, and urban planning, where
intuitive and unambiguous visual communication is critical [36].
Dependence on color alone to communicate information in charts, graphs, and maps can lead to
accessibility problems, particularly for people with color vision deficiencies. To enhance clarity and
inclusivity, designers and data visualizers need to add other elements like labels, patterns, and icons.
These add emphasis to the message, making visual data easier to understand for a wider audience. Labels
are one of the easiest methods of enhancing data visualization. By inserting text annotations directly into
a chart or map, users can instantly comprehend the meaning of each part without needing to consult a
legend elsewhere. For instance, in a pie chart, inserting percentage values and category names inside the
segments obviates the need to correlate colors with an external key, accelerating understanding. In the
same manner, labeling points directly in a scatter plot can render trends more obvious, avoiding
ambiguity due to overlapping hues. Patterns offer another non-color-related method for distinguishing
parts within a visualization. This works well for grayscale prints or where people have difficulty
perceiving color differences. In bar graphs, for example, striped, dotted, or crosshatched patterns may be
employed in place of (or in addition to) color to differentiate among various data groups. In geographical
maps, land use types (urban, forest, and water bodies) may be symbolized with textured overlays such
that each category can be identified even in the absence of color.
Chapter 6
INTRODUCTION TO TABLEAU
Vikramjit Parmar1, Debjit Mohapatra2, Suruchi3 and Shruti4
1,2,3,4GNA University, Phagwara
For example, an online store is able to track website traffic, sales performance, and customer behavior in real time,
allowing it to make decisions on inventory, pricing, or marketing strategy immediately. This real-time
capability allows organizations to stay agile and respond instantly to shifts [38].
3. Advanced Analytics & AI Integration:
Tableau also has advanced analytics features that are integrated, such as statistical modeling, forecasting,
and insights based on AI.
Trend analysis, predictive forecasting, and clustering can all be used without coding skills. Tableau also
integrates with AI and machine learning models, enabling organizations to apply automatic anomaly
detection, predictive models, and sentiment analysis across enormous datasets. For example, an
AI-powered retail business can anticipate purchase patterns in a customer segment and adjust promotions
based on these predictive insights. In this way, AI integration builds on data-driven analysis by revealing
patterns that are not visible at first glance in the raw data.
4. Mobile-Friendly Design:
Tableau ensures that reports and dashboards are fully responsive and mobile-ready, meaning they can be
accessed and consumed across multiple devices such as tablets and smartphones.
Unlike some reporting programs that require separate mobile settings, Tableau dynamically resizes
visuals for different screen sizes. This is especially handy for executives and field agents who need
access to essential business data away from the workstation. For example, a regional sales manager can
see real-time sales performance and customer patterns on their mobile phone and make sound decisions
even outside the workplace.
1. Overview of Tableau Environment:
Prior to using Tableau for data visualization and analysis, it is important to set up the environment
correctly. The initial setup process is important in ensuring seamless functionality, maximum
performance, and effective management of data. Although Tableau has a user-friendly and intuitive
interface, incorrect installation or configuration can result in performance problems, compatibility issues,
and challenges in managing large datasets.
The installation process varies based on the Tableau product being used, either Tableau Desktop, Tableau
Server, or Tableau Cloud. Users need to ensure that their system supports the required hardware and
software requirements such as the availability of enough processing power, memory, and storage
capacity, particularly when dealing with large data sets.
Moreover, correct data connection configuration is crucial. Tableau accommodates multiple data sources,
including relational databases, cloud platforms, and spreadsheets. Configuring secure and optimized
connections to these data sources allows data to be loaded, processed, and visualized effectively. By
installing and configuring Tableau with care, users can optimize its performance, minimize possible
errors, and produce smooth data-driven visualizations for enhanced decision-making.
2. Minimum System Requirements:
Before installing Tableau, make sure your system has the required hardware and software specifications.
Table 6.1: Minimum system requirements
Operating System Windows 10 (64-bit) or later/macOS 10.14 or later
Processor 2 GHz or faster (multi-core recommended)
RAM Minimum 8GB (16GB+ recommended for larger datasets)
Storage At least 5GB of free disk space
Internet Required for activation and cloud-based features
Database Connectivity ODBC/JDBC for databases like MySQL, SQL Server, PostgreSQL
3. Setting Up Tableau:
The installation of Tableau is quite simple, but proper setup is critical for flawless functioning. Be it
Windows or Mac, the process of installation of Tableau Desktop and Tableau Public remains more or less
the same with minor variations in file running and directory configuration.
The initial step is to download Tableau from the website (tableau.com). Go to the Download page and
select the proper version—Tableau Desktop for business reporting and analysis, or Tableau Public, a free
version with limited features. To continue with the download, users are required to log in or register for a
Tableau account. After downloading, installation is initiated by executing the installer file. In Windows,
users are supposed to double-click the .exe file, whereas for Mac, they have to open the .dmg file and
move the Tableau app into the Applications folder. Upon opening the installer, users have to read and
agree to the End User License Agreement (EULA) to continue. Next, users have the option of selecting
the install location: Windows users can choose to install Tableau in a custom location if necessary, while
Mac users install it in the Applications folder by default. After installation, the software automatically
asks users to log in. For Tableau Desktop, the user must enter their license key to activate it. For Tableau
Public, the user is asked to sign in with their Tableau account credentials so that they can access
cloud-enabled visualization capabilities. Once logged in, the final step is completing the configuration.
When Tableau is opened, the Start Page is displayed, where users can
navigate through available features and link to data sources like Excel files, databases, or live cloud-based
data. With installation and configuration now done, users can start creating interactive dashboards and
meaningful visualizations, making Tableau an effective tool for data-driven decision-making [39].
6.1.2 Navigation in Tableau:
1. Introduction to Tableau navigation:
Tableau offers a simple and intuitive user interface for visualizing data as shown in fig 6.2. Effective
navigation of Tableau is important to accomplish activities like importing data, creating visualizations,
and managing dashboards. This chapter discusses Tableau's main components, their functionalities, and
how users can use the interface smoothly.
2. Elements of the Tableau Interface:
The Start Page: The Start Page is the initial screen displayed upon launching Tableau. Recent
workbooks, sample datasets, and several data connection options are readily available. Elements that
constitute the Start Page: The Connect Panel allows users to connect to databases, cloud apps, and Excel,
among others. You can find the most recent Tableau workbooks you opened under "Open/Recent
Workbooks."
3. Navigation and Interface of Workbook:
Once you have opened a workbook or have created a new one, Tableau shows you its primary workspace,
which includes a number of significant components.
2. Toolbar:
The Toolbar, which is situated beneath the bar menu, offers easy access to often-used features.
Table 6.3: Toolbar
New Dashboard Create a new dashboard
Undo/Redo Reverse or redo last action
Save Save Workbook
Show Me Opens Chart suggestions based on highlighted data
Formatting Customize fonts, colors, and border styling
New Worksheet Add a new Worksheet
Zoom Adjust visualization’s zoom level
3. The Data Pane:
The Data Pane displays Dimensions (categorical fields) and Measures (numerical fields). Its most prominent
part separates your data fields into two categories. Dimensions: typically qualitative,
category data such as names, dates, or geographic data. Blue icons are typically used to represent
dimensions. Measures: Numerical, quantifiable information that can be summed and analyzed. Green
icons usually illustrate measures.
Tableau applies this section to tell you which fields are employed to categorize and group your data
(dimensions) and which fields can be counted or measured (measures).
4. Worksheets, Dashboards, and Stories:
On the bottom row, Tableau has three chief components for visualization creation:
Worksheets:
A sheet (or "worksheet") is the base unit in Tableau, where a single graph, chart, or table enables the
user to analyze a particular aspect of the data. A single worksheet is meant to hold one view, such as a
bar chart, line graph, or scatter plot. One of its strong points is interactivity, allowing users to apply
filters, sort data, and emphasize important insights in the sheet directly. A workbook may include several
worksheets, and users can analyze different views of the same data before combining them into
dashboards or reports. Because worksheets are extremely flexible, users can test different visualization
techniques before settling on their data presentation. To add a new worksheet, the users can just click the
\"+\" icon at the bottom of the screen.
Dashboards:
A dashboard comprises more than one worksheet presented together on one screen, as shown in fig 6.3,
giving a wide overview of insights in the data. Dashboards allow users to aggregate findings across
various worksheets, facilitating easy comparison and correlation of data. Dashboards also have interactive
capabilities like filters, dropdowns, and highlight action, allowing users to browse and analyze data
interactively. With a flexible layout, dashboards can be resized, rearranged, and formatted to enhance
readability. Additionally, when connected to a live data source, dashboards update automatically,
ensuring real-time data visualization. Users can fine-tune the layout by using the Dashboard Layout
Panel, which provides customization options for adjusting component sizes and positions [40].
Stories:
A Tableau story is a designed series of worksheets and dashboards, viewed as a slideshow to convey
insight in a methodical, step-by-step order. This aspect is especially applicable for data storytelling, where
the results are formatted chronologically to take an audience through a tale. Every point in the story (or
slide) has a dashboard or worksheet, serving to break down involved data into manageable insight. Stories
also have interactive components, which allow users to investigate various parts of the data within the
story. Users can create a story by dragging and dropping worksheets or dashboards into the Story
workspace, with a seamless and directed flow of information.
6.1.3 File and Data Types in Tableau:
Tableau is a robust data visualization tool that works with a wide range of
file and data formats. To process, visualize, and analyze data properly, these formats must be understood. This
section covers the kinds of files that Tableau can handle, the kinds of data that Tableau can use, and the
best ways to handle them.
6.1.3.1 File types in Tableau:
Workbooks, data sources, extracts, and packaged data are all arranged into several file formats by
Tableau. These formats are divided into two categories: those specific to Tableau and those for external
data sources.
1. File Types Particular to Tableau:
Tableau uses its file extensions to store and share data visualizations effectively.
Table 6.4: File types particular to Tableau
File Type Extension Purpose
Tableau Data Source .tds Saves connection information but not the data
Tableau Packaged Data Source .tdsx Saves both connection information and extracted data
Tableau Workbook .twb Saves visualizations but not data
Tableau Packaged Workbook .twbx Saves both visualizations and data for convenience
Tableau Data Extract .hyper/.tde Compressed, optimized snapshot of data for performance
Tableau Preferences .tps Saves custom color palettes for consistent styling
Tableau Map Source .tms Saves customized geographic maps
2. Formats of External Data Sources:
There are many different external data sources that Tableau may connect to. Typical file formats include
the following:
Table 6.5: Formats of external data sources
File Type Extension Purpose
Comma-Separated Values(CSV) .csv Common text-based tabular format
Microsoft Access Database .mdb, .accdb Links to Access database files
Microsoft Excel .xls, .xlsx Connects to Excel workbooks
Text Files .txt Uses tab or other delimiters for tabular data
Google Sheets Online Links to cloud spreadsheets
Data Types in Tableau:
1. Primary Datatypes:
Table 6.6: Primary datatypes
Datatype Description Example
Date Date values without time "16-02-2001"
Date & Time Contains both date and time "16-02-2001 12:30 P.M."
Boolean True/False values used for logical operations "In Stock: True/False"
String (Text) Categorical data represented as text "Product Name", "Category"
Geographic (Location) Geographic fields like country or latitude "USA", "New York", 28.718° N
Number (Whole/Decimal) Numerical values (both whole and decimal) 200, -25, 3.2
2. Automatic Data Type Identification:
Tableau identifies data types automatically, but for proper analysis you may sometimes have to change
them. Through type conversion or calculated fields, Tableau lets you alter the data type of a field.
Example 1: Converting a string to a date:
If a data field contains a date stored as a string (e.g., "21-02-2001"), you can convert it to a date using:
DATEPARSE("dd-MM-yyyy", [Date_String])
Example 2: Converting a string to a number:
If a numeric field is stored as text (e.g., "254"), you can convert it to a number using:
INT([String_Field])
3. Discrete and Continuous Data:
Discrete data is data that falls within definite groups or categories. It cannot be meaningfully divided. In
Tableau, it is generally represented in blue. Examples are customer names (e.g., "John," and "Michael")
and product categories (e.g., "Furniture" and "Clothing"), which are all discrete data examples. Discrete
data is categorical or countable in nature. There are no meaningful values in between (e.g., "half an
employee ID" is not a valid value). It is utilized for filtering and grouping data. Continuous data is
measurable values which may have any value in a range. It is numeric and may be in fractional or decimal
form. In Tableau, it is generally shown in green color. It is always numeric and measurable. There are
infinite possible values in a range. It can be subdivided (for example, time can be subdivided as hours,
minutes, and seconds). Examples of Continuous Data: Age (such as 18, 21.5, 30.7), Time (such as 12:00
A.M, 2:02 P.M) .
6.1.3.2 Data Source:
In Tableau, a data source is the basis for producing visualizations. It is where the raw data is located,
which Tableau processes, analyzes, and interprets into valuable insights. A data source can be from
manually constructed datasets, Excel files, cloud services, or enterprise databases. The ability of Tableau
to work with various data sources is what makes it a formidable business intelligence and data analysis
tool [40].
Type of Connection:
Tableau provides the capability to connect to data in two manners: Live Connection and Extract
Connection. Live Connection gets data directly from the source in real time so that visualizations can be
up to date with the latest updates. This is ideal for situations where data constantly changes, e.g.,
analyzing stocks or observing live sales. But performance relies on network efficiency and database
speed. Conversely, an Extract Connection builds a static snapshot of the data, keeping it locally for
performance optimization. Extracts enhance processing efficiency by minimizing repeated database
queries, thus being perfect for use with big datasets or where real-time updates are not needed.
Custom Data View feature enables you to alter, restructure, and customize the data source view prior to
utilizing it for visualization. By using filtering, renaming, creation of calculated fields, definition of data
types, and organization of data connections, it enables users to personalize and refine data as illustrated in
fig 6.4.
A Custom Data View in Tableau enables users to alter and refine a dataset prior to developing
visualizations. The process entails cleaning and formatting data, combining several tables by joining or
connecting them, eliminating unwanted data, designing calculated fields for bespoke measures, and field
renaming or modifying data types. By pre-refining the dataset, users make sure that the data is thoroughly
structured and geared for effective analysis.
Why is a Custom Data View Necessary?
Building a Custom Data View assists in eliminating unnecessary data clutter, making visualizations more
effective by loading only necessary information. It also improves dashboard performance, with only the
necessary data being processed. Data integrity is enhanced by eliminating inconsistencies and filtering out
errors, and the dataset's flexibility is increased through the application of custom SQL queries and
calculated fields. These modifications make the dataset more dynamic and flexible for advanced analysis.
Example of a Custom Data View in Tableau:
Suppose there is a sales dataset that has fields that do not pertain to sales such as "Customer Phone
Number" or "Order Processing Time." Rather than handling a messy dataset, you have the ability to hide
unimportant fields, set filters on a given year of data, and add a new field for the profit percentage.
Through such cleansing of the dataset, it will become tidy, more efficient, and more revealing, thus
informing sounder decisions.
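Outside Tableau, the same data-preparation steps can be sketched in pandas; the file name and column names below (sales.csv, Customer Phone Number, Order Date, Sales, Profit) are assumptions for illustration, not fields from any particular dataset.

import pandas as pd

# Hypothetical raw sales data
df = pd.read_csv("sales.csv", parse_dates=["Order Date"])

# Hide irrelevant fields
df = df.drop(columns=["Customer Phone Number", "Order Processing Time"])

# Keep only one year of data
df = df[df["Order Date"].dt.year == 2023]

# Add a calculated field for profit percentage
df["Profit %"] = df["Profit"] / df["Sales"] * 100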
6.1.5 Extracting data:
Data extraction in Tableau refers to the method of pulling a smaller portion of data from a large dataset
and saving it in a specific extract file format. This practice alleviates pressure on live databases, enhances
performance, and facilitates easier offline access. By storing a compressed version of the data, extracts
allow Tableau to work with massive datasets effectively while speeding up queries and interactions.
Data extraction offers several advantages:
Data extraction provides many benefits in Tableau, chief among them performance. Because Tableau does
not have to request data from the live database each time a query is executed, working with extracted data
is noticeably faster: queries execute more quickly, dashboards load more easily, and users enjoy seamless
interaction with their visualizations. This advantage matters most when working with larger datasets or
complex calculations.
Offline availability is another key benefit. When a lot of extracted data is available, users can utilize their
datasets without the availability of an internet connection or a live database. This is most helpful to
professionals who need to analyze data while on the move or during remote work assignments. With
locally saved extracts, users can create reports and further insights without worrying about constant
access to a live database connection.
Data extraction adds to information security from another angle. Organizations can restrict access to
sensitive data by storing it in a local repository. Extracts can be shared with users authorized to access
that particular data without giving them access to the entire live database. This lowers the chances of a
security breach.
Another advantage is streamlined and organized data. Extracts contain only the relevant data, keeping
dashboards tidy and free of clutter. Along the way, unstructured data is cleaned, making it easier for users
to analyze trends and obtain useful insights.
In addition, the practice aids in reducing the load on servers. Since Tableau holds extracted data locally,
the live database is spared the many simultaneous queries that multiple users would otherwise send. This
eases the load on database servers and helps keep the system efficient. Any organization that relies on real-time
analytics can hugely benefit from extracts by ensuring the availability of live data sources for core
operations, while for reporting and analysis, Tableau users can use extracts as depicted in the fig 6.5.
Step 2: Choose the Extract Mode:
Now that the data connection is established, it's time to decide whether to conduct all operations using a
Live Connection or an Extract. A live connection will continuously pull real-time data from the source,
which can be exhausting in terms of resources. Instead, choosing Extract will allow Tableau to save a
snapshot of the data locally and improve the speed of any query, thus lessening the load from accessing
the original data source. To toggle from the Live feature to an extract mode, just select Extract from the
Data Source column.
Step 3: Set the Preference for Extraction:
Before finalizing, configure the extraction settings to optimize performance.
Clicking Edit on the extract settings allows users to set filters, aggregation, and limits on the data to be
retrieved. Extract filters can be set to include only the required records, which helps avoid extraneous
data extraction. For example, in a dataset containing sales records spanning multiple years, a filter can
be applied to keep only records from 2023 onward, reducing the size of the extract while keeping it relevant.
Step 4: Save and Extract:
After setting the extraction parameters, the extract file needs to be saved in a location. Clicking on Extract
Data will begin the extraction process and save the data in Tableau's extract format (.hyper or .tde). The
.hyper format is the latest and more optimized version of extract file types, promising high performance
and good handling of large datasets. The extracted file acts like a stand-alone dataset and can be used
within Tableau without having to keep a constantly updated connection with the original data source.
Step 5: Open Tableau and Import the Extracted Data:
Once the extract file has been created, users can import it into Tableau to develop dashboards, build
visualizations, and perform in-depth exploration of the data using the extracted dataset. Working with a
local extract in this way keeps analysis responsive and independent of the live data source.
6.2 Field Operations:
Field manipulation is one of the essentials in Tableau for preparing data for analysis and visualization.
These operations make it possible to organize, resize, and manage fields effectively. In Tableau, fields
correspond to the columns of the data source and are classified into different types based on their nature
and function.
Dimensions are qualitative data that can categorize and segment data from the set. These include
dimensions like product name, category, country, and the like. They are non-mathematical, but they help
organize and group the data. Measures, in contrast, consist of quantitative values that can be aggregated
or used in mathematical operations. Examples of measures include quantity, profit, and sales, which
generate key performance indicators and statistics.
Besides these elements, there are also calculated fields, which allow a user to write a formula based on
existing fields to customize a calculation for a specific purpose. This feature is handy when further
transformation or business logic needs to be applied without changing the underlying dataset or data
source. Another area of Tableau field operations is aggregation. Fields that have been processed through
sum, average, count, minimum, or maximum operations are known as aggregated fields. The advantage of
aggregations is that they summarize huge datasets into smaller, readable segments.
6.3 Metadata in Tableau:
Metadata in Tableau is the descriptive and structural data regarding a dataset that assists users in
organizing and understanding their data effectively. It consists of field names or column headers, which
identify the labels for various data attributes. These names are important for providing clarity and
organization while handling datasets. Every field in Tableau is given a data type, e.g., String, Number,
Date, or Boolean, that dictates how the data is computed and rendered. Choosing the right data type is
important because it impacts calculations, filtering, and sorting in the visualization.
Metadata also involves field properties, which determine if a field is of type Dimension or Measure and
whether it is Discrete or Continuous. Dimensions hold categorical data like names or locations, while
Measures store numeric values to be calculated. In the same manner, Discrete fields are used as distinct
individual values, whereas Continuous fields are stored as a range, often applied in graphs and time-series
analysis. It also includes calculated fields and aliases, through which users can derive new fields from
existing data and assign friendlier display names. Hierarchies and relationships also expand metadata by specifying relationships
between tables and defining structured drill-downs, e.g., drilling down sales data into region, country, and
city. Metadata in Tableau can be edited by users through renaming fields, data type modification, proper
formatting, and organizing the dataset for performance and analysis. These improvements make data
clean, well-structured, and optimized for effective visualizations.
6.3.1 Editing Metadata in Tableau:
1. Renaming Fields:
The data source may have field names with ambiguous labels, spaces, or abbreviations. Renaming fields
makes them easier to use and comprehend.
To change a field's name:
Navigate to the Data Pane in Tableau > Right-click on the field you wish to rename > Select Rename and
input the new field name.
Example:
Change cust_id → Customer ID
Change ord_dt → Order Date
This enhances the readability and user-friendliness of the dashboard.
2. Converting Data Types:
Tableau assigns data types automatically, but certain modifications may be required to carry out correct
analysis. To modify a data type:
Open the Data Pane > Select the data type icon beside the field name > Select the proper data type from
the list.
Example:
Change 20240101 (string) to Date Format (YYYY-MM-DD).
Modify the Discount from Integer to Decimal to calculate exactly.
Incorrect data types can cause analysis errors (e.g., analyzing Order Date as text rather than a date).
3. Changing the Default Properties:
Tableau allows you to change the default properties of fields, such as currency type, number format, and
aggregation. To change the default aggregation:
Select a Measure field (like Sales) with a right-click > Select Aggregation under Default Properties >
Select SUM, AVG, COUNT, MIN, MAX, and so forth.
To change the format of a number: Right-click on a numeric field (e.g., Profit) > Click Default Properties
→ Number Format > Select Currency, Percentage, Decimal, etc.
Example:
Alter Profit to Currency Format ($1,000.00).
1. Understanding the Visualization Space in Tableau:
The primary visualization area in Tableau is the centre of the worksheet, where charts, graphs, or maps
are displayed. It is an empty canvas for transforming raw data into meaningful information. How the
various fields are placed on the different shelves shapes the resulting view and brings the story of the
data to life.
Besides being an exhibition floor, this visualization space is also an interactive space where the user
interacts with data points. Instead of passive images, Tableau makes data objects interactive—for
instance, the user might click a bar in a bar chart to filter the related data or hover over a point in a scatter
plot to glean detailed insights. All these interactions make data exploration more intuitive and user-
friendly. It provides many visualization options, including bar graphs, line charts, scatter plots, maps, pie
charts, etc. With this flexibility, users may opt for the visual form best suited to their dataset and the kind of
insight they want to explore. Such flexibility is one of the strengths of Tableau, enabling personalized
and meaningful data storytelling. Another strong feature of Tableau's visualization space is the
immediacy with which it responds to changes made by the user or underlying data. Filtering, changing
calculations, or working with a dashboard visually refreshes the Tableau views right away in a seamless
and dynamic manner. With this real-time flexibility, end users are given instant feedback, thus enabling
continuous iteration through their data exploration.
2. Shelves:
Fields placed on shelves such as Columns, Rows, Filters, and Pages define the structure of the view, while
additional detail can be accessed via interactivity, such as in tooltips or drill-downs. By carefully placing
data fields on these shelves, users can customize their visualizations, add interactivity, and significantly
improve the way they present data in Tableau.
3. Marks Card:
The Marks card controls visual properties such as color, size, shape, and labels, improving legibility.
Finally, Tooltips offer the ability to tailor the pop-up text that appears when marks are hovered over,
providing users with easy access to precise information without overloading the visual design.
4. Data Pane:
Chapter 7
CREATING DATA VISUALIZATION IN TABLEAU
Suruchi1, Gagandeep Singh2 and Arshdeep Singh3
1,2,3GNA University, Phagwara
The chart can also be filtered by order date so the analysis can focus on relevant data. Other
customization options are available through the Marks card, including color, size, labels, details, and tooltips.
Comparison is one of the chief applications of bar charts. Bar charts are good for making comparisons
between a large number of groups or categories, be it sales revenue across different products or regions.
Assuming a business needs to evaluate the overall revenue of sales for various product categories in a
given year, the product categories would be denoted on the X-axis, either on a less aggregated level, such
as furniture, electronics, and office supplies, for example, or on a very aggregated level, such as large,
small, or medium. The different heights of the bars, representing the income brought in by that category,
clearly show the area of performance comparison across different groups. Distribution Analysis, the next
important application, serves to show how one variable is distributed among various categories. One
specific instance could be that the bar chart can show the distribution of students' grades in a class where
each bar represents a grade range, and its height shows the number of students who received that grade.
This makes it simple to identify patterns such as clustering around certain scores or perhaps the presence
of outliers.
Finally, if time periods are treated as categories, bar charts can be employed for Trend Analysis. Rather
than the classic line graphs, bar charts can show trends over time by placing different time intervals, such
as months, quarters, or years, on the X-axis and the corresponding values on the Y-axis. This would
aid in registering all the sales, revenue, website traffic, or customer behavior concerning changes in the
different time periods.
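For readers without Tableau, the same category comparison can be sketched in a few lines of matplotlib; the category names and revenue figures here are invented for illustration.

import matplotlib.pyplot as plt

categories = ["Furniture", "Electronics", "Office Supplies"]
revenue = [420, 610, 280]   # invented yearly revenue (thousands)

plt.bar(categories, revenue)
plt.ylabel("Revenue (thousands)")
plt.title("Revenue by product category (illustrative)")
plt.show()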
7.1.2 Line Chart:
A line chart, or line graph, is a graphical representation of data that plots information as a sequence of
data points joined by straight lines. The charts are used to display the trends, patterns, and changes in a
continuous interval and are best employed for the analysis of temporal data. The x-axis is often employed
to plot time or a sequence, while the y-axis is employed to plot the measured values.
Data preparation:
Prepare the correct time-series data before plotting a line graph. Date or time fields should be
formatted correctly, such as "YYYY-MM-DD" or "Month-Year," to allow for proper trend analysis.
Check for missing values or other discrepancies that can disrupt the visualization, and clean the data so
the trend is represented without misleading gaps on the chart.
Drawing the Line Chart in Tableau:
Connect the data source to Tableau and open the new worksheet to begin. From there, drag the date or
time field onto the Columns shelf; this dimension defines the x-axis (time). Then drag the numeric
measure whose value over time is to be represented, such as sales, profit, or revenue, onto the Rows shelf
to define the y-axis.
Tableau may choose the default automatic visualization type. Go to the Marks card, click the dropdown
menu, and choose "Line" to change to a line chart. In situations where the date field has several levels
(e.g., year, quarter, month, day), drill down or aggregate the data at other levels using Tableau's date
hierarchy.
Improving the Line Chart
For multi-measure comparisons, adding multiple numeric fields to the Rows shelf or the Marks card
renders each as a separate line within the same graph. Lines can be distinguished by dragging a measure
or category onto the Color shelf in the Marks card. Markers may be added at selected data points to
highlight significant values, such as a drop or a peak within a month. Another avenue is adding trend
lines or forecasting features within Tableau, which anticipate emerging trends by analyzing historical
information. Such enhancements significantly deepen the insights available for decision-making.
Add annotations to highlight noteworthy points:
A line chart places time intervals on the X-axis, while the corresponding values appear on the Y-axis;
this is the most common way to assess trends, review predicted future movements, and identify
seasonality in data.
7.1.3 Pie Chart:
Imagine a pie chart as cutting up a real pie at dinner with your family - everyone receives a slice, and you
can immediately tell who received the larger slices! That's precisely what pie charts do with information.
The underlying premise of pie charts is that they can represent part-to-whole relationships in a visually
engaging format. By representing data as relative portions of a pie, pie charts take advantage of our
natural feel for relative sizes and angles and enable readers to better understand the way particular
portions relate to a whole set of data. Such graphic representation works particularly well if the goal is to
highlight relative sizes of multiple categories of one variable.
In data visualization, pie charts play a unique role of emphasizing the part-whole relationship rather than
attempting to make precise numerical comparisons.
Creating Pie Charts in Tableau (The Practical Way):
3. Customizing the Pie Chart:
The pie chart can be improved for understandability and visual appeal by applying various
customizations. While dragging the "Category" dimension to the Label shelf would ensure that category
names are shown on slices, placing Sum of Profit on the Label shelf would ensure profit values are
visible. Further formatting can be done by right-clicking on the label and selecting "Format" to refine text
size, alignment, etc. The default colors assigned to each category by Tableau could be customized by
clicking the Color shelf in Marks card and selecting "Edit Colors". This customization will allow better
visualization to distinguish between categories. Tooltips, which display detailed information by hovering
over the slice, can also be modified by clicking on the Tooltip shelf and editing the text to provide extra
insights. Further, the Size option in the Marks card allows altering the overall size of the pie chart itself,
which helps balance its visual weight within a dashboard or report.
4. Interpreting the Pie Chart:
Examining the pie chart after creation and customization provides some useful information. For example,
if using sales data by category, the pie chart paints a picture of how profits behave across the groups of
Furniture, Office Supplies, and Technology. The size of each slice depicts the measure of profit given to
each category and allows for straightforward comparison. The absolute profit value (for example,
SUM(Profit) = 292,297) is normally shown, helping the reader comprehend company performance status
at a glance.
7.1.4 Scatter Plot:
A scatter chart, or scatter plot or scatterplot, is a visualization tool that illustrates the relationship between
two quantitative variables. Every point on the chart is a single data point whose position is defined by its
coordinates on two axes (x and y). Scatter charts are great at showing patterns, correlations, clusters, and
outliers in datasets and are extremely useful for statistical analysis and research in many disciplines.
Scatter plots can show hundreds or thousands of separate data points on a single chart at a time, giving
valuable insight into data relationships.
Creating Scatter Charts in Tableau:
For starters, drag the first measure to the Columns shelf, which defines the x-axis, then drag the second
measure to the Rows shelf, which sets the y-axis. That sets up a two-dimensional space that plots data
points based on these measures. Tableau may not immediately grant you a scatter plot but rather keep any
mark type by default. Hence, click the dropdown box on the Marks card to select "Circle" or "Shape" so
that the individual data points are clearly displayed. Adjust point size or opacity if required for an
effective view, especially with many data points.
Enhanced scatter plot features for analysis:
Adding a trend line aids in the identification of patterns and correlations. This can be done from the
Analytics pane by dragging a Trend Line onto the scatter plot, facilitating the visualization of the strength
and direction of the relationship between the two variables. A scatter chart can also use color, size, or
labels to express additional variables. By dragging a categorical field onto the Color shelf, data points are
differentiated by groups for easy observation of clusters. If a third numerical variable is to be represented,
sizing the points differently will reflect the differences in magnitude. Labeling specific data points, such
as outliers or highlighted trends, can give more context and improve readability.
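As a rough equivalent outside Tableau, the sketch below plots a scatter chart and overlays a simple least-squares trend line. It assumes a pandas DataFrame with two numeric columns named `sales` and `profit`; both the column names and the values are illustrative assumptions, not the book's data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed example data with two quantitative variables
df = pd.DataFrame({
    "sales":  [120, 340, 560, 210, 430, 650, 300, 510],
    "profit": [ 15,  60, 110,  25,  70, 130,  40,  95],
})

fig, ax = plt.subplots()
ax.scatter(df["sales"], df["profit"], alpha=0.7)  # individual data points

# Fit and draw a first-degree (linear) trend line
slope, intercept = np.polyfit(df["sales"], df["profit"], 1)
xs = np.linspace(df["sales"].min(), df["sales"].max(), 100)
ax.plot(xs, slope * xs + intercept, linestyle="--", label="trend line")

ax.set_xlabel("Sales")
ax.set_ylabel("Profit")
ax.legend()
plt.show()
```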
7.1.5 Bubble Chart:
A bubble chart represents data points as bubbles, with a numeric measure determining the size of each bubble on the chart. The measure used for bubble sizing must be positive in order for the bubbles to be portrayed properly.
If desired, a categorical dimension can be placed on the Color shelf to distinguish groups within the data
and assist in the visual identification of patterns or categories. Labels can also be added for direct
identification of the bubbles, which becomes especially helpful in accentuating some important data
points.
Technical Considerations in Tableau:
Bubble charts are numerical representations; hence, the measures used to plot the x-axis and y-axis should be continuous rather than categorical, which allows Tableau to place them correctly in the coordinate system. The measure used for bubble size must also be numerical and positive, since negative values cannot be represented as a bubble's size. Well-formatted and well-scaled data is crucial as well: inconsistent scaling can make some bubbles appear much larger or smaller than others and skew the interpretation of the data. Tableau allows the appearance of the bubbles to be customized for better readability.
Best Practices for Implementation:
Before being able to create an effective bubble chart, it is very important to prepare and clean the data. If
the measures used for bubble size differ markedly in scale, standardizing the values may help to retain
proportionality. The removal of extreme outliers is important in avoiding distortion, as one single
extremely large value can make all the other bubbles look too small. It is equally important to ensure that
all measures are on a compatible scale for proper representation. In some cases, transformation of the data
could be done, for example normalizing the values or log scaling, so that visualization will be clearer.
1. Data Preparation:
First, connect the data source in Tableau. After establishing the connection, choose the essential fields for the visualization: a categorical dimension (like "Category") that identifies each bubble, and a numerical measure (like "Sales") that determines the size of the bubble.
2. Creating the Bubble Chart:
Begin with dragging the "Category" dimension to the Color shelf of the Marks card. This means that a
different color will be assigned to the bubbles based on different categories, thus enhancing clarity and
differentiation. Next, drag the measure "Sum of Sales" onto the Size shelf on the Marks card, which will
size each bubble proportionately based on total sales for that category. Dragging the Category dimension
to the Label shelf will label each bubble with the name of its category. It is only necessary to check that
the proper mark type is set to "Circle" since Tableau will automatically assign a default chart type.
Choose "Circle" from the drop-down on the Marks card, and Tableau will take care of the rest.
3. Customizing:
Many customization options are available in Tableau to enhance the readability and presentation of the
bubble chart. You can define the category colors according to your preferences by clicking on the Color
shelf and selecting “Edit Colors.” Bubbles can be made smaller or larger according to your liking by
editing the size on the Size shelf.
Tooltip information is also encouraged, as it facilitates better interactivity with the visualization and
allows further information to display when hovering over a bubble. The indicator "Sum of Sales"
automatically goes into the Tooltip shelf; however, further editing can be performed on the text and
formatting features if you were to click on Tooltip and edit from there.
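A comparable bubble view can be sketched in Python. In the assumed example below, bubble area is scaled by a sales total per category; the numbers are illustrative placeholders rather than values from the book:

```python
import matplotlib.pyplot as plt

# Assumed category totals; bubble size must be positive
categories = ["Furniture", "Office Supplies", "Technology"]
sales = [742_000, 719_000, 836_000]

fig, ax = plt.subplots()
x_positions = range(len(categories))
# Spread bubbles along x for visibility; marker area (s) is proportional to sales
ax.scatter(x_positions, [1] * len(categories),
           s=[v / 1_000 for v in sales],  # scale down so bubbles fit the axes
           alpha=0.6)

for x, label, value in zip(x_positions, categories, sales):
    ax.annotate(f"{label}\n${value:,}", (x, 1), ha="center", va="center")

ax.set_xlim(-1, len(categories))
ax.set_yticks([])
ax.set_title("Sales by category (bubble size = total sales)")
plt.show()
```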
7.1.6 Gantt Chart:
A Gantt chart is a horizontal bar chart used specifically to display project timelines and progress over time. Named after Henry Gantt, it represents activities or tasks along the y-axis and their respective time periods along the x-axis. Each bar represents a task's duration, with the bar's position and length indicating when the task starts, how long it lasts, and when it ends.
Creating Gantt Charts in Tableau:
To create a Gantt chart in Tableau, connect the project data first. Drag the Start Date field to the Columns shelf to form the timeline, then drag the Task/Activity field onto the Rows shelf to list all project tasks. Change the mark type to Gantt Bar in the Marks card, and place Duration or End Date on the Size shelf to denote the length of each task bar. Further customization can be achieved by coloring the bars according to categories or statuses.
For enhanced functionality, tasks can be differentiated from one another with color based on category or progress; such color coding helps distinguish different groups of tasks. Dependencies between tasks can also be displayed to show how phases of a project are linked. When tracking task progress, a percentage-completion indicator may be added, important deadlines can be highlighted using milestone markers, and hierarchical grouping can be adopted to keep related tasks together for clarity.
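The same idea can be sketched with matplotlib's horizontal bars, where each bar starts at a task's start date and its length encodes duration. The task names, dates, and durations below are assumptions made for illustration only:

```python
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import date

# Assumed project tasks: (name, start date, duration in days)
tasks = [
    ("Requirements", date(2024, 1, 1), 10),
    ("Design",       date(2024, 1, 8), 14),
    ("Development",  date(2024, 1, 20), 25),
    ("Testing",      date(2024, 2, 10), 12),
]

fig, ax = plt.subplots()
for i, (name, start, duration) in enumerate(tasks):
    # One Gantt bar per task: left edge = start date, width = duration
    ax.barh(y=i, width=duration, left=mdates.date2num(start), height=0.5)

ax.set_yticks(range(len(tasks)))
ax.set_yticklabels([t[0] for t in tasks])
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))
ax.invert_yaxis()  # first task on top, as in a typical Gantt chart
ax.set_title("Project timeline (Gantt-style)")
plt.show()
```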
Used well, a Gantt chart provides a clearer picture of how a project is being managed: in short, a real-time visual for tracking deadlines and execution while ensuring employees know about the tasks at hand.
The Marks Card gives relevant options for customizing the appearance of the Gantt chart. The mark type can be left as Automatic, whereby Tableau selects the most appropriate chart type for the structure of the data, and further customization is available through Color, Size, Label, Detail, and Tooltip, making the visualization interactive and informative.
The accompanying figure presents a Gantt chart visualizing the Sub-Category of products and their Ship Mode over weeks based on the Order Date.
7.1.7 Histogram:
Creating Tableau Histograms:
An easy way to build a histogram in Tableau is to drag a measure (numeric field) to the Columns shelf and pick "Histogram" from the Show Me menu. Alternatively, right-click the measure and select Create → Bins, and Tableau will automatically create bins that group the data into ranges. The height of each bar in a histogram reflects the frequency count of values inside a bin. To edit the bins, right-click the binned field in the Data pane and select Edit; changing the bin size controls how the data is grouped and how many bins appear. Choosing a reasonable bin width ensures that the histogram presents the salient features of the dataset's distribution without overgeneralizing or overcomplicating the view.
Fine-tuning the histogram's bin size and number of bins leads to a more precise and insightful representation of the data distribution, bringing trends, outliers, and the general spread of the data to light for easier analysis and interpretation of its patterns.
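The effect of bin choice can be sketched outside Tableau with a small Python example. The `quantity` values below are randomly generated stand-ins for order quantities, and the `bins` argument plays the role of Tableau's bin settings:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed order quantities (synthetic, illustrative data)
rng = np.random.default_rng(seed=42)
quantity = rng.poisson(lam=4, size=500) + 1

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Few, wide bins: the overall shape is visible but detail is lost
axes[0].hist(quantity, bins=5, edgecolor="black")
axes[0].set_title("5 bins (coarse)")

# More, narrower bins: the distribution's shape is easier to read
axes[1].hist(quantity, bins=15, edgecolor="black")
axes[1].set_title("15 bins (more detail)")

for ax in axes:
    ax.set_xlabel("Quantity")
axes[0].set_ylabel("Count of orders")
plt.tight_layout()
plt.show()
```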
The X-axis (Columns) shows the Quantity field grouped into bins, so end users can analyze the frequency distribution of the various quantity values. The Y-axis (Rows) depicts Count of Quantity, CNT(Quantity), which is the number of data points that fall within each bin range, indicating how common each quantity range is in the dataset. Each bar in the histogram represents one bin, and the height of the bar is proportional to the number of data points (orders) falling within that bin range; taller bars indicate a large number of orders in that quantity range, while shorter bars signify few. No filters are applied to this chart, so it includes the entire dataset without restriction and gives a full view of the quantity distribution across all available data points. Additional customization is offered by the Marks Card: the mark type is set to Automatic so that Tableau can determine the best styling for the data, and the user can further adjust Size, Label, Detail, Color, and Tooltip to make the chart readable and highlight key insights. This flexibility allows end users to modify how the histogram is presented, making it more informative and visually appealing.
7.1.8 Waterfall Chart:
Waterfall charts show the cumulative effect of sequential increases and decreases on an initial value, leading to a final total. This type of chart is particularly useful for analyzing changes in financial data, performance metrics, or any scenario where intermediate contributions affect an overall result. The key idea in a waterfall chart is tracking the change at every stage in the dataset: it portrays how much each individual positive or negative contribution influences the overall trend. Generally, increases are given one color (for example, green) while decreases are shown in a contrasting color (for example, red), allowing the viewer to assess the contribution of each step easily.
Another important feature is the starting and finishing points, which walk the viewer from the initial value to the final cumulative result. Each step on the chart represents a category or time period and shows how that stage affects the total. For instance, a waterfall chart of sales information displays how different product sub-categories affect total revenue, informing the company about the factors driving profitability or loss.
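Because a waterfall is essentially a running sum drawn as floating bars, a minimal sketch can be written directly in Python. The sub-category names and sales changes below are assumed values; each bar's bottom is the cumulative total before that step:

```python
import matplotlib.pyplot as plt

# Assumed sales contributions by sub-category (positive and negative changes)
labels  = ["Chairs", "Phones", "Tables", "Binders", "Copiers"]
changes = [26_000, 44_000, -17_000, 30_000, 55_000]

bottoms, running = [], 0
for change in changes:
    bottoms.append(running)   # each bar floats on the cumulative total so far
    running += change

colors = ["green" if c >= 0 else "red" for c in changes]

fig, ax = plt.subplots()
ax.bar(labels, changes, bottom=bottoms, color=colors)
ax.axhline(running, linestyle="--", linewidth=1, label=f"Final total: {running:,}")
ax.set_ylabel("Sales")
ax.set_title("Waterfall of sales by sub-category")
ax.legend()
plt.show()
```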
Fig 7.8 presents a waterfall chart visualizing the cumulative Sum of Sales across different Sub-Categories. The breakdown is as follows:
In this waterfall chart, the changes in sales across different sub-categories can be interpreted using several key features. The X-axis (Columns) represents the Sub-Category, with each product sub-category occupying its own position along the axis, making it easy to compare sales changes among categories. The Y-axis (Rows) depicts the running sum of Sales, the cumulative total accumulated through each sub-category, which tracks how sales evolve across categories and indicates overall trends and contributions. The Gantt Bars present the individual change in sales for each sub-category: their size is determined by the magnitude of the change, while their direction indicates whether the change is an increase or a decrease for the sub-category in question. Color differentiates increases and decreases in sales; here, green bars represent sub-categories that improve sales and contribute positively to the running sum, making the pattern easy to see at a glance. No filters are defined in the chart, so all sub-categories are included without restriction, allowing a complete view of sales patterns across products. The Marks Card provides further options for enhancing the visualization: the mark type is set to Gantt Bar, and Color, Size, Label, Detail, and Tooltip can be customized to improve the presentation of the data, its readability, and the insights it conveys.
7.2 Tableau Dashboards:
Key Features of Tableau Dashboards:
Interactive filtering is one of the best features of Tableau dashboards. Users can apply filters or parameters to change what data is shown dynamically, refining the analysis according to specific criteria and drilling down to particular aspects of a dataset, for instance viewing sales figures for a specific region or filtering data by time period. Another important capability is dashboard actions, which allow end users to interact with charts and tables: a single action, such as clicking a data point in one visualization (like a bar in a bar chart), can cause other visualizations to update. Tableau also supports URL actions for launching external links, a convenient way to provide further context or resources. For better usability, dashboards can be designed for different devices, with layouts customized for desktops, tablets, and mobile devices so that they adjust to various screen sizes while remaining accessible and readable. Finally, Tableau enables storytelling with dashboards for advanced reporting and presentations, whereby multiple dashboards can be pulled together into a coherent narrative; this is particularly useful for step-by-step analyses that communicate insights and trends with ease.
7.2.1 Building a Tableau Dashboard:
In the process of building a Tableau dashboard, data is imported from various sources such as Excel, SQL
databases, or CSV files. Once connected, the requisite datasets are loaded into Tableau's workspace and become ready for visualization.
Individual worksheets are created with each carrying a particular type of visualization, like bar graphs,
line graphs, or pie charts. These worksheets are representations of specific aspects of the data, for
instance, "Sales by Region" or "Profit Trends" which will later be assembled into a dashboard. Once
visualizations are ready, the next step is to create a brand-new dashboard by going to the "Dashboard"
option in Tableau. Following this, individual worksheets are dragged and dropped into the dashboard
workspace and arranged into a layout of their own choosing. Users can resize and position the elements
according to their analysis requirements.
To make the dashboard effective, interactive elements such as filters, parameters, and dashboard actions can be added for better usability. These features increase interactivity, allowing users to dynamically explore the data from different angles. Formatting is then applied to enhance clarity and maintain a consistent design.
Types of Tableau Dashboards:
1. Dashboards for operations:
Operational dashboards summarize real-time insights into ongoing business processes, letting users see activities, performance metrics, and key indicators at a glance for daily performance tracking. These dashboards are essential for managers and executives who need to track progress, identify bottlenecks, and take action, thereby supporting ongoing improvement efforts. A good example is a sales operations dashboard showing daily orders, revenue, customer activity, and sales performance by region, all critical dimensions for surfacing sales trends, identifying best-selling products, and adjusting sales strategies. In a manufacturing facility, an operational dashboard might monitor production rates, machine downtime, and supply chain efficiency, giving the organization the flexibility to respond to imminent problems.
Most operational dashboards include KPIs such as order fulfilment rates, inventory levels, and employee productivity indicators, enabling businesses to plan ahead in their operations. Data is frequently refreshed, sometimes even in real time, to ensure that users benefit from the latest information.
2. Dashboards for analysis:
Analytical dashboards provide long-term analysis of performance, historical trends, and predictive
insights. These dashboards are for a deep-dive investigation into various views that help businesses detect
patterns that may not be immediately seen in the routine operations of the organization. In the case of a
financial analysis dashboard, years of revenue growth, trends in profit and loss, and budget variances
could all be tracked as illustrated in fig 7.9. Here, financial analysts will assess how the company is doing,
recognize spending patterns, and estimate future financial results.
In a similar fashion, a customer retention dashboard could track customer churn rates, levels of
engagement, and satisfaction ratings for marketing teams as they refine their strategies to enhance
customer loyalty. Besides, analytical dashboards would allow businesses to analyze trends like seasonal
sales fluctuations, the effectiveness of marketing campaigns, or employee performance over time.
Because analytical dashboards work on historical data, they usually employ advanced visualizations,
statistical models, and forecasting tools to get even deeper insights. These dashboards enable decision-
makers to build strategic plans based not on gut feeling but on systematic evidence.
3. Strategic Dashboards:
As an example, a performance dashboard for a company might contain quarterly metrics on income, expenditure, and newly acquired clients. This enables executives to understand the sources of revenue, manage costs, and analyze customer acquisition. Most strategic planning dashboards are refreshed at monthly or yearly intervals, so companies can work with accurate data and make data-driven decisions aligned with their long-term targets.
4. Tactical Dashboards:
Where executive (strategic) dashboards exhibit KPIs at a high level, tactical dashboards used by middle management are concerned with efficiency at the departmental level. These dashboards offer detailed views of individual departmental performance so that managers have the information they need for day-to-day decision making.
For example, a customer service dashboard could show customer satisfaction scores, ticket status, market shares (as shown in fig 7.10), and response times to inquiries. Such data gives managers the picture they need to gauge trends in customer inquiries, identify problems in service delivery, and distribute resources more effectively. When response times grow longer or customer satisfaction scores dip, the dashboard provides instant visibility, allowing teams to take corrective action in real time.
These dashboards also support workflow optimization and resource planning, helping managers set targets, evaluate staff performance, and deploy improvement strategies. Integrating interactive elements such as filters, drill-down reports, and alerts increases usability and allows attention to be focused on a particular area of concern. Used this way, such dashboards lead to smoother operations and higher productivity, and in turn build a basis for increased customer satisfaction.
7.3 Formatting:
Formatting serves both aesthetics and clarity: it makes visualizations more attractive and easier to read. Formatting involves adjusting colors, fonts, borders, alignment, and other visual components to improve how dashboards read and how professional they appear. With proper formatting, reports look good and are straightforward to interpret. Proper formatting gives structure and a lively way of presenting data insights, as shown in fig 7.11, making it easy for users to analyze trends, compare values, and make data-driven decisions.
Worksheet Formatting in Tableau:
Within the Format menu, shading, borders, fonts, and other display characteristics can be adjusted to enhance the overall appearance of the visualization.
Font style and size modification are among the key worksheet formatting features. Borders and gridlines
can be added or removed for a more structured and visually appealing workspace. Customizing
background color is also an important aspect of making improvements regarding contrast as well as
highlighting crucial data points.
Text Formatting in Tableau:
Text formatting is a key aspect of design in Tableau. Tooltips, axis labels, titles, and captions can all be personalized, and well-formatted text makes the data easier to read and interpret when drawing out key insights. To format text, select the specific text object, whether a tooltip, title, or label, then right-click it to reach the Format option and adjust alignment, font size, color, and style to suit the design. For instance, axis labels that are bold and enlarged are easier for viewers to take meaning from. It is also important to use consistent font styles across dashboards and worksheets so that everything looks uniform and professional.
Number Formatting in Tableau:
Good number formatting is important when working with numerical data, so that it is presented understandably. Tableau offers number formats such as decimal, currency, percentage, and scientific notation, chosen according to the data.
Chapter 8
ADVANCED FEATURES IN TABLEAU
Jeevanjot Singh Chagger1, Manpreet Singh2 and Paramjit Kaur 3
1,2Sant Baba Bhag Singh University, Jalandhar
3Rayat Bahra Institute of Engineering and Nanotechnology, Hoshiarpur
8.1.1 Data Blending in Tableau:
Key Features of Data Blending:
i. Non-destructive: The original datasets remain unchanged.
ii. Dynamic Integration: Data sources are linked in real-time.
iii. Maintains Granularity: Blending works even when datasets have different levels of detail.
iv. Flexible: Works with structured (databases, Excel) and unstructured (APIs, logs) data.
How Data Blending Works:
Data blending refers to the real-time combination of data from multiple sources within a visualization. It differs from data joining, where datasets are permanently combined: blending does not merge the sources physically but links them at the visualization level on a shared key.
1. Understanding the Process of Data Blending:
Step 1: Determine Primary and Secondary Sources:
Primary Data Source: The main data source used in the visualization.
Secondary Data Source: An additional dataset that supplements the primary source.
The primary source determines the structure of the visualization, and the secondary source contributes extra information.
For example:
i. Primary Source: Sales transactions from a firm's database.
ii. Secondary Source: Google Analytics website traffic data.
iii. Common Key: Date.
Step 2: Identify the Common Key (Blending Key):
There must be a common key to combine data appropriately. The common key should be present in
both data sets. Common keys can be:
i. Date → Used to compare daily visits to the website with daily sales.
ii. Customer ID → Utilized to link customer buying with customer demographics.
iii. Product Code → Used to reconcile sales amounts with quantities available.
The common key ensures that data points are correctly matched between the two datasets.
Step 3: Import the Datasets into a Visualization Platform:
Blending data is performed in data visualization tools such as Tableau, Power BI, Looker, and Google
Data Studio. These tools allow users to:
i. Import multiple data sources into a project.
ii. Designate a primary and a secondary data source.
iii. Define the blending key that links the datasets.
Example in Tableau:
i. Load Sales data as the initial source.
ii. Load Website Traffic Data as the Secondary Source.
iii. Make "Date" the key for blending.
iv. Tableau automatically links matching records.
Step 4: Matching and Aggregating Data:
As data blending is dynamic, certain rules hold:
i. The primary source powers the visualization, i.e., only information available in the primary
source will be shown.
ii. The secondary source is aggregated based on the common key.
iii. Aggregations like SUM, AVERAGE, and COUNT are used in the event of multiple matches in
the secondary source.
Example of Aggregation During Blending:
Date Sales Revenue (Primary Source) Website Visits (Secondary Source)
01-Mar $5,000 1,000 visitors
02-Mar $3,200 800 visitors
03-Mar $4,500 950 visitors
Here, website traffic data is aggregated by date before blending with sales revenue.
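A rough pandas equivalent of this step might look as follows; the two small tables mirror the illustrative example above, and the secondary source is aggregated on the shared Date key before being combined with the primary source:

```python
import pandas as pd

# Primary source: daily sales revenue
sales = pd.DataFrame({
    "Date":    ["01-Mar", "02-Mar", "03-Mar"],
    "Revenue": [5000, 3200, 4500],
})

# Secondary source: raw website visits, possibly several rows per day
visits = pd.DataFrame({
    "Date":   ["01-Mar", "01-Mar", "02-Mar", "03-Mar", "03-Mar"],
    "Visits": [600, 400, 800, 500, 450],
})

# Aggregate the secondary source on the blending key first
visits_per_day = visits.groupby("Date", as_index=False)["Visits"].sum()

# Left-merge so only dates present in the primary source drive the result
blended = sales.merge(visits_per_day, on="Date", how="left")
print(blended)
```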
Step 5: Visualization of the Blended Data:
Once the datasets are combined, they are utilized to build charts, dashboards, and reports.
For instance:
i. Line Chart: Sales revenue vs website visits over time.
ii. Scatter Plot: A review of how web traffic relates to sales.
iii. Bar Chart: Illustrating average sales revenue per visitor interaction.
Visualization software does blend on the fly, so the data is separate but shown together.
Example Use Cases for Data Blending:
Data blending is most useful when the relationship between the datasets is not ideally structured for joining. In Tableau, data blending employs a primary-secondary model where:
Primary Data Source: The main dataset that forms the basis of the visualization.
Secondary Data Source: The additional dataset that is linked to the primary source via a shared field (e.g.,
Customer ID, Date, Product Name).
i. Let's consider a scenario in which two different datasets are stored in separate sources:
Table 1: Sales Data (Primary Data Source- MySQL Database)
Order Id Date Product Sales Country
001 March 1 Earphones $50 UK
002 March 2 Phone $300 Canada
003 March 3 Laptop $150 USA
004 March 4 Speakers $200 UK
The Customer Data taken from the Excel file is used as the secondary data source. With both data sources connected, Tableau looks for a common field between them to establish a potential relationship. Here, that common field is "Country," which exists in both Sales Data (MySQL) and Customer Data (Excel). Tableau recognizes this relationship and marks the common field with a link icon in the Data pane, confirming that a linking field for blending the two sources has been identified.
Now that the relationship has been established, users may begin to develop their visualization. It is very
straightforward. Users drop fields from their primary datasource (Sales Data) into the Tableau worksheet
in order to generate charts, tabulations, or any other visual forms. After the main dataset is in place, users
may proceed to add fields from the secondary source, and Tableau will carry out the blending of
dependent data automatically based on the common country element. This tight integration ensures that
the data from different sources meaningfully combine so that metrics can be analyzed across the datasets.
Data blending is highly applicable in Tableau, especially when dealing with data stored in separate databases, working at different levels of granularity, or making cross-source comparisons. This feature therefore enables users to derive key insights more easily and make data-driven decisions.
Output: Blended data (After data blending in Tableau)
Country Total sales Customer Segment
UK $250 Consumer
Canada $300 Small business
Italy - Consumer
USA $150 Enterprise
Observations:
a. Italy has no sales value in the blended output because it does not occur in the sales data (the primary source).
b. Aggregation precedes blending; thus the sales are summed by country.
c. Two sets of data from different sources (SQL + Excel) are combined seamlessly without database-level joining.
ii. Marketing Analytics: Comparing Facebook ad impressions with e-commerce sales.
iii. Retail Analysis: Merging website traffic with in-store purchase data.
iv. Finance & Banking: Blending stock market data with external financial news.
v. Healthcare Analytics: Integrating patient medical records with wearable device data.
Limitations of Data Blending:
a. Performance Issues: Bringing together large datasets can be slow for dashboards.
b. Requires Shared Key: Blending is impossible without a shared key.
c. Limited to Visualization Tools: Blending is temporary and cannot be utilized outside the realm
of visualization.
d. No Full Outer Join Support: Unlike database joins, blending only returns records that match the primary table on the common key.
8.1.2 Data Joining in Tableau:
Combining datasets is the process of bringing together data from numerous different sources into one dataset for analysis and charting. It is useful for companies dealing with data of varied sources, shapes, and formats that nonetheless must be examined as a whole.
In Tableau, data joining is a crucial step that allows you to combine data from different tables using a
common field (key column). It helps to combine different data sources, improve databases, and create
valuable insights from a variety of data points. Joining is especially useful when dealing with relational
databases, spreadsheets, or various data tables containing related information separately.
Let's take the following example:
Orders Table (Order ID, Customer ID, Order Date, Sales)
Customer Table (Customer ID, Customer Name, Region), respectively.
Joining these tables based on Customer ID, we are able to develop a full dataset with order details and
customer details.
Why Combine Datasets?
Organizations frequently hold discrete data in assorted sources such as databases, spreadsheets, APIs, and cloud storage. Integrating these data collections helps with:
i. Holistic Analysis: Combining customer purchases with site visits for greater insight.
ii. Comparative Studies: Comparing company sales against market trends.
iii. Breaking Down Data Silos: Integrating fragmented data for informed decision-making.
iv. Enhanced Reporting: Creating standardized reports with multiple sources of data.
Methods of Combining Datasets:
Merging data sets is necessary for creating comprehensive insights from several data sets. The technique
you use depends on the data structure, the data sets' relationship, and the analysis type you wish to
perform.
The four primary methods by which datasets are combined are as follows:
1. Appending (Union): Stacking similar datasets together
Appending (also referred to as union) merges datasets by placing rows one above the other when both
datasets share the same structure (column names and data types).
Use Case:
i. Combining sales information across different stores or time periods.
ii. Consolidating customer lists from multiple sources.
Example:
Dataset 1 (North Region Sales)
Date Sales Region
Jan-01 500 North
Jan-02 600 North
Dataset 2 (South Region Sales)
Date Sales Region
Jan-01 450 South
Jan-02 550 South
Final Combined Table (After Appending)
Date Sales Region
Jan-01 500 North
Jan-02 600 North
Jan-01 450 South
Jan-02 550 South
Key Considerations:
i. Column names and types must match in both datasets.
ii. If columns differ, tools like Power BI, Tableau, or SQL allow column alignment.
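In pandas, the same append/union operation is a row-wise concatenation. The sketch below reuses the assumed North and South regional tables from the example above:

```python
import pandas as pd

north = pd.DataFrame({
    "Date":   ["Jan-01", "Jan-02"],
    "Sales":  [500, 600],
    "Region": ["North", "North"],
})

south = pd.DataFrame({
    "Date":   ["Jan-01", "Jan-02"],
    "Sales":  [450, 550],
    "Region": ["South", "South"],
})

# Stack the rows; column names and types must line up in both tables
combined = pd.concat([north, south], ignore_index=True)
print(combined)
```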
2. Joining (Merge): Combining datasets based on shared keys:
Joining combines data horizontally by matching rows on a common key (e.g., Customer ID, Date,
Product Code).
Use Case:
i. Blending customer information with transaction history.
ii. Connecting employees to departments
Example:
Dataset 1: Customer Information
Customer ID Name Country
101 John USA
102 Alice UK
Dataset 2: Order Details
Customer ID Order ID Order Amount
101 A001 $500
102 A002 $400
Final Combined Table (After Joining on Customer ID)
Customer ID Name Country Order ID Order Amount
101 John USA A001 $500
102 Alice UK A002 $400
8.1.2.1 Types of Joins:
Joins in Tableau enable the merging of multiple tables based on a shared column. This is necessary when
dealing with relational databases or structured data that has related data split across several tables.
Tableau has four join types as shown in the fig 8.2 that define how two tables' records are merged.
1. Inner Join:
An Inner Join will only return records that share common values in both tables. It will never return
any data that lacks a common value in the second table.
How It Works:
i. If Table A is customer orders and Table B is customer information, an Inner Join will give us
only customers who have orders.
ii. Any customers present in Table B but without orders in Table A will be excluded.
Example:
Table A (Orders):
Order ID Customer ID Order Amount
101 C1 $500
102 C2 $300
103 C3 $700
Table B (Customers):
Customer ID Customer Name Country
C1 John Doe USA
C2 Alice Smith Canada
C4 Robert King UK
Inner Join Output (Only Matching Records):
Order ID Customer ID Order Amount Customer Name Country
101 C1 $500 John Doe USA
102 C2 $300 Alice Smith Canada
Note: Customer C4 is missing from the result because there is no matching order in Table A. Similarly,
Order 103 (C3) is missing because there is no matching customer in Table B.
2. Left Join:
A Left Join is used to fetch all records from the left table (Table A) and only the corresponding records
from the right table (Table B). In cases where there is no match, NULL values will be returned for the
columns coming from the right table.
Mechanism of Functioning:
If Table A has all the sales orders and Table B has customer details, then there will be a Left Join where
all the orders are there, even if there is incomplete customer data.
Example:
Table A (Orders):
Order ID Customer ID Order Amount
101 C1 $500
102 C2 $300
103 C3 $700
Table B (Customers):
Customer ID Customer Name Country
C1 John Doe USA
C2 Alice Smith Canada
C4 Robert King UK
Left Join Output (All Records from Table A, Matched from Table B):
Order ID Customer ID Order Amount Customer Name Country
101 C1 $500 John Doe USA
102 C2 $300 Alice Smith Canada
103 C3 $700 NULL NULL
Note: Customer C3 is included in the result, even though it does not exist in Table B. Tableau fills the
missing values with NULL.
3. Right Join:
A Right Join brings back all the records from the right table (Table B) and only those records from the left
table (Table A) that correspond. Where there is no corresponding record, Tableau puts NULL values in
the columns from the left table.
Mechanism of Action: In the case where Table A contains sales orders and Table B contains customers, a Right Join ensures that all customers appear in the result, even those without matching orders.
Example:
Table A (Orders):
Order ID Customer ID Order Amount
101 C1 $500
102 C2 $300
103 C3 $700
Table B (Customers):
Customer ID Customer Name Country
C1 John Doe USA
C2 Alice Smith Canada
C4 Robert King UK
Right Join Output (All Records from Table B, Matched from Table A):
Order ID Customer ID Order Amount Customer Name Country
101 C1 $500 John Doe USA
102 C2 $300 Alice Smith Canada
NULL C4 NULL Robert King UK
Note: Customer C4 appears in the result, even though they have no orders in Table A. The missing values
are represented as NULL.
4. Full Outer Join:
Full Outer Join returns all of the records from the two tables, with or without matching. NULL values are
provided if there is no match for a record in the other table.
How It Works:
If Table A contains orders and Table B contains customers, a Full Outer Join returns all orders and all customers, with non-matching entries filled with NULLs.
Example:
Table A (Orders):
Order ID Customer ID Order Amount
101 C1 $500
102 C2 $300
103 C3 $700
Table B (Customers):
Customer ID Customer Name Country
C1 John Doe USA
C2 Alice Smith Canada
C4 Robert King UK
Full Outer Join Output (All Records from Both Tables):
Order ID Customer ID Order Amount Customer Name Country
101 C1 $500 John Doe USA
102 C2 $300 Alice Smith Canada
103 C3 $700 NULL NULL
NULL C4 NULL Robert King UK
Note: Both unmatched orders and customers appear in the result with NULL values.
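Outside Tableau, the four join types map directly onto the `how` argument of a pandas merge. The sketch below reuses the Orders and Customers tables from the examples above:

```python
import pandas as pd

orders = pd.DataFrame({
    "Order ID":     [101, 102, 103],
    "Customer ID":  ["C1", "C2", "C3"],
    "Order Amount": [500, 300, 700],
})

customers = pd.DataFrame({
    "Customer ID":   ["C1", "C2", "C4"],
    "Customer Name": ["John Doe", "Alice Smith", "Robert King"],
    "Country":       ["USA", "Canada", "UK"],
})

# how="inner" keeps only matching keys; "left", "right", and "outer"
# correspond to Left, Right, and Full Outer joins respectively.
for how in ["inner", "left", "right", "outer"]:
    print(f"\n{how.upper()} JOIN")
    print(orders.merge(customers, on="Customer ID", how=how))
```

Unmatched cells appear as NaN in pandas, playing the role of the NULL values shown in the tables above.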
To Conduct Data Joins in Tableau:
1. Connect to Your Data Source: Open Tableau and connect to the database or spreadsheet of your
choice.
2. Add the Tables to the Workspace: Drag the first table into the canvas, then place the second table next to it.
3. Specify the conditions for joining: Based on shared fields, Tableau will automatically suggest a
join condition. If required, manually modify the Join Key (such as Customer ID).
4. Select the Join Type: Pick from Inner, Left, Right, or Full Outer Join according to your needs.
5. Preview the Joined Data: Check to ensure the join is functioning as expected. Investigate any
NULL values, missing entries, or duplicates if present.
3. Blending: Dynamically combining datasets from different sources:
Blending integrates data dynamically within the visualization tool instead of merging it physically.
Use Case:
i. Combining real-time marketing metrics (Google Analytics) and in-house sales reports
ii. Integration of diverse database sources without changing existing data
Example:
A business analyst would like to examine the impact of Google Ads expenditure on sales.
i. Ad Data (Google Analytics): Contains Clicks, Impressions, and Cost per Click.
ii. Sales Data (SQL Database): Order Amount and Customer Information.
iii. Shared Key: Date
Data Blending allows the two sources to remain separate but see them together.
Key Considerations:
i. Primary dataset determines the layout of visualization.
ii. The secondary dataset is pre-aggregated before blending.
iii. Only matching data from the primary source is used.
4. Aggregation & Transformation: Granularity change before joining:
Certain data sets need to be aggregated or rescaled before merging because they have differing levels
of granularity (e.g., monthly vs. daily sales).
Use Case:
i. Converting the transaction data into monthly aggregations before the merge.
ii. Data cleaning and standardization for greater accuracy.
Example of Aggregation Before Merging:
Daily Sales Data (Before Aggregation)
Date Sales
Jan-01 100
Jan-02 200
Jan-03 150
After Aggregating to Monthly Sales:
Month Total Sales
Jan 450
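This granularity change is a simple group-by in code. The sketch below, assuming the daily table shown above, rolls daily sales up to monthly totals before any merge:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "Sales": [100, 200, 150],
})

# Roll the daily rows up to one row per month before joining with monthly data
monthly = (
    daily
    .groupby(daily["Date"].dt.to_period("M"))["Sales"]
    .sum()
    .rename("Total Sales")
    .reset_index()
)
print(monthly)
```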
Fig. 8.5: Bar graph representing relation of number of patients and days of week
1. Education and E-Learning Analytics:
Educational institutions make use of data visualization to track levels of engagement, enrollment rates,
and student achievement. It helps in:
• Monitoring the progress of students.
• Identifying areas in which students struggle.
• Measuring teacher effectiveness.
Frequent Educational Visualizations:
1. Student Performance Dashboard: Bar charts display grades for various subjects.
2. Trends in Enrollment Over Time: Admissions over the year are represented by line graphs.
3. Drop-Out Rate Analysis: Pie charts represent the reason for dropouts.
Example Use Case: A university uses Tableau to analyze student performance by:
• Making a bar graph representing passed and failed students in various subjects as shown in fig 8.6.
• Making a line graph to compare attendance trends.
• Utilizing a scatter plot to find associations between study hours and exam results.
Chapter 9
STORYTELLING WITH DATA VISUALIZATION
Navdeep Kaur1, Navjot Kaur Basra2 and Sumit Chopra3
1,2,3GNA University, Phagwara
Data visualization is more about storytelling than about graphs and charts. The ability to transform dense data into short, compelling, and actionable insights is a valuable skill. This chapter covers the principles of good data storytelling, followed by case studies on applying visualization techniques in different domains, as shown in Fig. 9.1.
Ends with something actionable or suggestive.
A strong narrative helps make the data more relatable and ensures the audience can connect with the message.
9.1.3 Principles for Effective Data Storytelling: Design principles for effective data visualization are given in Fig. 9.3.
Visualization techniques are chosen by industries based on the nature of the data they have and the purpose it serves. The following are key areas where visualization plays a significant role, along with suitable methods and examples.
1. Business & Finance:
Companies and financial institutions depend on data visualization to track performance, monitor financial trends, and improve decision-making. Executives, investors, and analysts require transparent, real-time insights to evaluate business health and market dynamics.
Key Visualization methods:
i. Line Charts: Illustrate trends in expenses, revenues, or stock prices over time.
ii. Bar Charts: Compare financial performance across periods or business segments.
iii. Pie Charts: Show the percentage of revenue contributed by different products or services.
iv. Heatmaps: Identify the most profitable and underperforming areas or business segments.
v. Dashboard Reports: Combine several visualizations for an overall business overview.
Strong visualizations assist policymakers, scientists, and healthcare workers in spotting trends and in the
effective distribution of resources.
Major Visualization Techniques:
Example:
A digital marketing firm examines website heatmaps to see where visitors click most, thereby enhancing
webpage design and user experience.
4. Social & Political Sciences:
Social scientists and political analysts use data visualization to study public sentiment, voting behaviour,
and demographic changes. These insights are crucial for policy-making, electoral strategies, and media
analysis.
Key Visualization Methods:
iii. Time-Series Graphs: Examine system behaviour over time (e.g., sensor values within an IoT network).
iv. 3D Surface Plots: Visualize intricate physical or chemical processes.
v. Histograms: Visualize frequency distributions of experimental data.
from IMDb, and an understanding of how different directors and languages shape the content on the streaming service. The dashboard is intended to be user-friendly, dynamic, and visually appealing.
Data visualization plays a vital role in contemporary digital interfaces, facilitating users' interaction with and exploration of large volumes of data. Netflix, a leading international streaming platform, utilizes interactive data visualization methods to optimize user experience and data discovery. This chapter discusses Netflix's real-time capabilities, mechanisms of user interaction, and visual components, highlighting in Fig. 9.8 how these drive an intuitive and interactive data-centric interface.
Real-time features and user interactions:
3. Real-Time Data Updates:
Netflix constantly updates and refreshes its information in real-time to keep it accurate and current. This
feature is especially valuable for:
i. Trending content: The most recent popular television shows and films are updated according to
viewership metrics.
ii. Genre rankings: The ranking of genres is automatically modified by the system based on recent
viewing patterns.
iii. IMDb ratings: Any fluctuation in IMDb ratings is updated dynamically within the visualization.
Real-time updates guarantee that the latest information is always accessible, promoting users' and
analysts' decision-making.
4. Drill-Down Capability for In-Depth Exploration:
Netflix also features drill-down functionality, enabling users to click on visual components to obtain additional detail. This functionality is applied as shown in Fig. 9.10:
i. The number of movies and series released each year to monitor the growth of the industry.
ii. Filtering options so that users can analyze trends in terms of:
• Director: Display content created by directors.
• Genre: Evaluate which genres are on the rise or decline.
• Language: Examine content availability in various languages.
• IMDb Rating: Review the development of highly rated content.
Chapter 10
CASE STUDIES ON REAL-WORLD DATA VISUALIZATION
Jasmeet Kaur1, Babita Sidhu2 and Jaskiran Ghotra3
1GNA University, Phagwara
2LKCTC, Jalandhar
3Guru Nanak institute of engineering & Management, Naushehra
10.1 Case Study 1: Google Flu Trends (Public Health Data Visualization):
Google Flu Trends (GFT) was a comprehensive public health visualization project initiated by Google in
2008 with the goal of monitoring and predicting flu epidemics in real time. The system used flu-related Google search terms as a barometer of flu prevalence across geographic locations. By analyzing millions of global search queries, Google attempted to deliver early indications of flu activity, complementing methods applied by public health authorities such as the Centers for Disease Control and Prevention (CDC). GFT's most distinctive visualization was an interactive heat map in which regions with dense flu-related search activity were shown in darker color intensity, pointing toward impending flu activity, as shown in Fig. 10.1.
Nevertheless, with all its innovative spirit, the project did encounter some daunting challenges. By 2013,
studies revealed that GFT systematically overestimated flu activity, primarily because of variations in
search patterns shaped by media coverage, seasonal factors, and algorithmic biases. The absence of
integration with clinical data sources further compromised its predictive power. Due to this, Google shut
down the project in 2015, focusing on collaborative models with well-established public health
institutions.
For all its flaws, Google Flu Trends is an early trailblazer of public health visualization and the power of
big data analysis for disease surveillance. It illustrated how real-time digital trails were able to improve
epidemic surveillance, and how future developments in AI-based health forecasting could be considered
next. The case also underscores the need for data validation, interdisciplinary collaboration, and careful
interpretation of big data trends to ensure that predictive models are not only visually appealing but also
scientifically sound [41].
10.2 Case Study 2: New York Times – Election Data Visualization:
Election data visualization is important in the improvement of public knowledge of electoral results, voter
patterns, and political environments. The New York Times (NYT) is well known for its creative and
interactive election data visualizations that offer real-time information, comparative historical data, and
analytical insights on presidential, congressional, and local elections. The NYT method of visualizing
election data combines statistical precision, engaging narration, and interactive usability to provide
sophisticated electoral information broadly and engagingly [42].
Interactive Maps and Geospatial Representation:
One of the most characteristic aspects of NYT's election visualizations is its interactive geospatial maps,
which present election results at various levels, including:
• State level (for presidential and gubernatorial elections)
• County level (for detailed voting patterns)
• District level (for congressional elections)
These are color-coded, usually blue for the Democrats and red for Republicans, to mirror real-time vote
totals as shown in Fig 10.2. Interactive features enable the viewer to move their cursor over or click on an
area to access detailed figures, including overall votes, percentage of margin, and past voting behaviour.
This helps to better present election trends and allows the reader to make comparisons between current
and previous elections.
One of the strongest aspects of NYT's geospatial visualizations is their capacity to display electoral
changes across time. An example is that users can look at how particular counties or states have
politically changed from the past elections to the present, giving indications of swing states as well as
changes in voting behavior.
Real-Time Vote Tracking and Uncertainty Modeling:
Throughout elections, the NYT regularly refreshes its visualizations to match the most up-to-date vote
tallies and projections. Vote progress bars that indicate the share of ballots counted and live updates on
projected victors are included in their system. One innovation of their election coverage is the "election
needle", a live forecasting gauge that indicates predictions using arriving vote counts.
The election needle moves dynamically from one political party to another with the arrival of new data,
providing users with an estimate in real time of which candidate has the best chance of winning. But this
functionality has also caused controversy, since sharp fluctuations in predictions can be confusing or
create unnecessary anxiety among viewers. Nonetheless, the needle is a valuable illustration of how
statistical uncertainty can be graphically illustrated in election prediction.
Historical Comparisons and Trend Analysis:
In addition to real-time data, the NYT provides historical context by allowing users to compare election
results from different years. Their platform includes side-by-side electoral maps, showing how party
control has changed over multiple election cycles. They also offer trend lines and bar charts that illustrate
shifts in voter preferences based on factors like:
i. Demographics (race, gender, age, and education).
ii. Urban vs. rural voting patterns.
iii. Key policy issues influencing voter decisions.
This longitudinal analysis helps users understand broader political realignments, such as the increasing
suburban support for Democrats or the Republican dominance in rural areas. By providing historical
comparisons, the NYT allows readers to see elections not as isolated events but as part of a broader
political evolution.
User Engagement and Customization:
To make election data more accessible, the NYT offers customization features that allow users to tailor
their analysis based on specific interests. Readers can filter results by state, county, or district, explore
presidential, Senate, and House races separately, and use search functionality to find specific races or
candidates.
This interactive experience ensures that the visualization serves both casual readers and political analysts,
providing different levels of detail depending on the user's needs. The ability to drill down into granular
data enables deeper insights into local election dynamics, which can often be overshadowed by national
trends.
Chapter 11
FUTURE TRENDS IN DATA VISUALIZATION
Manpreet Kaur1, Simran2 and Tarun Bhalla3
1,2GNA University, Phagwara
3Anand College of Engineering and Management, Kapurthala
Data visualization has changed dramatically over the past few years, fueled by technological
advancements, artificial intelligence, and growing data complexity. As companies continue to produce
enormous volumes of data, more advanced, interactive, and real-time visualization methods have become
a necessity. This chapter discusses future trends in data visualization, with emphasis on the use of AI,
interactive and real-time visualization, and best practices for data scientists and analysts.
11.1 The Role of AI in Data Visualization:
Artificial intelligence (AI) is revolutionizing data visualization by streamlining processes, improving
insights, and making data more accessible. The use of AI in data visualization offers several major
advancements:
1. Automated Data Analysis and Storytelling:
Visualization tools that are powered by artificial intelligence (AI) can examine datasets automatically,
recognize patterns, and produce insightful results. These tools facilitate the interpretation of intricate data
by delivering transparent visualizations without the need for extensive technical knowledge.
• Dashboards that are AI-powered offer relevant visualizations based on attributes of datasets.
• Natural Language Processing (NLP) allows people to pose questions in English and be provided
with instantaneous visual insight.
2. Predictive and Prescriptive Analytics:
AI aids in visualization through the use of predictive analytics, which predicts trends based on past
information. Further, prescriptive analytics provides a recommendation for action using real-time data.
• Machine learning algorithms model future possibilities for visualization to inform decision-
making.
• Anomaly detection through AI brings attention to aberrant trends or outliers without any
intervention.
3. Augmented Analytics:
Augmented analytics enabled by AI automates data preparation, insight generation, and explanation,
allowing organizations to make quicker and better-informed decisions.
• AI identifies causations and correlations in big data.
• Smart data summarization reduces big amounts of data into simple-to-consume visuals.
4. Personalized and Adaptive Visualizations:
AI enables dynamic, user-specific visualizations that adapt to user behaviour and preferences.
• AI-powered recommendation engines recommend optimal visual forms to various users.
• Adaptive dashboards alter their layout depending on the user's requirements and areas of interest.
• Add filtering, sorting, and drill-down capabilities to help users zoom into certain details.
• Utilize hover tips to give people more contextual information without messing up the main view.
• Provide customizability options so people can customize dashboards according to their
requirements.
4. Optimizing for Mobile and Cross-Platform Compatibility:
Data visualizations must be accessible on various devices and screen sizes.
• Make dashboards and reports responsive and perform well on mobile devices.
• Select lightweight visualization libraries that are compatible with multiple screen resolutions and
operating systems.
• Minimize loading times to avoid data rendering delays.
5. Incorporating AI-Powered Insights:
AI-powered analytics can enrich data visualization by delivering richer insights and predictive analytics.
• Use machine learning models to identify patterns and trends.
• Implement automated alerts to inform users of anomalies or substantial data changes.
• Utilize AI-driven recommendations to provide optimal visualization methods for datasets.
6. Promoting Ethical and Inclusive Data Visualization:
Ethics have a crucial part in ensuring that data visualizations are unbiased and inclusive for all users.
• Steer clear of biased representations that might mislead the audience or misinterpret results.
• Utilize colorblind-friendly palettes to make the visualizations accessible for visually impaired
users.
• Offer alternative descriptions of visual information to support differently abled users.
• Be transparent by explicitly listing assumptions, sources of data, and any limitations.
11.4 Conclusion:
The future of data visualization is on the cusp of radical transformation by AI, real-time processing, and
immersive technology. As businesses and sectors keep adopting sophisticated visualization methods, the
work of data analysts and scientists will become even more vital in ensuring accuracy, reliability, and
ethical usage of data.
By embracing AI-driven automation, interactive narration, and adaptive visual composition, professionals
can craft more informative and interactive data visualizations. Further, as ethical and accessibility
considerations take centre stage, the emphasis must be placed on crafting inclusive and transparent
visualizations that reach various audiences.
Finally, the secret to the future of data visualization is balancing technology with human insight. With the
help of cutting-edge tools and best practices, data professionals can ensure that visualizations serve as effective instruments for decision-making, communication, and knowledge discovery in domains ranging from business to science.
Chapter 12
IMAGE ENHANCEMENT TECHNIQUES IN DIGITAL IMAGE PROCESSING
Bhoomi Gupta1, Gagandeep Singh2 and Sumit Chopra3
1,2,3GNA University, Phagwara
12.1 Introduction:
Digital Image Processing (DIP) is now an essential tool for contemporary technological and scientific
endeavours. A wide range of methods is employed to carry out operations on digital images, with the
overall aim of improving image quality or extracting useful information as shown in Fig.12.1. In the
majority of fields, including medical diagnostics, aerospace, industrial inspection, surveillance, and remote sensing, the use of DIP is not only central but also evolving rapidly with advances in computational capability and algorithms [43].
Of all the DIP operations, image improvement is of fundamental importance. It is the operation of
transforming an image in such a way that it is suitable for a specific application or to improve its visual
impact. The improvement techniques are not intended to add new information to the image, but rather to
render existing information accessible to both human observers and machines.
Images acquired from real-world sources, such as scanners, medical imaging devices, or satellites, are prone to different degradations. They might be too
faint, defocused, noisy, non-homogeneous in contrast, or sensed in poor capture conditions. Those
degradations can hide crucial information and have an effect on human and machine perception of an
image [51].
In order to correct these issues, image enhancement techniques are applied to alter or enhance the image
in a way that it becomes more suitable for visual examination or further computing processing.
Enhancement objectives can be grouped into three broad categories:
i. Improving Human Visual Perception: One of the primary goals of image enhancement is to
produce an image that is more perceptually informative or more visually pleasing to the human
eye. The human eye is naturally sensitive to contrast, brightness, edges, and color distribution. The
enhancement algorithms try to emphasize these perceptual features so that observers can better
interpret or understand the image contents as given in Fig.12.2 [42].
Lastly, the goal is to provide an image that promotes intuitive comprehension, correct judgment,
and improved visual attractiveness.
ii. Facilitating Automated Analysis: As computer vision systems and AI algorithms increasingly
become part of image analysis, images need to be improved in ways that maximize their fitness
for automated analysis. Machines evaluate images on the basis of numerical features, including
pixel intensities, gradients, textures, and shapes, rather than subjective visual attractiveness [42].
Enhancement in this aspect entails:
• Noise reduction to prevent false alarms.
• Improving edge sharpness to enable object detection or segmentation.
• Improving feature perception to enable pattern recognition or clustering.
Example Applications:
• In industrial automation, sophisticated imaging enables the detection of micro-cracks,
misalignment, or contamination on products on a production line.
• Deblurring and contrast enhancement improve vision systems in autonomous vehicles to enable
them to recognize road signs, lane markings, or pedestrians better.
• In agriculture, improved drone images are utilized to estimate the health of crops or detect pest-
infested regions through spectral imaging.
By improving the clarity and coherence of visual information, image enhancement enhances the accuracy,
efficiency, and reliability of machine-dependent tasks.
iii. Highlighting Specific Features: In certain uses, one is not attempting to enhance the entire
image as a composite but to enhance selectively certain individual features which are of greatest
usefulness in the current application. These can be:
• Edge enhancement to make edges and outlines more prominent.
• Sharpening texture for enhancing surface contrasts.
• Region-of-interest (ROI) brightening to bring out a particular region of interest in the
image and decrease the background.
Such targeted improvement is particularly valuable in circumstances where:
• Small details (e.g., hairline fractures on X-rays or tiny imperfections on a circuit board) are worth
examining.
• Some regions possess diagnostic or decision-critical information that must be isolated and enhanced to facilitate accurate assessment.
Methods commonly employed for feature-specific enhancement are:
i. Unsharp masking: Unsharp masking is a common and old image improvement algorithm that is
employed to render an image sharper by enhancing its edges. Interestingly, it renders the output
image sharper than the original one, and its "unsharp" designation is due to the analog-
photographic origin of this technique, given in Fig.12.3 [44].
How it Works:
• A convolution kernel (mask) is applied to the image where the center pixel has a large positive
value, and the surrounding pixels have negative or zero values.
• This emphasizes regions with large intensity differences, effectively sharpening the image.
Example Kernel:
$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$$
Effect:
• Enhances borders and edges
• Suppresses smooth, uniform areas (little response where intensity is constant)
• Can increase noise if not used carefully
Applications:
• Edge detection
• Text recognition in documents
• Feature enhancement in satellite or microscopic images
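As a rough illustration of the kernel-based sharpening described above, the sketch below convolves an image with the example kernel and adds the response back to the original; it assumes OpenCV and NumPy are available, and the file names and the 0.5 blend weight are placeholders.

```python
import cv2
import numpy as np

# A minimal sketch of kernel-based edge enhancement using the example
# kernel from the text. File names and the blend weight are placeholders.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# The convolution responds strongly where intensity changes sharply
# (edges) and gives a near-zero response in smooth regions.
edge_response = cv2.filter2D(img, cv2.CV_32F, kernel)

# Adding a fraction of the edge response back to the original accentuates
# borders, which is the basic effect described above.
sharpened = np.clip(img.astype(np.float32) + 0.5 * edge_response, 0, 255).astype(np.uint8)
cv2.imwrite("sharpened.png", sharpened)
```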
iii. Gradient-based methods: Gradient-based improvement techniques are concerned with edge
detection by calculation of the first-order derivatives (gradients) of the image intensity function.
The magnitude of the gradient represents the strength of an edge, whereas the direction of the
gradient indicates the orientation of the edge, as shown in Fig.12.5.
Gradient direction: $\theta(x, y) = \tan^{-1}\!\left(\dfrac{\partial I/\partial y}{\partial I/\partial x}\right)$
Popular operators used to approximate the gradient:
• Sobel Operator: Sobel operator, or Sobel–Feldman operator or Sobel filter in certain cases, is
applied in computer vision and image processing, especially in edge detection algorithms where it
generates an image with the edges highlighted. The Sobel operator is a discrete differentiation
operator that computes an approximation of the gradient of the image intensity function. It is both
differentiation and smoothing, and therefore more resistant to noise than the more primitive
operators.
The Sobel–Feldman operator relies on convolving the image with a small, separable, and integer-
valued filter in the horizontal and vertical directions and is thus extremely cheap computationally.
The gradient approximation it generates is, however, extremely coarse, especially for high-
frequency changes in the image.
Formulation: The operator uses two 3×3 kernels that are convolved with the input image to
calculate approximations to the derivatives – a horizontal and a vertical. If A is the input image,
and Gx and Gy are two images which at each point have the horizontal and vertical derivative
approximations respectively, the calculations are as follows:

$$G_x = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * A \qquad\qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A$$

where $*$ denotes the 2-D convolution operation. The x-coordinate is defined here as increasing in the "right" direction, and the y-coordinate is defined as increasing in the "down" direction. At each point in the image, the resulting gradient approximations can be combined to give the gradient magnitude, using Pythagorean addition:

$$G = \sqrt{G_x^2 + G_y^2}$$

The gradient's direction can also be estimated as $\Theta = \operatorname{atan2}(G_y, G_x)$, where, for example, $\Theta$ is 0 for a vertical edge, which is lighter on the right side.
Example:
The output of the Sobel–Feldman operator is a 2D gradient map at each point. It can be processed and
displayed as if it were an image, with the areas of high gradient (the probable edges) as white lines. The
next images demonstrate this by displaying the Sobel–Feldman operator calculation on a basic image
given in Fig.12.6 [46].
Fig. 12.7: Grayscale image of a black circle with a white background and the direction of the Sobel operator's gradient, illustrating the change in the direction of the gradient on a grayscale circle
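A minimal sketch of the Sobel computation described above, assuming OpenCV and NumPy; the input file name is a placeholder.

```python
import cv2
import numpy as np

# Horizontal (Gx) and vertical (Gy) derivative approximations via the
# built-in 3x3 Sobel kernels; the input file name is a placeholder.
img = cv2.imread("circle.png", cv2.IMREAD_GRAYSCALE)
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Gradient magnitude (Pythagorean addition) and direction at each pixel.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
direction = np.arctan2(gy, gx)

# Scale the magnitude to 0-255 so probable edges appear as bright lines.
edge_map = np.uint8(255 * magnitude / magnitude.max())
cv2.imwrite("sobel_edges.png", edge_map)
```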
• Prewitt Operator: The Prewitt Operator is an old edge detection technique used in digital image
processing to detect edges by estimating the first-order spatial derivatives of an image.
Functionally, it seeks to highlight edges in the horizontal and vertical directions to look for steep
intensity changes, which will usually be object boundaries or features [47].
The Prewitt operator offers a less complex alternative to the Sobel operator and is valued for its
computational elegance and ease of implementation, especially in contexts where real-time
processing is crucial, as shown in Fig.12.8.
$G = \sqrt{G_x^2 + G_y^2}$
The gradient direction (θ) can also be calculated using:
$\theta = \tan^{-1}\!\left(\dfrac{G_y}{G_x}\right)$
This provides the orientation of the edge at each pixel.
Advantages:
1. Simple and Efficient: It uses small integer coefficients, which are computationally inexpensive.
2. Directional Sensitivity: It can detect both horizontal and vertical edges independently.
3. Useful for Real-Time Systems: Due to its low computational complexity, it is well suited to real-time systems.
Limitations:
1. Less accurate edge localization: Compared to Sobel, it may produce less defined edges.
2. No smoothing component: Unlike the Sobel operator, the Prewitt filter does not give extra
weight to the center row or column, making it more sensitive to noise.
3. Limited Diagonal Detection: It is less effective at detecting diagonal edges unless additional
filters are designed.
Applications:
1. Edge detection in document scanning: For locating lines, text boundaries and margins.
2. Computer Vision Systems: For detecting object boundaries and feature extraction in real-time.
3. Medical Image Processing: Identifying region contours or organ boundaries.
4. Traffic and Surveillance Cameras: Basic feature extraction under constrained resources.
Example:
Consider a 3×3 section of a grayscale image:
$$\begin{bmatrix} 100 & 100 & 100 \\ 150 & 150 & 150 \\ 200 & 200 & 200 \end{bmatrix}$$
Applying the Gy filter (vertical gradient) would yield a strong response because there is a
significant change in intensity from top to bottom (from 100 to 200), indicating a horizontal edge.
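To make the worked example concrete, the short NumPy sketch below applies a standard Prewitt Gy kernel (sign conventions may vary between texts) to the 3×3 patch at its centre pixel.

```python
import numpy as np

# The 3x3 grayscale patch from the example above.
patch = np.array([[100, 100, 100],
                  [150, 150, 150],
                  [200, 200, 200]], dtype=float)

# A standard Prewitt Gy kernel (responds to vertical intensity change,
# i.e. a horizontal edge); the sign convention may differ between texts.
gy_kernel = np.array([[-1, -1, -1],
                      [ 0,  0,  0],
                      [ 1,  1,  1]], dtype=float)

# Correlate the kernel with the patch at its centre pixel.
response = np.sum(patch * gy_kernel)
print(response)  # 3*200 - 3*100 = 300 -> strong horizontal-edge response
```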
• Roberts Cross Operator: Roberts Cross Operator is one of the oldest and most straightforward
edge detection digital image processing algorithms. It is utilized for the first-order derivative
approximation of the image intensity function gradient. The process is highly efficient for the
detection of diagonal direction (45°) edges and is also computationally simple due to its compact
kernel size as shown in Fig.12.9.
Created by Lawrence Roberts in 1963, this operator is still studied for its historical importance and
for its ability to detect edges at low computational expense [48].
Working Principle:
The Roberts Cross Operator does edge detection through the computation of the difference between
diagonally neighbouring pixel values within a 2×2 image neighbourhood. This small window gives the
operator very low overhead and is very fast, suitable for hardware-constrained systems or real-time
systems.
Two convolution masks (kernels) are employed to compute gradients in orthogonal diagonal directions:
i. Roberts Cross Kernels: Let I(x,y) be the intensity at pixel (x, y). The two gradient approximations
are computed using the following 2×2 masks:
Gradient in the x-direction (Gx):
$$G_x = \begin{bmatrix} +1 & 0 \\ 0 & -1 \end{bmatrix}$$
Gradient in the y-direction (Gy):
$$G_y = \begin{bmatrix} 0 & +1 \\ -1 & 0 \end{bmatrix}$$
Each kernel is applied by placing the top-left corner of the kernel on a pixel and computing the sum
of the products of the overlapping values. These kernels effectively compute the intensity
differences across the diagonals.
ii. Gradient Magnitude and Direction:
Once the two directional gradients Gx and Gy are calculated, the magnitude of the gradient at each
pixel is determined using:
$G = \sqrt{G_x^2 + G_y^2}$
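The kernels above can be applied directly with array slicing; the sketch below is a minimal NumPy version that computes the diagonal differences and the gradient magnitude, with random data standing in for a real image.

```python
import numpy as np

# A minimal sketch of the Roberts Cross operator using NumPy slicing.
def roberts_cross(img: np.ndarray) -> np.ndarray:
    img = img.astype(float)
    gx = img[:-1, :-1] - img[1:, 1:]   # kernel [[+1, 0], [0, -1]]
    gy = img[:-1, 1:] - img[1:, :-1]   # kernel [[0, +1], [-1, 0]]
    return np.sqrt(gx ** 2 + gy ** 2)  # gradient magnitude per 2x2 window

edges = roberts_cross(np.random.rand(64, 64) * 255)  # random stand-in image
```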
• Gaussian Filter: This filter uses a kernel with values derived from the Gaussian function. It
assigns higher weights to central pixels and progressively lower weights to distant ones. This
approach produces a smoother, more natural blurring effect and is more effective at preserving
edge information compared to the average filter as shown in Fig.12.14.
• Adaptive Filter: Unlike fixed-kernel filters, adaptive filters adjust their behaviour based on local
image characteristics. They apply stronger smoothing in noisy areas and minimal filtering in areas
with significant detail. Adaptive filters use statistical measures like local variance or intensity
range to adapt the kernel response, preserving important features while reducing unwanted noise.
Smoothing filters are often applied as a precursor to other image processing operations to ensure
consistency and to minimize the impact of noise on subsequent analysis. However, care must be taken to
balance noise reduction and detail preservation, as excessive smoothing can lead to information loss.
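As a brief illustration of the smoothing filters discussed above, the sketch below applies a 5×5 average (box) filter and a 5×5 Gaussian filter with OpenCV; the file names and kernel sizes are placeholders.

```python
import cv2

# Apply two smoothing filters to a noisy grayscale image (placeholder file).
img = cv2.imread("noisy.png", cv2.IMREAD_GRAYSCALE)

# Average (box) filter: every pixel in the 5x5 window gets equal weight.
box_smoothed = cv2.blur(img, (5, 5))

# Gaussian filter: weights fall off with distance from the window centre,
# giving a smoother, more natural blur.
gauss_smoothed = cv2.GaussianBlur(img, (5, 5), 1.0)

cv2.imwrite("box_smoothed.png", box_smoothed)
cv2.imwrite("gauss_smoothed.png", gauss_smoothed)
```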
12.1.2.2 Sharpening Filters:
Sharpening filters are used to enhance the edges and fine details in an image by increasing the contrast
between adjacent pixels with differing intensities. Unlike smoothing filters, which aim to suppress high-
frequency components (noise), sharpening filters accentuate these components to make image features
more pronounced.
Sharpening is particularly useful in applications requiring detailed visual analysis, such as medical
imaging, satellite photography, and industrial inspection. It is commonly achieved by applying derivatives
to detect rapid changes in intensity [43].
Mathematically, contrast stretching can be expressed as:

$$I'(x,y) = \frac{I(x,y) - I_{min}}{I_{max} - I_{min}} \times (N_{max} - N_{min}) + N_{min}$$

Where:
• I(x,y) is the original image intensity at location (x,y).
• Imin, Imax are the minimum and maximum intensity values in the input image.
• Nmin, Nmax define the new desired range (e.g., 0 to 255).
• I’(x,y) is the enhanced pixel intensity.
This transformation stretches the input intensity values linearly across the desired output range. As a
result, features that were previously indistinguishable due to low contrast become more prominent.
However, contrast stretching is a global technique, which means that the same transformation is applied
across the entire image. While this works well in many cases, it may not be effective if contrast variation
is localized. In such cases, local contrast enhancement techniques or adaptive histogram equalization may
yield better results [46].
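A minimal NumPy sketch of the global contrast stretch defined above; it assumes the image has at least two distinct intensity levels, and the sample data are illustrative.

```python
import numpy as np

# Global contrast stretching: map [Imin, Imax] linearly onto [Nmin, Nmax].
def contrast_stretch(img: np.ndarray, n_min: int = 0, n_max: int = 255) -> np.ndarray:
    img = img.astype(float)
    i_min, i_max = img.min(), img.max()      # assumes i_max > i_min
    out = (img - i_min) / (i_max - i_min) * (n_max - n_min) + n_min
    return out.astype(np.uint8)

# Example: a low-contrast image with intensities between 90 and 160.
low_contrast = np.random.randint(90, 161, size=(4, 4), dtype=np.uint8)
print(contrast_stretch(low_contrast))
```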
12.1.3 Frequency Domain Methods:
Frequency domain methods enhance images by modifying their frequency content rather than directly
manipulating individual pixel values. This is accomplished by transforming the image into the frequency
domain using mathematical tools like the Fourier Transform, processing the transformed data, and then
converting it back into the spatial domain, as shown in Fig.12.17.
After applying a filter H(u,v), the filtered image in the frequency domain becomes:
G(u,v) = H(u,v) · F(u,v)
Then, the inverse Fourier transform retrieves the enhanced spatial image:
g(x,y) = F⁻¹{G(u,v)}
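As a rough sketch of this pipeline, the NumPy code below transforms an image with the FFT, applies an ideal low-pass filter H(u,v), and inverts the transform; the cut-off radius and the random stand-in image are assumptions.

```python
import numpy as np

# Frequency-domain smoothing: F(u,v) -> G(u,v) = H(u,v) * F(u,v) -> g(x,y).
def lowpass_frequency_filter(img: np.ndarray, radius: int = 30) -> np.ndarray:
    F = np.fft.fftshift(np.fft.fft2(img))              # centred spectrum F(u, v)
    rows, cols = img.shape
    u, v = np.ogrid[:rows, :cols]
    dist = np.sqrt((u - rows / 2) ** 2 + (v - cols / 2) ** 2)
    H = (dist <= radius).astype(float)                 # ideal low-pass filter H(u, v)
    G = H * F                                          # filtered spectrum G(u, v)
    g = np.fft.ifft2(np.fft.ifftshift(G))              # back to the spatial domain
    return np.real(g)

smoothed = lowpass_frequency_filter(np.random.rand(128, 128))  # stand-in image
```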
12.1.4.1 Principles of Fuzzy Logic in Image Processing
Fuzzy logic deals with reasoning that is approximate rather than fixed and exact. In the context of image
processing, fuzzy logic enables the system to make decisions based on degrees of truth rather than the
conventional true or false logic used in classical digital systems. This leads to better adaptability and
tolerance to variations in input data [49].
Fuzzy domain enhancement typically involves three main steps:
1. Fuzzification
• In this step, the pixel intensity values of the input image are transformed into fuzzy
values using membership functions.
• Each intensity level is assigned a degree of belonging to different fuzzy sets (e.g., dark,
medium, bright).
• Common membership functions include triangular, trapezoidal, and Gaussian shapes.
2. Membership Function Modification
• Enhancement rules are applied to the fuzzy values to modify the degree of membership.
• These rules can be based on expert knowledge or adaptive algorithms and aim to improve
specific image characteristics such as contrast or brightness.
• For instance, rules may amplify the membership of brighter pixels in high-intensity fuzzy
sets to make bright areas more prominent.
3. Defuzzification
• The adjusted fuzzy values are converted back to crisp pixel intensities.
• Techniques such as the centroid method or maximum membership principle are used.
• The resulting image has enhanced features with improved visibility and clarity.
12.1.4.2 Mathematical Representation:
Let f(x,y) be the intensity at pixel (x,y). The fuzzification process maps this intensity to a fuzzy set using a
membership function µ(f(x,y)). After enhancement via fuzzy rule application (represented as
transformation T ), the defuzzified output g(x,y) is obtained:
g(x,y) = D[T(µ(f(x,y)))]
Where D represents the defuzzification function.
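The three steps above can be sketched with a simple linear membership function and the classic intensification (INT) rule; this is a minimal illustration, not the method used by any particular system.

```python
import numpy as np

def fuzzy_enhance(img: np.ndarray) -> np.ndarray:
    f = img.astype(float)
    f_min, f_max = f.min(), f.max()

    # 1. Fuzzification: map intensities to membership degrees in [0, 1].
    mu = (f - f_min) / (f_max - f_min)

    # 2. Membership modification: push memberships towards 0 or 1
    #    (the classic INT operator), which raises contrast.
    mu_mod = np.where(mu <= 0.5, 2 * mu ** 2, 1 - 2 * (1 - mu) ** 2)

    # 3. Defuzzification: map memberships back to crisp intensities.
    return (mu_mod * (f_max - f_min) + f_min).astype(np.uint8)

enhanced = fuzzy_enhance(np.random.randint(0, 256, (64, 64), dtype=np.uint8))
```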
12.1.4.3 Advantages of Fuzzy Domain Techniques
• Adaptability: Can dynamically adjust enhancement based on local image characteristics.
• Robustness: Effective in handling images with varying levels of noise or indistinct features.
• Visual Quality: Produces images that align more closely with human perception.
12.1.4.4 Limitations
• Computational Complexity: Fuzzy systems are generally more resource-intensive than
traditional enhancement methods.
• Parameter Tuning: Requires careful design and calibration of membership functions and rules.
• Scalability: May need adaptation for large-scale or high-resolution image processing tasks.
12.1.4.5 Applications
• Medical Imaging: Enhancing soft tissues in MRI or ultrasound images where boundary
definitions are subtle.
• Remote Sensing: Handling satellite images with mixed terrain categories and fuzzy land-cover
boundaries.
Chapter 13
MULTI-POSE GUIDED VIRTUAL TRY-ON SYSTEMS (MPGVTOS)
Gurnoor Singh1, Sumit Chopra2 and Gagandeep Singh3
1,2,3GNA University, Phagwara
13.1 Abstract:
Virtual Try-On (VTO) systems are of growing significance in the ever-changing e-commerce environment, particularly in the fashion retail sector. As consumers continue
to move towards shopping online, the greatest challenge remains the inability to try on apparel physically,
potentially resulting in ambiguity about fit, style, and overall look. Classic VTO solutions try to solve this
shortcoming, but they struggle because they are based on fixed poses and little flexibility, necessitating
manual adjustments to provide a slightly realistic experience. In an effort to address these limitations, this
chapter introduces a new paradigm—Multi-Pose Guided Virtual Try-On System (MPGVTOS).
MPGVTOS greatly improves the virtual shopping experience by providing a realistic and dynamic view
of garments that self-adapt to varied human poses. This ability not only enhances user interaction but also
enhances confidence in buying decisions by better simulating how clothing would look in actual life. For
companies, MPGVTOS supports higher conversion rates and lower return rates, which result in higher
customer satisfaction and a more sustainable shopping model. In addition, the system's design is future-proof, with scope for further integration with emerging technologies like virtual reality (VR),
footwear visualization, and real-time fabric simulation. Such advancements make MPGVTOS a
revolutionary tool in shaping the way people interact with fashion online.
13.2 Introduction:
The introduction establishes the context of the fashion industry’s digital transformation, focusing
specifically on the role of Virtual Try-On (VTO) systems. As online shopping becomes increasingly
popular, there is a rising demand for tools that bridge the gap between digital and physical shopping
experiences. VTOs serve this purpose by enabling customers to visualize how clothes might look on them
using technologies like computer vision, augmented reality (AR), and machine learning (ML) [52].
However, many of the current VTO systems are limited in that they only work with static images. They
cannot adjust dynamically to a user's varying poses, which significantly reduces the realism and appeal of
the experience. This shortfall makes it harder for consumers to make confident purchasing decisions and
increases the likelihood of returns. To address these limitations, the Multi-Pose Guided Virtual Try-On
System (MPGVTOS) is introduced. MPGVTOS brings a new dimension to VTO by incorporating
dynamic pose adaptation, allowing garments to be visualized in sync with various human stances and
movements. This improvement not only enhances interactivity and realism but also builds greater
confidence in online purchases. Furthermore, the integration of powerful AI technologies such as
SnapML and MobileNet with AR development platforms like Lens Studio enables the creation of rich,
immersive, and efficient virtual try-on environments. The introduction positions MPGVTOS as a cutting-
edge solution capable of redefining the online fashion retail experience, as shown in Fig.13.1 [53].
Many current VTO systems perform adequately in controlled environments but tend to struggle with the variability and diversity needed for large-scale
deployment. Another constraint is hardware compatibility. Most VTO systems are computationally
demanding, using high-end machines and powerful GPUs, which makes them incompatible with use on
regular smartphones or low-end consumer devices. In addition, such systems usually cannot cope with
real-world variability like unreliable lighting conditions, mixed backgrounds, and non-uniform body
shapes or poses. These challenges make the virtual try-on experience less reliable and consistent for end
users [56].
One of the more viable but challenging solutions within the domain comes from hybrid approaches,
including those presented by Van Duc et al. (2023) [53], that merge methods like head swapping and
body reshaping. These systems are able to create highly realistic images by altering facial direction to fit
the desired pose and remodeling the body to how varying garments would fit. Yet, the computational
needs of such models are very high. They require high-end processors and a lot of memory, which
restricts their use and feasibility for general consumer use, particularly on mobile devices.
Aside from technical limitations, there are also significant social and ethical factors that influence the use
of VTO systems. Privacy is the primary concern since users may be required to expose private images or
enable camera feeds for the applications to function optimally. This open concerns regarding data storage,
stakeholders with access to data, and whether the proper consent mechanism is present. Secondly, the
cultural setting within which VTO technology is applied largely shapes user acceptance. In some cultures, virtual undressing or body scanning may be considered intrusive or inappropriate, which constrains the marketability of such systems in certain regions [57].
With such problems in mind, there is a pressing need for next-generation VTO solutions that are context-
dependent, lightweight, and user-focused. These systems need to be optimized to run efficiently on a
broad spectrum of devices, safeguard user data, be culturally sensitive, and provide high levels of
interactivity and realism. The new Multi-Pose Guided Virtual Try-On System (MPGVTOS) is intended to
address these issues through a smart combination of AR, ML, and optimized system design [59].
13.4 Evolution and Impact of Virtual Try-On (VTO) in Fashion E-Commerce:
The evolution of Virtual Try-On (VTO) technology in online fashion retailing has been a gradual process driven by technological advances, market forces, and changes in global retailing behavior. The early VTO
technologies were basic, based on 2D static overlays of clothing on top of users' uploaded images. They
were not depth-enabled, lacked dimensional precision, and were not interactive. Customers would
typically experience constraints like clothing misfit, improper size estimations, and unrealistic fit, leading
to dissatisfaction and continued use of conventional shopping.
The COVID-19 pandemic was perhaps one of the most critical junctures for VTO, as it significantly
changed the way consumers shopped. With physical retail stores having to close or restrict access,
contactless shopping solutions became a necessity rather than an option. Virtual try-on technologies
quickly developed as a hygienic, safe, and interactive substitute for conventional fitting rooms. This
triggered mass investment by retailers and technology firms to simulate the experiential touch of in-store
buying using digital media [58].
Studies performed throughout the pandemic years emphatically attest to the beneficial impact of VTO on
customer behavior and commercial performance. Research has indicated that VTO systems increase
customer confidence by providing a better understanding of product fit and fashion. This online
interaction provides a psychological feeling of ownership and fulfillment, which makes the customer feel
more confident in their purchasing decisions. Therefore, VTO technologies help to increase the rates of
conversion and decrease the rates of product return, overcoming two of the most recurring issues in e-
commerce.
Recent VTO platforms are now integrating state-of-the-art technologies like augmented reality (AR), 3D
body scanning, deep learning, and computer vision. They facilitate real-time garment simulation, more
accurate capturing of user body measurements, and personalized fitting. For example, AR-based
technology can overlay 3D garments on a real-time video capture of the user, creating an interactive
preview of how the clothing drapes on the body, as shown in Fig.13.2.
Fig. 13.2: The interactive buttons on the top left and right help the user switch between clothing items in a hands-free experience
Even with such advancements, seamless realism and usability remain challenging. One of the key
limitations is pose variability—most systems find it difficult to properly fit the clothing to different body
poses, resulting in distortion or unrealistic fitting when the users move. Another area that tends to be
underdeveloped is facial integration and personalization, which can affect the perceived level of realism,
especially in fashion categories such as eyewear, hats, or upper clothing. Furthermore, true digital
capturing of fabric qualities, including texture, drape, stretch, and transparency, still poses a technological
challenge given variability in material performance under contrasting conditions of light and motion.
These issues serve to highlight the necessity for ongoing innovation and optimization in VTO systems.
Improving these areas will not only enhance virtual try-on experience realism and user satisfaction but
will also increase brand credibility, lower logistical costs associated with returns and aid the fashion
industry's sustainability objectives by reducing product return waste [59].
13.5 What are Generative Adversarial Networks (GANs)?
Generative Adversarial Networks (GANs) are a class of machine learning models suited to unsupervised learning, introduced by Ian Goodfellow et al. in 2014. A GAN is formed by two neural networks — the Generator (G) and the Discriminator (D) — trained together in a game-theoretic setting. The two networks compete with each other, pushing one another to improve over time, as shown in Fig.13.3.
The ideal outcome is when the Generator produces such high-quality outputs that the Discriminator
cannot distinguish them from real data, i.e., outputs are indistinguishable from real samples.
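The adversarial training described above can be sketched in a few lines of PyTorch; the tiny fully connected networks, the random "real" data, and the hyperparameters below are placeholders for illustration only and are not the models used in garment simulation.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(batch, data_dim)        # stand-in for real samples
    fake = G(torch.randn(batch, latent_dim))   # Generator output

    # Discriminator: label real samples 1 and generated samples 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the Discriminator output 1 for fake samples.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```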
13.5.2 Use of GANs in 3D Garment Simulation and Fashion AI:
3D garment simulation involves replicating the visual appearance, fit, texture, and physics of clothing
in a digital space. Traditional methods rely on physics-based simulation, which can be computationally
expensive. GANs offer a data-driven alternative to generate high-quality, realistic garments with faster
computation given in Fig.13.4 [52].
6. Physics-Guided GANs:
• Recent advances incorporate physics priors into GAN training to ensure the generated
garments obey realistic motion and material constraints (e.g., stretching, gravity,
collisions).
• These hybrid models balance visual realism and physical plausibility.
13.6 System Architecture of MPGVTOS:
The architecture of the Multi-Pose Guided Virtual Try-On System (MPGVTOS) is a multi-level
framework that combines machine learning, computer vision, and augmented reality technology to
provide an end-to-end interactive try-on experience. The architecture is optimized to provide real-time
response, pose flexibility, and photorealistic visual representation of garments even on mobile devices. It
has various fundamental components, each of which plays a distinctive role in the virtual try-on process
[63].
A. Augmented Reality Framework: Lens Studio Integration:
The Augmented Reality (AR) Infrastructure of the Multi-Pose Guided Virtual Try-On System
(MPGVTOS) is constructed on top of Lens Studio, a mature and developer-oriented platform created by
Snap Inc. for building rich AR experiences. Lens Studio offers an extensive framework for incorporating
machine learning models, 3D content, gesture control, and physics simulations into desktop and mobile
apps. Its feature set makes it well-fitted for fashion retail applications that require real-time response and
visual realism [64].
Lens Studio is an AR development platform that allows creators to develop and distribute interactive
"lenses" — AR filters and experiences — that are mostly designed for use on Snapchat. It has support for
both 2D and 3D asset rendering, object tracking, and gesture detection, and it can be used with SnapML,
which supports machine learning models to be directly inserted into AR experiences.
For virtual try-on systems, Lens Studio is beneficial because:
i. It facilitates lightweight deployment on mobile devices.
ii. It has integrated real-time rendering to support interactive clothing visualization.
iii. It provides a straightforward scripting platform via JavaScript and graphical node-based
programming.
Lens Studio provides real-time tracking and rendering capabilities, making it suitable for mobile AR
applications where responsiveness and realism are critical.
B. Machine Learning Integration: SnapML and ONNX Models:
The machine learning that drives MPGVTOS is fueled by SnapML (Snap Machine Learning), which
allows for the embedding of trained ML models into the AR world. The framework accommodates model
types such as ONNX (Open Neural Network Exchange), TensorFlow Lite, and Core ML, providing
flexibility and platform portability.
Some of the main functionalities driven by machine learning are:
i. Object detection (through CenterNet): Finds the user's body, limbs, and face in the frame for
precise garment alignment.
ii. Pose estimation: Recognizes user pose to allow dynamic adjustment of clothing.
iii. Facial recognition (through Viola-Jones): Provides garment fit and continuity, particularly for
accessories or upper-body garments.
iv. Gesture classification: Recognizes and interprets hand gestures for user interaction.
These models are trained and optimized for lightweight performance to provide seamless operation on
mobile devices.
C. Model Development Pipeline:
The training of ML models for MPGVTOS is done through a systematic training pipeline:
i. Dataset Preparation:
1. A portion of the COCO dataset is utilized, with particular emphasis on human figures, poses, and
types of garments.
2. Data augmentation methods (scaling, rotation, flipping) enhance robustness.
ii. Model Training:
1. MobileNet v2 is used as the backbone neural network for feature extraction because it is efficient
and accurate for mobile use.
2. CenterNet architecture is employed for center point detection of objects and size and pose
estimation from central points, minimizing computation while preserving accuracy.
iii. Model Export and Conversion:
1. Trained models are exported into ONNX format for SnapML compatibility.
2. Models are then imported into Lens Studio and associated with the AR interaction logic.
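As a hedged sketch of step iii, the snippet below exports a stock MobileNet v2 backbone to ONNX with PyTorch and torchvision; the input resolution, file name, and tensor names are assumptions, and the actual MPGVTOS models (including the CenterNet head and custom training) are not shown.

```python
import torch
import torchvision

# Export a stock MobileNet v2 backbone to ONNX (placeholder names/sizes).
model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input resolution

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",        # file later imported into SnapML / Lens Studio
    input_names=["image"],
    output_names=["features"],
    opset_version=11,
)
```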
D. 3D Garment Design and Simulation:
Clothing assets are created in Blender, an open-source powerful 3D modeling tool. The garments are:
1. Sculpted to fit real-world measurements and fabric behaviors.
2. Textured with realistic fabric patterns and colors.
3. Exported in FBX format, which is AR deployable using Lens Studio.
Advanced physics-based parameters, like gravity, motion dynamics, and collision matrices, are set up to
create realistic draping and garment response to body movement.
E. Interaction Design and Gesture Control:
For better usability, MPGVTOS supports hands-free interaction mechanisms:
1. Hand-tracking modules recognize when users point towards AR interface buttons.
2. Dwell time detection (i.e., holding the hand over the button for 2–3 seconds) is utilized as a
trigger to toggle through outfits or accessories.
3. This design avoids touch-based inputs, making it more accessible and immersive.
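The dwell-time trigger can be expressed as a small state machine; the Python sketch below is illustrative only (the hand-tracking source, button geometry, and 2.5-second hold time are assumptions), whereas the actual implementation would live in Lens Studio's scripting environment.

```python
import time

DWELL_SECONDS = 2.5  # assumed hold time before a button fires

class DwellButton:
    def __init__(self, name, region):
        self.name = name
        self.region = region        # (x_min, y_min, x_max, y_max) in screen space
        self.hover_start = None

    def update(self, hand_xy):
        """Return True when the tracked hand has dwelt on the button long enough."""
        x, y = hand_xy
        x0, y0, x1, y1 = self.region
        if x0 <= x <= x1 and y0 <= y <= y1:
            if self.hover_start is None:
                self.hover_start = time.monotonic()       # hand entered the button
            elif time.monotonic() - self.hover_start >= DWELL_SECONDS:
                self.hover_start = None
                return True                               # trigger: switch outfit
        else:
            self.hover_start = None                       # hand left: reset timer
        return False
```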
F. Real-Time Rendering and Feedback Loop:
The system processes frames in 14–17 milliseconds, enabling real-time interaction even on mid-range
smartphones. The performance loop includes:
1. Image capture and analysis.
2. Pose and face detection.
3. Garment mesh alignment.
4. Visual rendering and user feedback.
This loop is continuously updated to ensure that the garments remain anchored correctly to the user’s
movements and adapt to pose changes fluidly.
13.7 3D Garment Modeling:
1. Blender for 3D Modeling: MPGVTOS uses Blender, a powerful open-source 3D creation tool,
to design digital garments. Each item of clothing is modeled from scratch using polygonal meshes
that define the shape, volume, and structure of the garment. This allows precise control over
garment dimensions and tailoring [66].
2. Texturing and Shading: After the basic structure is created, designers apply realistic fabric
textures and colors. UV mapping is used to accurately project 2D textures onto the 3D surface.
Blender’s shader nodes help simulate material properties such as glossiness, transparency, and
softness, making fabrics look authentic (e.g., denim vs. silk).
3. Rigging for Dynamic Fit: Rigging involves binding the 3D garment to a skeletal structure
(armature) so that it moves naturally with the user's body. Weight painting is used to define how
different garment sections respond to joint movements, allowing for natural stretching and
bending [67].
4. Physics-Based Simulation: Cloth physics are configured to replicate real-world behavior under
gravity and motion. These include collision detection with the user’s body, draping effects, and
interaction between multiple layers of clothing, enhancing realism.
5. Export to FBX Format: The final 3D models are exported in the FBX format, which retains all
geometry, textures, and rigging information. This format is compatible with AR platforms like
Lens Studio, where the garments are rendered in real time on users during the virtual try-on
process.
13.8 Machine Learning Workflow:
1. Dataset Preparation with COCO: The system leverages a curated subset of the COCO
(Common Objects in Context) dataset focused on human figures and wearable items. This rich
dataset provides labeled images that help the model learn to identify human anatomy, poses, and
clothing types under varied conditions.
Explanation: Many budget or older smartphones may struggle with rendering high-resolution clothing
models or performing live pose estimation. The inclusion of advanced features like fabric dynamics,
shading effects, and real-time physics simulations can further strain device capabilities, resulting in
slower frame rates or thermal throttling. This makes wide-scale deployment across various device types a
challenge, unless further model compression or cloud-based computation is implemented.
4. Limited Realism in Fabric Behavior:
Key Issue: While the garments are visually realistic, simulating fabric behavior under motion, gravity,
and environmental effects (e.g., wind or collisions) still lacks complete accuracy.
Explanation: Different materials—such as silk, denim, leather, or lace—react uniquely to movement and
external forces. Accurately modeling these responses in real time requires physics-based cloth simulation
engines that factor in fabric weight, elasticity, and resistance. Most current VTO systems, including
MPGVTOS, use simplified physics approximations which can compromise realism, especially when
garments are layered or in motion.
5. Gesture-Based UI Limitations:
Key Issue: Although hands-free gesture controls enhance interactivity, they are often sensitive to
environmental factors such as lighting conditions, background noise, or camera quality.
Explanation: In low-light environments or when using low-resolution front-facing cameras, hand
gestures may not be accurately detected. Additionally, unintentional movements or delays in gesture
recognition can lead to poor user experience or frustration. Robust gesture recognition requires
sophisticated algorithms that account for spatial and temporal data, potentially increasing the system’s
complexity and resource usage.
6. Privacy and Ethical Concerns:
Key Issue: The use of facial and body tracking technologies raises privacy concerns among users.
Explanation: Virtual try-on systems that involve real-time video processing and body detection often
require access to sensitive visual data. Users may be hesitant to grant camera access or uncomfortable
with how their visual data is stored or processed, particularly if the system lacks transparent data handling
policies. Additionally, developers must consider the ethical implications of body image representation,
ensuring that the system does not inadvertently promote unrealistic beauty standards or bias toward
certain body types.
7. Lack of VR Integration:
Key Issue: MPGVTOS currently operates within a 2D or AR interface without integration into fully
immersive virtual reality (VR) environments.
Explanation: Although AR provides an interactive experience, VR would offer a more immersive and
lifelike try-on experience, enabling users to walk around, inspect garments from multiple angles, and even
interact with virtual store environments. However, VR implementation introduces its own challenges,
including the need for specialized hardware, motion tracking, and higher system requirements.
13.10 Conclusion:
The advent of Multi-Pose Guided Virtual Try-On Systems (MPGVTOS) represents a breakthrough in the
confluence of fashion technology, augmented reality, and artificial intelligence. With e-commerce
reshaping the retail industry, consumers increasingly crave experiences that span the divide between
physical and digital shopping. MPGVTOS answers this demand directly by allowing users to interact with
garments within a virtual environment in real-time, providing a natural, dynamic, and interactive try-on
experience that was heretofore impossible with standard 2D or static VTO systems [70].
This chapter has provided an in-depth introduction to MPGVTOS—its motivations, structure, underlying
technologies, and applications. Through the combination of 3D garment modeling via Blender, machine
learning-based pose recognition using MobileNet v2 and CenterNet, and real-time AR rendering via Lens
Studio, MPGVTOS provides a seamless, highly responsive interface that dynamically adjusts to varied
human postures and offers a realistic simulation of fabric behavior and garment fit. The system’s use of
gesture-based controls and real-time feedback mechanisms further enhances user engagement, making it a
promising tool for both online retailers and fashion consumers.
Notably, the system does not merely offer visual gratification. MPGVTOS remedies some of the ongoing
issues of online clothing purchases—like size misrepresentation, high return rates, and customer
reluctance due to the absence of haptic feedback. By presenting a try-on solution that is both visually
engaging and functionally responsive, the system can potentially improve customer confidence, increase conversion rates, decrease product returns, and support more sustainable retailing practices [71].
Still, there are challenges to implementing MPGVTOS. These include self-occlusion, computational complexity, limited datasets, and the realism of fabric simulation, all of which must be overcome to enable wider use and user acceptance. Also critical to developing virtual try-on systems that are equitable and
universally acceptable are concerns regarding privacy, accessibility, and cultural acceptability. In
addition, the system's present AR-based deployment, though effective, would be substantially improved
upon by integrating with full-immersion virtual reality (VR) environments to create an enriched, more
immersive shopping experience.
In the future, the potential of VTO systems such as MPGVTOS depends on further innovation and cross-
disciplinary work. By integrating state-of-the-art computer vision, deep learning, and real-time AR/VR
technologies with human-centered design, researchers and developers can design next-generation virtual
shopping platforms that are not only technologically superior but also inclusive, ethical, and strongly
aligned with consumer requirements.
In summary, MPGVTOS is not just a technical product; it is a window into the future of online fashion
shopping, where technology facilitates creativity, ease, and confidence in the buying process. As the industry evolves, technologies such as MPGVTOS will become pivotal in shaping the way fashion is experienced, personalized, and consumed in a digitally enabled world.
REFERENCES:
1. https://www.tableau.com/visualization/what-is-data-visualization
2. https://www.geeksforgeeks.org/data-visualization-and-its-importance/
3. Aigner, W., Miksch, S., Schumann, H., & Tominski, C. (2011). Visualization of Time-Oriented
Data. Springer Science & Business Media.
4. Mack C, Su Z, Westreich D. Rockville (MD): Agency for Healthcare Research and Quality (US); 2018 Feb.
5. Bertin, J. (1983). Semiology of Graphics: Diagrams, Networks, Maps. University of Wisconsin
Press.
6. Bremer, N., & Wu, S. (2018). Data Sketches. CRC Press.
7. Cairo, A. (2016). The Truthful Art: Data, Charts, and Maps for Communication. New Riders.
8. Cairo, A. (2019). How Charts Lie: Getting Smarter about Visual Information. W. W. Norton &
Company.
9. Camoes, J. (2016). Data at Work: Best practices for creating effective charts and information
graphics in Microsoft Excel. New Riders.
10. Cederbom, C. (2023). Data Analysis and Visualization: A Complete Guide - 2023 Edition. The Art
of Service.
11. Murray, S. (2017). Interactive Data Visualization for the Web (2nd ed.). O'Reilly Media.
12. Cleveland, W. S. (1994). The Elements of Graphing Data. Hobart Press.
13. Dykes, B. (2016). Data Storytelling for Data Management: A Data Quality Approach. Technics
Publications.
14. Evergreen, S. D. H. (2017). Effective Data Visualization: The Right Chart for the Right Data.
SAGE Publications.
15. Few, S. (2017). Data Visualization for Success: A Step-by-Step Guide to Making Effective Data
Visuals. Analytics Press.
16. Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press.
17. Heer, J., Bostock, M., & Ogievetsky, V. (2010). A Tour Through the Visualization Zoo.
Communications of the ACM, 53(6), 59-67.
18. Kirk, A. (2016). Data Visualisation: A Handbook for Data Driven Design. SAGE Publications.
19. Kirk, A. (2019). Data Visualisation: A Handbook for Data Driven Design (2nd ed.). SAGE
Publications.
20. Munzner, T. (2014). Visualization Analysis and Design. CRC Press.
21. Nussbaumer Knaflic, C. (2015). Storytelling with Data: A Data Visualization Guide for Business
Professionals. Wiley.
22. Nussbaumer Knaflic, C. (2019). Storytelling with Data: Let's Practice!. Wiley.
23. O'Dwyer, A. (2021). Data Visualization in Excel: A Guide for Beginners, Intermediates, and
Professionals. Routledge.
24. Rahlf, T. (2019). Data Visualisation with R: 111 Examples. Springer.
25. Schwabish, J. (2021). Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks.
Columbia University Press.
26. Sedlmair, M., Meyer, M., & Munzner, T. (2016). Design Study Methodology: Reflections from the
Trenches and the Stacks. IEEE Transactions on Visualization and Computer Graphics, 18(12),
2431-2440.
27. https://filippovalle.medium.com/the-principle-of-proportional-ink-c8c528d12d4d
28. Tufte, E. R. (2001). The Visual Display of Quantitative Information. Graphics Press.
29. Tufte, E. R. (2020). Seeing with Fresh Eyes: Meaning, Space, Data, Truth. Graphics Press.
30. Ware, C. (2020). Information Visualization: Perception for Design (4th ed.). Morgan Kaufmann.
31. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
32. Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and
Compelling Figures. O'Reilly Media.
33. Wexler, S., Shaffer, J., & Cotgreave, A. (2017). The Big Book of Dashboards: Visualizing Your
Data Using Real-World Business Scenarios. Wiley.
34. Yau, N. (2013). Data Points: Visualization That Means Something. Wiley.
35. Yau, N. (2018). Visualize This: The FlowingData Guide to Design, Visualization, and Statistics.
Wiley.
36. Zelazny, G. (2015). The Say It With Charts Complete Toolkit. McGraw-Hill Education.
37. Zelazny, G. (2015). Say It With Charts: The Executive’s Guide to Visual Communication.
McGraw-Hill Education.
38. Zelazny, G. (2015). Say It With Presentations: How to Design and Deliver Successful Business
Presentations. McGraw-Hill Education.
39. Tableau. (n.d.). Tableau Public Documentation. Retrieved from https://public.tableau.com/en-us/s/
40. Microsoft Power BI. (n.d.). Power BI Documentation. Retrieved from https://docs.microsoft.com/en-us/power-bi/
41. Google Flu Trends. (n.d.). Tracking Flu Outbreaks Using Data Visualization. Retrieved from https://www.google.org/flutrends/
42. The New York Times. (2020). Election Data Visualization and Trends. Retrieved from https://www.nytimes.com/interactive/
43. Aaron J, Chew TL (2021) A guide to accurate reporting in digital image processing – Can anyone reproduce your quantitative analysis? J Cell Sci 134. https://doi.org/10.1242/jcs.254151
44. Bahadır B, Atik OŞ, Kanatlı U, Sarıkaya B (2023) A brief introduction to medical image processing, designing and 3D printing for orthopedic surgeons. Jt Dis Relat Surg 34:451–454. https://doi.org/10.52312/jdrs.2023.57912
45. Gandomi AH (2021) Cyberstalking Victimization Model Using Criminological Theory: A Systematic Literature Review, Taxonomies, Applications, Tools, and Validations
46. Kaur H, Sohi N (2017) A Study for Applications of Histogram in Image Enhancement. Int J Eng Sci 06:59–63. https://doi.org/10.9790/1813-0606015963
47. Ablin R, Sulochana CH, Prabin G (2020) An investigation in satellite images based on image enhancement techniques. Eur J Remote Sens 53:86–94. https://doi.org/10.1080/22797254.2019.1673216
48. Kisekka I, Peddinti SR, Savchik P, et al (2024) Multisite evaluation of microtensiometer and osmotic cell stem water potential sensors in almond orchards. Comput Electron Agric 227:109547. https://doi.org/10.1016/j.compag.2024.109547
49. Zhuang L, Guan Y (2019) Image enhancement using modified histogram and log-exp transformation. Symmetry (Basel) 11:1–17. https://doi.org/10.3390/SYM11081062
50. Browarska N, Kawala-Sterniuk A, Zygarlicki J, et al (2021) Comparison of smoothing filters' influence on quality of data recorded with the Emotiv EPOC Flex brain–computer interface headset during audio stimulation. Brain Sci 11:1–23. https://doi.org/10.3390/brainsci11010098
51. Siddiqi MH, Alhwaiti Y (2022) Signal-to-Noise Ratio Comparison of Several Filters against Phantom Image. J Healthc Eng 2022. https://doi.org/10.1155/2022/4724342
52. Batool, R. and Mou, J. (2024) ‘A systematic literature review and analysis of try-on technology:
Virtual fitting rooms’, Data and Information Management, 8(2), p. 100060.
doi:10.1016/J.DIM.2023.100060.
53. Van Duc, T. et al. (2023) ‘A Hybrid Photorealistic Architecture Based on Generating Facial
Features and Body Reshaping for Virtual Try-on Applications’, Mendel, 29(2), pp. 97–110.
doi:10.13164/mendel.2023.2.097.
54. Feng, Y. and Xie, Q. (2019) ‘Privacy Concerns, Perceived Intrusiveness, and Privacy Controls: An
Analysis of Virtual Try-On Apps’, Journal of Interactive Advertising, 19(1), pp. 43–57.
doi:10.1080/15252019.2018.1521317.
55. Ghodhbani, H. et al. (2022) 'You can try without visiting: a comprehensive survey on virtually try-on outfits', Multimedia Tools and Applications.
56. Goel, P., Mahadevan, K. and Punjani, K.K. (2023) ‘Augmented and virtual reality in apparel
industry: a bibliometric review and future research agenda’, Foresight, 25(2), pp. 167–184.
doi:10.1108/FS-10-2021-0202/FULL/XML.
57. Hu, B. et al. (2022) ‘SPG-VTON: Semantic Prediction Guidance for Multi-Pose Virtual Try-on’,
IEEE Transactions on Multimedia, 24(8), pp. 1233–1246. doi:10.1109/TMM.2022.3143712.
58. Hwangbo, H. et al. (2020) ‘Effects of 3D Virtual “Try-On” on Online Sales and Customers’
Purchasing Experiences’, IEEE Access, 8, pp. 189479–189489.
doi:10.1109/ACCESS.2020.3023040.
59. Islam, T. et al. (2024) ‘Deep Learning in Virtual Try-On: A Comprehensive Survey’, IEEE Access,
12(February), pp. 29475–29502. doi:10.1109/ACCESS.2024.3368612.
60. Lagė, A. et al. (2020) ‘Comparative study of real and virtual garments appearance and distance
ease’, Medziagotyra, 26(2), pp. 233–239. doi:10.5755/j01.ms.26.2.22162.
61. Lavoye, V., Tarkiainen, A., et al. (2023) ‘More than skin-deep: The influence of presence
dimensions on purchase intentions in augmented reality shopping’, Journal of Business Research,
169(August). doi:10.1016/j.jbusres.2023.114247.
62. Lavoye, V., Sipilä, J., et al. (2023) ‘The emperor’s new clothes: self-explorative engagement in
virtual try-on service experiences positively impacts brand outcomes’, Journal of Services
Marketing, 37(10), pp. 1–21. doi:10.1108/JSM-04-2022-0137.
63. Lee, H. and Xu, Y. (2020) ‘Classification of virtual fitting room technologies in the fashion
industry: from the perspective of consumer experience’, International Journal of Fashion Design,
Technology and Education, 13(1), pp. 1–10. doi:10.1080/17543266.2019.1657505.
64. Liu, Yu et al. (2024) ‘Arbitrary Virtual Try-on Network: Characteristics Preservation and Tradeoff
between Body and Clothing’, ACM Transactions on Multimedia Computing, Communications and
Applications, 20(5), pp. 1–12. doi:10.1145/3636426.
65. Ren, B. et al. (2023) ‘Cloth Interactive Transformer for Virtual Try-On’, ACM Transactions on
Multimedia Computing, Communications and Applications, 20(4). doi:10.1145/3617374.
66. Savadatti, M.B. et al. (2022) ‘Theoretical Analysis of Viola-Jones Algorithm Based Image and
Live-Feed Facial Recognition’, Proceedings - IEEE International Conference on Advances in
Computing, Communication and Applied Informatics, ACCAI 2022 [Preprint].
doi:10.1109/ACCAI53970.2022.9752590.
67. Zhang, T. (2019) 'The role of virtual try-on technology in online purchase decision from consumers' aspect', Internet Research. doi:10.1108/IntR-12-2017-0540.
68. Aamir, M. et al. (2021) ‘Efficiently Processing Spatial and Keyword Queries in Indoor Venues’.
doi:10.1109/TKDE.2020.2964206.
69. Hu, B. et al. (2022) ‘SPG-VTON: Semantic Prediction Guidance for Multi-Pose Virtual Try-on’,
IEEE Transactions on Multimedia, 24(8), pp. 1233–1246. doi:10.1109/TMM.2022.3143712.
70. Lavoye, V. et al. (2023) ‘The emperor’s new clothes: self-explorative engagement in virtual try-on
service experiences positively impacts brand outcomes’, Journal of Services Marketing, 37(10), pp.
1–21. doi:10.1108/JSM-04-2022-0137.
71. Liu, Yu et al. (2024) ‘Arbitrary Virtual Try-on Network: Characteristics Preservation and Tradeoff
between Body and Clothing’, ACM Transactions on Multimedia Computing, Communications and
Applications, 20(5), pp. 1–12. doi:10.1145/3636426.