UNIVERSITY INSTITUTE OF ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization (CSH-461)
Data extraction is the first and perhaps most important step of the Extract/Transform/Load (ETL) process. With properly extracted data, organizations can gain valuable insights, make informed decisions, and drive efficiency across their workflows.
Data extraction is crucial for almost every organization because many different sources generate large amounts of unstructured data. If the right data extraction techniques are not applied, organizations not only miss out on opportunities but also waste valuable time, money, and resources.
Data Extraction Techniques:
1. Association
2. Classification
3. Clustering
4. Regression
1. Association
The association technique extracts data based on the relationships and patterns between items in a dataset. It works by identifying frequently occurring combinations of items, and these relationships, in turn, reveal patterns in the data.
Furthermore, this method uses "support" and "confidence" parameters to identify patterns within the dataset and make extraction easier. The most frequent use cases for the association technique are invoice and receipt data extraction.
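To make the support and confidence idea concrete, here is a minimal Python sketch; the transactions and item names are hypothetical, not taken from any dataset mentioned above.

```python
# Hypothetical purchase transactions (e.g., items appearing together on receipts).
transactions = [
    {"pen", "notebook", "stapler"},
    {"pen", "notebook"},
    {"notebook", "stapler"},
    {"pen", "notebook", "eraser"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of the union divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"pen", "notebook"}))       # 0.75 -> the pair occurs in 3 of 4 transactions
print(confidence({"pen"}, {"notebook"}))  # 1.0  -> every transaction with a pen also has a notebook
```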
2. Classification
Classification-based data extraction techniques are among the most widely adopted and efficient methods of data extraction. In this technique, data is categorized into predefined classes or labels with the help of predictive algorithms. Models are created and trained on this labelled data to perform classification-based extraction.
A common use case for classification-based data extraction techniques would be in
managing digital mortgage or banking systems.
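As a minimal illustration (not the specific system the text refers to), the sketch below trains a scikit-learn classifier on a few invented, labelled examples and predicts the class of a new record.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled documents described by two simple numeric features:
# [number of currency symbols, number of account-number patterns]
X = [[5, 0], [6, 1], [0, 4], [1, 5], [4, 0], [0, 3]]
y = ["invoice", "invoice", "bank_statement", "bank_statement", "invoice", "bank_statement"]

# Train a predictive model on the labelled data, then classify a new document.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[5, 1]]))   # -> ['invoice']
```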
3. Clustering
Clustering data extraction techniques apply algorithms to group similar data points
into clusters based on their characteristics. This is an unsupervised learning technique
and does not require prior labelling of the data.
Clustering is often used as a prerequisite for other data extraction algorithms to function properly. The most common use case for clustering is extracting visual data from images or posts, where there can be many similarities and differences.
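A minimal clustering sketch with scikit-learn's KMeans is shown below; the two-dimensional points are made up and simply stand in for features extracted from visual data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D feature vectors, e.g. simple descriptors extracted from images.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
    [8.0, 8.2], [7.9, 7.8], [8.3, 8.1],   # another natural group
])

# Group similar points into two clusters; no labels are required (unsupervised).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster id assigned to each point
print(kmeans.cluster_centers_)  # centre of each cluster
```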
4. Regression
Each dataset consists of data with different variables. Regression data extraction
techniques are used to model relationships between one or more independent
variables and a dependent variable.
Regression-based data extraction works with continuous values that define the variables of the entities associated with the data. Most commonly, organizations use regression to identify and model the relationships between dependent and independent variables within their datasets.
Types of Data Extraction
Organizations use several types of data extraction, such as manual, traditional OCR-based, and web scraping. Each data extraction method applies one or more of the techniques discussed earlier.
2. Web Scraping
Web scraping refers to the extraction of data from a website. This data is then
exported and collected in a format more useful for the user, be it a spreadsheet or an
API. Although web scraping can be done manually, in most cases it is done with the
help of automated bots or crawlers as they can be less costly and work faster.
However, in most cases, web scraping is not a straightforward task. Websites come in many different formats and can present obstacles, such as CAPTCHAs, that scrapers must handle.
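A minimal scraping sketch using the widely used requests and BeautifulSoup libraries is shown below. The URL and CSS selector are placeholders; real sites may additionally require handling pagination, rate limits, CAPTCHAs, and their terms of service.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/"   # placeholder page; swap in the real target site

response = requests.get(URL, timeout=10)   # fetch the page
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull out whichever elements matter for the task, e.g. headings or table cells.
headings = [tag.get_text(strip=True) for tag in soup.select("h1")]
print(headings)
```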
3. OCR-based data extraction
Optical Character Recognition or OCR refers to the extraction of data from printed or
written text, scanned documents, or images containing text and converting it into
machine-readable format. OCR-based data extraction methods require little to no
manual intervention and have a wide variety of uses across industries.
OCR tools work by preprocessing the image or scanned document and then identifying
the individual character or symbol by using pattern matching or feature recognition.
With the help of deep learning, OCR tools today can read 97% of the text correctly
regardless of the font or size and can also extract data from unstructured documents.
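One common way to run OCR from Python is the pytesseract wrapper around the Tesseract engine. The sketch below is a minimal example; the file name is a placeholder and the Tesseract binary must be installed separately.

```python
from PIL import Image
import pytesseract

# Load a scanned document or photo containing text (placeholder file name).
image = Image.open("scanned_invoice.png")

# Tesseract segments the characters and returns the recognised text as a string.
text = pytesseract.image_to_string(image)

print(text)
```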
5. AI-enabled data extraction
AI-enabled data extraction is the most efficient way to extract data while reducing errors. It automates the entire extraction process, requiring little to no manual intervention, while also reducing the time and resources invested in the process.
AI-based document processing utilizes intelligent data interpretation to understand the
context of the data before extracting it. It also cleans up noisy data, removes
irrelevant information, and converts data into a suitable format. AI in data extraction
largely refers to the use of Machine Learning (ML), Natural Language Processing (NLP),
and Optical Character Recognition (OCR) technologies to extract and process the data.
6. API Integration
API integration is one of the most efficient methods of extracting and transferring
large amounts of data. An API enables fast and smooth extraction of data from
different types of data sources and consolidation of the extracted data in a centralized
system.
One of the biggest advantages of API is that the integration can be done between
almost any type of data system and the extracted data can be used for multiple
different activities such as analysis, generating insights, or creating reports.
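A typical API-based extraction issues authenticated HTTP requests and consolidates the JSON responses. In the sketch below the endpoint, token, and pagination scheme are hypothetical.

```python
import requests

BASE_URL = "https://api.example.com/v1/orders"    # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

def fetch_all_orders():
    """Page through the API and collect every record in one list."""
    records, page = [], 1
    while True:
        resp = requests.get(BASE_URL, headers=HEADERS,
                            params={"page": page}, timeout=10)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:          # empty page -> no more data
            break
        records.extend(batch)
        page += 1
    return records

orders = fetch_all_orders()
print(len(orders), "records extracted")
```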
7. Text pattern matching
Text pattern matching or text extraction refers to the finding and retrieving of specific
patterns within a given data set. A specific sequence of characters or patterns needs
to be predefined which will then be searched for within the provided data set.
This data extraction type is useful for validating data by finding specific keywords,
phrases, or patterns within a document.
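Pattern matching is usually expressed with regular expressions. The sketch below pulls e-mail addresses and invoice-style numbers out of a text snippet; the patterns and sample text are purely illustrative.

```python
import re

text = """Contact [email protected] about invoice INV-2024-0042,
or [email protected] regarding INV-2024-0107."""

# Predefined character patterns searched for within the data set.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
invoice_pattern = re.compile(r"INV-\d{4}-\d{4}")

print(email_pattern.findall(text))    # ['[email protected]', '[email protected]']
print(invoice_pattern.findall(text))  # ['INV-2024-0042', 'INV-2024-0107']
```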
8. Database querying
Database querying is the process of requesting and retrieving specific information or
data from a database management system (DBMS) using a query language. It allows
users to interact with databases to extract, manipulate, and analyse data based on
their specific needs.
Structured query language (SQL) is the most commonly used query language for
relational databases. Users can specify criteria, such as conditions and filters, to fetch specific records from the database. Database querying is essential for making informed decisions and building data-driven businesses.
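The sketch below uses Python's built-in sqlite3 module to show the querying idea end to end; the table and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

# Create a small table and insert a few sample rows.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("South", 80.5), ("North", 64.0)])

# SQL query with a condition (WHERE) and an aggregation (GROUP BY).
cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 50
    GROUP BY region
    ORDER BY total DESC
""")
print(cur.fetchall())   # e.g. [('North', 184.0), ('South', 80.5)]
conn.close()
```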
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=E7oACf4a24Y&pp=ygUaZGF0YSBleHRyYWN0aW9uIGluIGVuZ2xpc2g%3D
Data Integration Architecture Types
2. Real-time Integration Architecture
Real-time integration architecture enables the continuous flow of data in near-real-
time or real-time from source systems to target systems. It involves capturing data
changes as they occur and immediately propagating them to the target systems.
This architecture is commonly used in scenarios where immediate data availability
and responsiveness are critical, such as online transaction processing (OLTP)
systems or real-time analytics.
4. Extract, Load, Transform (ELT) Architecture
ELT involves extracting data from source systems, loading it into a target system or
data lake, and then performing transformations directly within the target system. ELT
leverages the processing power and scalability of the target system to perform
complex transformations, as opposed to traditional extract, transform, load (ETL)
architectures where transformations happen before loading the data. ELT
architecture is well-suited for scenarios with large volumes of data and where the
target system can handle the transformation processes efficiently.
Data Reduction:
1. Data Cube Aggregation:
A data cube is constructed using aggregation operations on the data.
3. Numerosity Reduction:
In this case, data preprocessing stores only a model (or a reduced representation) of the data and discards the unnecessary raw data.
4. Dimensionality Reduction:
Using various encoding mechanisms, the size of the data can be reduced. Depending on how the reduction is done, data may or may not be lost. If the original data can be successfully reconstructed from the reduced data, the reduction is considered lossless; otherwise, the discarded detail is lost for good.
Data Transformation
Data transformation plays a crucial role in data management. This process reshapes
data into formats that are more conducive to analysis, unlocking its potential to
inform and guide strategic decision-making. It encompasses a spectrum of
techniques such as cleaning, aggregating, restructuring, and enriching, each
designed to refine data into a more usable and valuable asset.
As organizations increasingly rely on data-driven strategies for growth and efficiency,
understanding and mastering data transformation becomes essential. This guide
delves into the intricacies of data transformation, exploring its role, methodologies,
and impact on the overall data integration process.
Tools and Technologies for Data Transformation
Data transformation tools are diverse, each designed to address specific aspects of
data transformation. These tools can be broadly categorised as follows:
1.ETL (Extract, Transform, Load) Tools: These tools are fundamental in data
warehousing. They extract data from various sources, transform it into a suitable
format or structure, and then load it into a target system or database.
2.Data Cleaning Tools: Focused on improving data quality, these tools help in
identifying and correcting errors and inconsistencies in data.
Video Link:
https://www.youtube.com/watch?v=Kq4QgbhkqyE&pp=ygUuZGF0YSBpbnRlZ3JhdGlvbiBpbiBkYXRhIHZpc3VhbGlzYXRpb24gZW5nbGlzaA%3D%3D
Data visualization helps us make sense of data by simplifying it and presenting it
in a clear way that is easy to understand. It can represent anything from complex
economic trends to simple relationships between different variables.
Data visualization can be done on paper or computer screens using software
programs such as Excel or Tableau; however, sometimes, it’s done using physical
objects such as maps or models. It helps us see patterns and relationships
between different pieces of information, which can help us make better and more
efficient decisions about our lives and businesses — whether looking at real
estate prices or trying to find out what your customers want most from your
company.
Importance of Data Visualizations
Video Link:
https://www.youtube.com/watch?v=MiiANxRHSv4&pp=ygU2cm9sZSBvZiB2aXN1YWxpemF0aW9uIGluIGRhdGEgdmlzdWFsaXphdGlvbiBpbiBlbmdsaXNo
Scatter Plots:
Scatter plots are used to display the relationship between two continuous
variables. Each data point is represented by a dot on the plot, and the position of
the dot corresponds to the values of the variables being compared. Scatter plots
are ideal for identifying trends, clusters, and outliers in the data.
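A minimal matplotlib sketch of a scatter plot; the data is randomly generated purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)   # a noisy linear relationship

plt.scatter(x, y, alpha=0.7)                    # one dot per observation
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatter plot of two continuous variables")
plt.show()
```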
Line Charts:
Line charts are commonly used to show trends and changes over time. They
connect data points with lines, making it easy to visualize the progression of a
variable. Line charts are effective for illustrating patterns, seasonality, and
comparing multiple time series.
Bar Charts:
Bar charts are widely used to compare categories or discrete variables. The length
of each bar represents the value of the variable being displayed, and the bars can
be arranged vertically or horizontally. Bar charts are ideal for visualizing
frequency, distribution, and comparisons between different groups or categories.
Histograms:
Histograms provide a visual representation of the distribution of a continuous
variable. They group data into bins or intervals, and the height of each bar
represents the frequency or count of observations within that bin. Histograms are
useful for understanding the shape, central tendency, and spread of data.
Pie Charts:
Pie charts represent the proportion or percentage of different categories within a
whole. They are circular in shape, with each category represented by a slice of the
pie. Pie charts are suitable for showcasing the contribution of each category to
the whole and making comparisons between different parts.
Box Plots:
Box plots, also known as box-and-whisker plots, provide a visual summary of the
distribution of a continuous variable. They display the minimum, maximum,
median, and quartiles of the data. Box plots are useful for identifying outliers,
comparing distributions, and understanding the spread and skewness of the data.
Heatmaps:
Heatmaps use color gradients to represent the magnitude or intensity of values in
a matrix or two-dimensional dataset. They are particularly effective for visualising
large datasets and identifying patterns or clusters. Heatmaps are commonly used
in areas such as genomics, finance, and data analysis.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=xVWKPSIBDIQ&pp=ygUyY2hhcnRzIGFuZCBwbG90cyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
Importance of Data Cleaning
Clean data is fundamental for robust analysis and accurate modelling in data
science. If data is not adequately cleaned and prepared before analysis, several
significant problems can arise, impacting the reliability and accuracy of any
subsequent analysis or decision-making processes. Data that hasn’t been cleaned
appropriately can lead to questionable results. This can reduce the confidence
that stakeholders have in the analysis and its implications, potentially causing
delays in decision-making. Here are some reasons highlighting the importance of
data cleaning:
3. Effective Decision-Making:
Analyzing and presenting unclean data can reduce the credibility of the analysis
and the individuals or organizations responsible for it. Stakeholders may lose trust
in the accuracy and reliability of the findings. Decision-making is supported by
trustworthy data that is free of biases and inaccuracies. Managers and
stakeholders can confidently rely on the results of analyses based on clean data.
4. Avoidance of biased insights:
Inaccuracies in data can introduce biases that skew analysis and conclusions.
Data containing biases, such as imbalances, duplicate records, or incomplete
entries, can result in biased conclusions that may favor a particular group or
perspective. This can be detrimental in various domains, including business,
healthcare, finance, and the social sciences. Data cleaning helps mitigate these
biases and ensures that the insights drawn are representative and unbiased.
8. Cost Savings:
Data cleaning reduces unnecessary costs associated with erroneous data, such as
marketing to incorrect addresses or targeting the wrong audience. Analysing
unclean data can lead to wasted resources and time spent on investigating false
leads, correcting errors, and redoing analyses. In terms of both time and money,
this can be expensive. Correcting errors early in the data lifecycle is typically
more cost-effective than addressing issues later, especially after analysis or when
errors have propagated throughout the organization.
Data Cleaning Techniques
2. Outlier Detection and Treatment
Data points that deviate significantly from other observations are known as
outliers. Analysis and models can be distorted by outliers. Techniques like the Z-
score method or the IQR (interquartile range) method help identify and handle
outliers appropriately, ensuring they don’t skew the analysis. Managing outliers is
crucial for robust analysis.
· Visual and Statistical Inspection: Calculate summary statistics (e.g., mean, median, standard deviation, quartiles) to understand the central tendency and spread of the data; unusually large or small values may indicate outliers. Use histograms, box plots, or scatter plots to visualize and spot outliers.
· Statistical Methods: Employ statistical techniques like the Z-score or the IQR (interquartile range) to detect and handle outliers, either by removal or transformation. Calculate the Z-score for each data point; points with a Z-score beyond a threshold (e.g., 3) are considered outliers. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), and outliers fall outside the range defined by Q1 − 1.5 * IQR and Q3 + 1.5 * IQR (see the sketch below).
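The following sketch applies both rules to a small made-up series: the Z-score flags points far from the mean, and the IQR rule flags points outside Q1 − 1.5·IQR to Q3 + 1.5·IQR.

```python
import numpy as np

values = np.array([12, 11, 13, 12, 10, 11, 12, 13, 11, 12, 10, 13, 12, 11, 95])  # 95 is an obvious outlier

# Z-score rule: points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)   # both rules flag 95
```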
3. Data Transformation and Standardization
Data transformation includes converting data into a format that is acceptable for
analysis. This may include scaling features, encoding categorical variables, or
creating new features that are more informative for the intended analysis.
Standardizing data ensures uniformity and consistency, simplifying subsequent
analysis.
· Scaling: Scale numerical features to a specific range (e.g., [0, 1] or [-1, 1]) to
provide equal importance to all features.
· Normalization and Standardization: Normalization scales the features to the range between 0 and 1, while standardization transforms the data to have a mean of 0 and a standard deviation of 1, making features easier to compare and analyze (a small sketch follows).
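A minimal scikit-learn sketch of both operations on a single made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One numerical feature with very different magnitudes (hypothetical incomes).
X = np.array([[20_000.0], [35_000.0], [50_000.0], [120_000.0]])

# Normalization: rescale to the [0, 1] range.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: rescale to mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(X).ravel())
```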
4. Removing Duplicates
Identify and drop duplicate records so that repeated entries do not inflate counts or bias the analysis.
5. Text Cleaning
For textual data, several preprocessing steps prepare the text for analysis (a small sketch follows this list):
· Tokenization: Break the text into smaller units, such as words or sentences.
Tokenization aids in text organization for additional processing.
· Removing special characters: Eliminate special characters, symbols, and non-
alphanumeric characters that don’t contribute to the meaning of the text.
· Removing Stop Words: Stop words are commonly occurring words in a language
that do not carry significant meaning. Eliminate common and non-informative
words (e.g., “and,” "the," and “is”) that do not provide meaningful insights for
analysis. Removing stop words helps reduce the dimensionality of the text data
and focus on the more informative terms.
· Lemmatization and Stemming: Reduce words to their root or base form
(stemming) or transform them to their dictionary form (lemmatization). This helps
standardize variations of words.
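The sketch below applies these steps with plain Python plus NLTK's Porter stemmer; the stop-word set is a tiny illustrative subset rather than a full list.

```python
import re
from nltk.stem import PorterStemmer   # the stemmer ships with NLTK, no extra download needed

text = "The invoices were processed and the payments are being reconciled!"

# Tokenization: split the text into lowercase word tokens (special characters are dropped).
tokens = re.findall(r"[a-z]+", text.lower())

# Remove stop words: common words that carry little meaning for the analysis.
stop_words = {"the", "and", "are", "is", "were", "being"}
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce each word to its root form.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['invoic', 'process', 'payment', 'reconcil']
```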
6. Handling Inconsistent Data
Inconsistencies in data formatting or values can hinder analysis. To make sure the
data is accurate, dependable, and appropriate for analysis or other uses, it is
necessary to handle inconsistent data. Data input errors, differing information
formats, divergent standards, or the integration of data from numerous sources
are just a few causes of inconsistent data. Here are steps to handle inconsistent
data:
· Identify Inconsistencies: Begin by thoroughly examining the dataset to identify
inconsistencies, including discrepancies, errors, or anomalies in the data.
· Data Formatting: Standardize date formats, capitalization, and other conventions
to maintain uniformity. Ensure that data follows consistent formats for dates,
times, currency, units, and other relevant formats. Convert inconsistent data to
the desired format.
· Data Validation Rules: Define and apply validation rules so that the data adheres to predefined patterns or constraints, such as a range of valid values or required fields.
7. Feature Engineering and Selection
· Feature Selection: Choose the most relevant features to reduce noise and improve model performance. Compute correlations between each feature and the target variable to find the most influential features, or use statistical tests and information gain to rank features by their relevance to the target variable. Utilize methods like principal component analysis (PCA) or singular value decomposition (SVD) to reduce the dimensionality of the feature space while preserving the most important information.
8. Addressing Class Imbalance (for Classification Tasks)
Class imbalance in a classification task occurs when the number of instances in
each class significantly differs, potentially leading to biased models that favor the
majority class. Addressing this imbalance is crucial to creating models that can
make accurate predictions for all classes. For datasets that are unbalanced, with
one class greatly outnumbering another:
· Oversampling: Duplication or synthetic generation can be used to increase the
proportion of occurrences in the minority class. To balance the distribution of
classes, duplicate instances from the minority class at random. Create artificial
data points by using methods such as SMOTE (Synthetic Minority Over-sampling Technique); a minimal sketch of random oversampling and class weighting follows this list.
· Undersampling: To establish balance, lower the number of instances in the
majority class. To restore balance to the class distribution, arbitrarily eliminate
instances from the dominant class. Be cautious not to remove too much
information, which can lead to underrepresentation of the majority class.
· Class Weighting: Assign higher weights to the minority class during model
training to give it more importance. Many classification algorithms allow for class
weights to be specified, which can help mitigate the impact of class imbalance.
· Ensemble Techniques: Use ensemble methods like bagging (e.g., Random
Forest) or boosting (e.g., AdaBoost, XGBoost) that can handle class imbalance by
combining predictions from multiple weak learners.
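As a minimal sketch (one of several possible approaches), the code below shows naive random oversampling with pandas and, alternatively, weighting the minority class via scikit-learn's class_weight option; the data is invented.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced dataset: 6 "normal" rows (label 0) vs. 2 "fraud" rows (label 1).
df = pd.DataFrame({
    "amount": [10, 12, 9, 11, 10, 13, 500, 480],
    "label":  [0, 0, 0, 0, 0, 0, 1, 1],
})

# Random oversampling: duplicate minority rows until the classes are balanced.
minority = df[df.label == 1]
majority = df[df.label == 0]
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
print(oversampled.label.value_counts())

# Alternative: keep the data as-is and give the minority class a higher weight.
clf = LogisticRegression(class_weight="balanced").fit(
    df[["amount"]].values, df["label"].values
)
print(clf.predict([[490]]))   # likely predicted as the minority class
```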
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=mwEPXevpqls&pp=ygUvZGF0YSBjbGVhbmluZyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
Numerical-Numerical Relationships
When dealing with numerical data, several visualization techniques help to
uncover relationships between variables.
i) Scatter Plot: Scatter plots are useful for visualizing the relationship between
two numerical variables. They can be enhanced using additional parameters such
as hue, size, and style to represent other dimensions.
ii) Pair Plot: Pair plots provide a matrix of scatter plots for each pair of numerical
variables, allowing for a comprehensive view of their relationships. Including
a hue parameter can add a categorical dimension to the visualization.
iii) Line Plot: Line plots are particularly useful for time-series data, showing
trends over time. By grouping data by time intervals, such as years, and
summarizing it, we can observe patterns and changes.
Categorical-Categorical Relationships
Understanding relationships between categorical variables can be achieved
through several techniques:
i) Crosstab: A crosstabulation (or crosstab) table displays the frequency
distribution of variables, providing a straightforward way to observe the
interaction between categorical variables.
ii) Heatmap: Heatmaps visually represent crosstab data, using color to indicate the magnitude of values. This method highlights patterns and correlations effectively (see the sketch after this list).
iii) Cluster Map: Cluster maps extend heatmaps by applying clustering
algorithms to the rows and columns, revealing deeper structures and groupings in
the data.
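A minimal pandas/seaborn sketch of a crosstab rendered as a heatmap; the categorical data is invented.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR", "Sales", "HR", "IT"],
    "status":     ["Full-time", "Part-time", "Full-time", "Full-time",
                   "Part-time", "Full-time", "Full-time", "Part-time"],
})

# Crosstab: frequency of each department/status combination.
table = pd.crosstab(df["department"], df["status"])
print(table)

# Heatmap: colour encodes the counts, making the pattern easy to scan.
sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
plt.show()
```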
Numerical-Categorical Relationships
When analysing the relationship between numerical and categorical variables, the
following visualization techniques are particularly useful:
i) Bar Plot (with Confidence Intervals): Bar plots summarize the central
tendency of a numerical variable for each category of a categorical variable, often
including confidence intervals to indicate variability.
ii) Box Plot: Box plots display the distribution of a numerical variable across
different categories, highlighting the median, quartiles, and outliers.
iii) Dist plot: Dist plots (or distribution plots) are useful for comparing the
distributions of a numerical variable across different categories. They combine
histograms and kernel density plots to provide a comprehensive view.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=AmNqUu_e4nQ
1. Recursive Pattern Technique:
Data values are displayed in a grid using recursive subdivision. The aim is to
group related data hierarchically, revealing patterns within sub-regions.
o The screen space is recursively divided into rectangular sub-regions.
o Each sub-region corresponds to a subset of data, organized hierarchically.
o Pixels within a sub-region are colored based on their data value.
Example
o Visualizing sales data by geographical region:
Top-level divisions represent countries.
Sub-regions represent states, cities, and stores.
o Patterns in sales performance can be identified across different levels.
2. Pixel Bar Charts
Combines traditional bar charts with pixel-based data representation. Each bar is
made up of pixels, where each pixel represents a data point.
o Bars are drawn for each category or time period.
o Each bar is filled with a sequence of pixels.
o The color of each pixel represents its data value.
•Example:
o Visualizing daily sales over months:
Each bar represents a month.
Pixels in the bar represent daily sales, with colours showing sales
volume.
3. Circle Segment Technique
A circular visualization where data is displayed in concentric segments. Each
segment represents a category, and pixels within segments encode data values.
o The circle is divided into segments like a pie chart.
o Each segment contains rows of pixels.
o The pixel colour represents the data value.
•Example:
o Visualizing air quality data:
Segments represent different cities.
Rows within each segment represent hourly readings.
4. Temporal Pixel Maps
Designed for time-series data, where pixels are arranged in chronological order to
reveal trends over time.
o Data values are arranged line-by-line (row-wise) or column-by-column.
o Pixels are coloured based on their data values.
o The chronological arrangement highlights trends and anomalies.
•Example:
o Visualizing temperature over a year:
Rows represent days, and columns represent hours.
Colour intensity indicates temperature.
Geometric Projection Visualization Techniques
Geometric projection techniques are methods for visualizing high-dimensional
data by projecting it onto lower-dimensional spaces (e.g., 2D or 3D). These
techniques help users identify patterns, clusters, and relationships in complex
datasets while preserving as much of the original structure as possible.
Types:
1. Scatterplots
o A scatterplot represents data points in a 2D or 3D space using Cartesian
coordinates.
o Each axis corresponds to a feature (variable) in the dataset.
o Scatterplots are ideal for datasets with up to three dimensions.
Example:
o Visualizing customer segmentation:
X-axis: Age, Y-axis: Income, Colour: Spending behaviour.
o This helps in identifying clusters of customers.
2. Principal Component Analysis (PCA)
o PCA is a linear dimensionality reduction technique that transforms high-
dimensional data into a lower-dimensional space while retaining the most
variance in the data.
Example:
o Genome Data Analysis:
PCA can reduce thousands of gene expression dimensions into 2-3 components, making visualization easier (a minimal sketch follows).
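A minimal PCA sketch with scikit-learn; the four-dimensional random data stands in for a much higher-dimensional dataset such as gene expression measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # 100 samples, 4 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature largely redundant

# Project onto the 2 directions that retain the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                       # (100, 2) -> ready for a 2-D scatter plot
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```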
1. Star Glyphs
Star glyphs are radial visualizations where each data attribute is represented as
an axis radiating from a central point. The value of the attribute determines the
length of the axis.
o The number of axes corresponds to the number of attributes.
o Connecting the endpoints of the axes forms a polygon (the "star").
o The shape and size of the star provide a visual summary of the data.
Example:
•Student Performance Analysis:
o Attributes: Test scores in subjects like math, science, and English.
o Visualization: Each student gets a star glyph, showing strengths and
weaknesses in different subjects.
2. Chernoff Faces
Chernoff faces represent data attributes using features of a human face, such as
eyes, mouth, and nose. Each feature corresponds to an attribute, with variations
encoding data values.
o Faces are highly recognizable to humans, enabling quick pattern detection.
o Changes in facial features are intuitive and noticeable.
Example:
•Customer Satisfaction Analysis:
o Attributes: Likelihood of recommending, overall satisfaction, and frequency
of use.
o Visualization: Customers with high satisfaction might have larger eyes and
smiling mouths, while dissatisfied customers may have smaller eyes and
frowning mouths.
3. Stick Figures
Stick figures are minimalistic representations of data using simple "stick" icons.
Each limb or body part corresponds to a data attribute, and its orientation, length,
or angle represents the attribute's value.
o Stick figures are compact and can encode multiple attributes
simultaneously.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=f79bJTZSAqc
1. Tree Diagrams
•Vertical Layout:
o Commonly used in organizational charts.
o The root is at the top, and branches flow downward.
•Horizontal Layout:
o Suitable for decision trees or processes.
o The root node is on the left, and branches flow horizontally.
•Radial Layout:
o Nodes radiate outward from a central root node.
o Often used in biological taxonomies.
Tree diagrams are ideal for scenarios where the relationships between nodes are
of primary importance, such as representing family trees or explaining logical
processes like decision-making.
2. Treemaps
Treemaps use nested rectangles to represent hierarchical data. The entire
visualization space is divided into rectangles, and each rectangle represents a
node in the hierarchy. The size of each rectangle corresponds to a quantitative
attribute of the node, such as its value or importance.
•Nested Structure:
o Parent nodes are larger rectangles containing smaller rectangles for their
child nodes.
•Color Coding:
o Different colors are often used to represent additional attributes like
categories or performance metrics.
•Proportional Areas:
o The area of each rectangle directly reflects the value of the node, making it
easy to compare sizes.
Treemaps are widely used in financial analysis, where they can show portfolio
performance or sales data by dividing sectors into sub-sectors.
3. Circle Packing
Circle packing visualizes hierarchical data using nested circles. The root node is
the largest circle, and it contains smaller circles representing child nodes. This
method emphasizes containment and the relative sizes of nodes.
•Parent-Child Containment:
o Parent nodes are visually represented by the largest circles, with their child
nodes nested inside them.
•Proportional Sizes:
o The size of each circle corresponds to the importance or value of the node.
•Compact Representation:
o Circle packing makes efficient use of space while providing an aesthetically
pleasing representation.
This technique is often used in environmental studies to represent ecosystems,
where larger circles might represent broader ecological categories, and smaller
circles represent species.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=GFJF1s6hL6s&pp=ygUrSGllcmFyY2hpY2FsIFZpc3VhbGl6YXRpb24gaW4gZW5nbGlzaCBucHRlbA%3D%3D
Colour harmony: This refers to the way that colours are arranged together to
create a pleasing effect. There are a number of different colour harmonies that
can be used in visualization, such as complementary colours, analogous colours,
and triadic colours.
Colour contrast: This refers to the difference between two or more colours. Colour
contrast can be used to create visual interest and to highlight important data
points.
Colour symbolism: This refers to the way that colours are associated with different
meanings. For example, red is often associated with passion, blue is often
associated with calmness, and green is often associated with nature.
Colour theory can be a complex subject, but it is an important one for anyone
who wants to create effective visualizations. By understanding the principles of
colour theory, you can use colour to communicate your message more effectively
and to create visualizations that are more visually appealing.
Applications of Colour Theory
Data Types and Visual Variables
1. Data Types
Data types in visualization determine how data is represented and analysed. They guide the selection
of appropriate charts, graphs, and visual encodings.
Types of Data
3.Ordinal Data:
o Represents ordered categories.
o Examples: Survey rankings (e.g., poor, average, excellent).
o Visualizations: Ordered bar charts, stacked bar charts.
4. Nominal Data:
o Represents non-ordered categories.
o Examples: Colors, brands, countries.
o Visualizations: Pie charts, categorical scatter plots.
5. Time-Series Data:
o Represents data points over time.
o Examples: Stock prices, temperature over days.
o Visualizations: Line charts, area charts.
6. Geospatial Data:
o Represents data tied to geographic locations.
o Examples: City population, rainfall in regions.
o Visualizations: Maps, choropleth maps.
7. Multivariate Data:
o Represents multiple variables simultaneously.
o Examples: Age, income, and education levels of individuals.
o Visualizations: Bubble charts, parallel coordinate plots
2. Visual Variables
Visual variables are the building blocks of data visualization. They are the attributes of graphical
elements that encode data and make patterns, relationships, or trends observable.
Types of Visual Variables
1. Position:
o Placement of data points on axes.
o Example: A scatter plot uses position to represent two variables on the x
and y axes.
2. Size:
o Represents magnitude through the size of an object.
o Example: Bubble charts use the size of bubbles to indicate a quantitative
variable.
3. Shape:
o Distinguishes between categories using different shapes.
o Example: A scatter plot uses circles, squares, or triangles for different
groups.
4. Color (Hue):
o Represents categories or intensity.
o Example: A heatmap uses gradients to show intensity; different hues distinguish separate categories.
5. Brightness (Value):
o Represents magnitude by varying lightness or darkness.
o Example: A choropleth map uses darker shades for higher values.
6. Texture/Pattern:
o Differentiates areas or categories using patterns.
o Example: Bar charts use striped or dotted patterns to show subcategories.
7. Orientation:
o Uses angles or directions to encode data.
o Example: Arrows in vector field visualizations show wind direction.
8. Length:
o Represents quantitative differences by varying line lengths.
o Example: Bar charts use the length of bars to show values.
9. Width:
o Uses thickness to encode quantitative values.
o Example: Line graphs with varying thickness for different variables.
10. Enclosure:
o Groups related data points by enclosing them in shapes.
o Example: Venn diagrams.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=YeI6Wqn4I78&pp=ygUXQ29sb3IgVGhlb3J5IGluIGVuZ2xpc2g%3D
2. Stacked Bar Chart
A stacked bar chart is a bar graph with multiple data series stacked end-to-end, with the far-right end of the bar representing the total of all the components in the bar. The x-axis represents quantitative data, while the y-axis represents categorical data. Stacked bar graphs are used to show how a whole is divided into its various parts and the relative effect each part has on the total, i.e., a part-to-whole relationship. Each data series takes a different shade or color, explained using a legend.
3. 100% Stacked Bar Chart
This variation of the stacked bar chart plots the percent of the values instead of
the actual values. The total of each stacked bar always equals 100%.
4. Column Charts (Vertical Bar graphs)
Column charts or column graphs are bar charts with vertical bars. In column charts, the categories are placed along the x-axis, and the height of each bar along the y-axis denotes its value.
5. Stacked Column Charts
Like stacked bar graphs, stacked column charts have each bar representing the
whole and each segment denotes the various parts of the whole. The y-axis
represents quantitative data while the x-axis categorical data.
6. 100% Stacked Column Graph
This variation of the stacked column chart plots the percent of the values instead of the actual values. The total of each stacked column always equals 100%.
7. Grouped Bar Charts and Grouped Column Charts
A grouped bar/column chart (clustered bar/column graph) is another variation of
the bar/column chart that compares different categories of two or more groups.
The categories are grouped and arranged side-by-side making interpretation easy
inside the groups and even between the same categories. They are useful in
making comparisons across different categories of data.
Maps in Data Visualization:
Map visualization is used to analyse and display geographically related data in the form of maps. This kind of expression is clearer and more intuitive: we can visually see the distribution or proportion of data in each region, which makes it easier to mine deeper information and make better decisions.
1. Point Map
Point maps are straightforward and especially suited to data with a wide geographic distribution. For example, a company with many sites may want to view data for each specific location in a given area; this is hard to do accurately with a general-purpose map, whereas a point map provides precise and fast positioning.
2. Line Map
Line maps are used less often because they are relatively difficult to draw. However, a line map can encode not only space but also time, which makes it particularly valuable for analysing certain scenarios.
3. Regional Map
A regional map is also called a filled map. It can be displayed by country, province, city, district, or even custom regions, and the data size of each region can be judged from the shade of its colour.
4. Flow Map
Flow maps are often used to visualize origin-destination flow data. The origin and
destination can be points or surfaces. The interaction data between the origin and
the destination is usually expressed by a line that connects the geometric centres of gravity of the spatial units. The width or colour of the line indicates the magnitude of the flow between the origin and the destination. Each spatial location can be either an origin or a destination.
5. Heatmap
The heatmap is used to indicate the weight of each point in a geographical range, usually displayed with highlighted colours. A haze (air-quality) map is a typical example: the darker the colour of an area, the worse its air quality.
Trees in Data Visualization
In data visualization, trees represent hierarchical or nested data structures. They
use nodes and edges to display relationships between items, typically illustrating
parent-child or ancestor-descendant relationships. Tree-based visualizations make
complex data easier to explore and interpret.
Networks in Data Visualization
Components of a Network
1.Nodes:
o Represent entities, objects, or data points.
o Example: People in a social network, servers in a computer network.
2.Edges:
o Represent relationships, interactions, or connections between nodes.
o Can be directed (with an arrow indicating direction) or undirected (no
direction).
3.Weights:
o Represent the strength or magnitude of a connection.
o Example: Friendship strength in social networks, distance in transportation
networks.
4.Clusters (Communities):
o Groups of nodes with denser connections within the group compared to
outside.
5.Attributes:
o Additional information attached to nodes or edges, such as labels, categories, or metrics.
Types of Network Visualizations
1.Node-Link Diagrams:
o Use points for nodes and lines for edges.
o Example: Social networks, communication networks.
2.Adjacency Matrices:
o Use a grid format to represent connections between nodes.
o Example: Flight route maps, co-occurrence matrices.
3.Force-Directed Layouts:
o Arrange nodes based on attractive and repulsive forces to avoid overlap
and highlight relationships.
o Example: Relationship maps, knowledge graphs.
4.Hierarchical Networks:
o Represent data with a clear hierarchy or flow.
o Example: Organization charts, family trees.
5.Geospatial Networks:
o Overlay network connections on maps.
o Example: Road networks, internet infrastructure.
6.Dynamic Networks:
o Represent networks that evolve over time.
o Example: Social media interaction networks over days or weeks.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=xVWKPSIBDIQ&pp=ygUyY2hhcnRzIGFuZCBwbG90cyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
1. Manual Data Collection
Manual data collection involves human intervention to gather information, often
through direct interaction with subjects or environments. This method is typically
used when qualitative insights or specific information from individuals are
required.
•Interviews:
o Data is gathered through one-on-one or group interactions where
participants provide in-depth responses to questions.
o Advantages: Provides rich, detailed data; enables clarification and follow-
up questions; useful for sensitive topics.
o Disadvantages: Time-consuming, interviewer bias, and limited scalability
(i.e., difficult to reach a large sample size).
•Observations:
o Collecting data by observing subjects in a natural or controlled
environment. This method can be either participatory (where the observer
is involved in the activities) or non-participatory (where the observer is a
passive observer).
o Advantages: Useful for collecting behavioral data, often provides real-time
insights.
o Disadvantages: Subject to observer bias, time-consuming, and can lack
generalizability.
•Fieldwork:
o This method involves collecting data in natural settings, often used in social
sciences, anthropology, and environmental studies.
o Advantages: High ecological validity (real-world context), flexible and
adaptable.
o Disadvantages: Resource-intensive, time-consuming, and researcher bias.
2. Automated Data Collection
Automated data collection involves the use of technology (sensors, devices, or
software) to gather data with minimal human intervention. This method is highly
efficient for large-scale, real-time, and quantitative data collection.
Methods of Automated Data Collection:
•Web Scraping:
o Web scraping uses automated scripts or tools to extract data from
websites. It can be used to collect data like product prices, social media
mentions, or news articles.
o Advantages: Large-scale data collection, cost-effective compared to
manual methods.
o Disadvantages: Legal and ethical issues (e.g., violating terms of service),
data may be unstructured or inconsistent.
•Application Programming Interfaces (APIs):
o APIs allow software systems to request data from other systems. For
example, social media platforms like Twitter or Google provide APIs to
access their data programmatically.
o Advantages: Easy to integrate into systems, real-time data, can be
automated.
o Disadvantages: Limited by the API’s rate limits, may require technical
expertise, data availability depends on third-party services.
•Data Logging:
o Automated systems can log data over time, often used for scientific
experiments, industrial systems, or machine monitoring.
o Advantages: Continuous, real-time data collection, accurate and
consistent.
o Disadvantages: Data can be voluminous, requiring significant storage and
processing resources.
3. Remote Sensing
Remote sensing refers to the acquisition of data about an object or area from a
distance, typically using satellites or aerial sensors. This method is primarily used
for geospatial data collection in fields like meteorology, agriculture, and
environmental monitoring.
Methods of Remote Sensing:
•Satellite Imaging:
o Satellites equipped with cameras or sensors capture high-resolution images
and data of the Earth’s surface. This includes visual imagery, infrared,
radar, and other spectral data.
o Advantages: Covers large areas, real-time or near-real-time data, non-
invasive, useful for environmental and geospatial studies.
o Disadvantages: High cost, limited resolution for some sensors, and
affected by weather conditions.
•Aerial Surveys (Drones):
o Drones or UAVs (Unmanned Aerial Vehicles) can capture high-resolution
images, videos, and other data types from the air. Drones are often used
for mapping, monitoring crop health, and inspecting infrastructure.
o Advantages: High resolution, flexible, can be deployed quickly, relatively
low cost.
o Disadvantages: Limited by flight time and range, may require regulatory
approval, can be impacted by weather.
•LiDAR (Light Detection and Ranging):
o A remote sensing technology that uses laser pulses to measure distances
to the Earth’s surface, creating precise 3D maps of landscapes.
o Advantages: Highly accurate, can penetrate vegetation, and create
detailed topographical models.
o Disadvantages: Expensive equipment, requires skilled operators, and
large amounts of data processing.
•Hyperspectral Imaging:
o This technique captures a broad range of spectral bands beyond visible
light to detect materials, environmental conditions, and more.
o Advantages: Detects materials and conditions invisible to the human eye,
detailed data.
o Disadvantages: Complex data analysis, expensive, and large data storage
requirements.
4. Transactional Data Collection
Transactional data is generated through exchanges between systems or users.
This data typically includes records of actions, transactions, or communications
and is often stored in digital databases.
Methods of Transactional Data Collection:
•Online Transactions:
o Data gathered from digital purchases or activities on websites and apps,
such as e-commerce platforms, banking, or gaming.
o Advantages: Real-time data collection, large data volumes, easy to
analyse.
o Disadvantages: Privacy issues, data security concerns.
•Sensor Data:
o Devices or sensors that track actions, such as RFID tags in logistics or sensors in smart devices, gather transactional data based on movements and interactions.
5. Data Mining
Data mining refers to the process of discovering patterns, relationships, or trends
in large datasets, often using machine learning or statistical techniques. This
method is widely used for uncovering hidden insights from transactional or large-
scale data.
Methods of Data Mining:
•Clustering:
o Grouping similar data points together. For example, clustering customer
data based on purchasing behavior.
o Advantages: Helps identify natural groupings in data, such as customer
segments.
o Disadvantages: Results depend on the algorithm used, may not always
find meaningful groups.
•Classification:
o Categorizing data points into predefined classes. For example, classifying
emails as spam or not spam.
o Advantages: Useful for predictive modeling, easy to interpret.
o Disadvantages: Requires labeled data, sensitive to data quality.
•Association Rule Mining:
o Discovering interesting relations or patterns between variables, such as
items frequently bought together in a supermarket.
o Advantages: Uncovers hidden relationships, useful for marketing and
recommendation systems.
o Disadvantages: May produce irrelevant or trivial rules.
•Anomaly Detection:
o Identifying outliers or unusual patterns in data, often used for fraud
detection or quality control.
o Advantages: Effective for finding fraud or errors.
o Disadvantages: May generate false positives or miss subtle anomalies.
Classification of Information Sources
Information sources refer to the origins from which data is gathered.
Understanding the classification of information sources is critical for selecting the
appropriate data for research, analysis, or decision-making. Information sources
are typically classified based on their origin, whether the data is collected
firsthand or second-hand, the format of the data, or the method of access. Below
is a detailed explanation of the classification of information sources:
Examples of Primary Data Sources:
•Surveys and Questionnaires: Data is collected through structured questions
either in person, via phone, or online. These can include closed-ended
(quantitative) or open-ended (qualitative) questions.
o Example: Market research survey for consumer preferences.
•Interviews: Data collected through one-on-one or group interactions where
participants respond to open or structured questions.
o Example: Expert interviews for academic research on healthcare systems.
•Experiments: Data gathered from controlled experiments where researchers
manipulate variables and measure outcomes. This is common in scientific,
medical, and social research.
o Example: Clinical trial data on the efficacy of a new drug.
•Observations: Data collected through direct observation of subjects or
phenomena. This is often used in natural sciences or social sciences where human
behaviour is studied.
o Example: Observing customer behaviour in a retail store.
•Case Studies: Data obtained from in-depth analysis of a particular individual,
group, event, or community.
o Example: A study on the impact of a new educational policy in a school.
•Fieldwork: Data collected in natural settings outside of a laboratory or
controlled environment. Common in anthropology, sociology, and environmental
sciences.
o Example: Anthropological research in remote villages or ecosystems.
2. Secondary Data Sources
Secondary data is data that has been collected by someone else for a purpose
other than the current research or analysis. This type of data is often used when
primary data is not available, too costly, or unnecessary for the specific analysis.
Examples of Secondary Data Sources:
•Government Reports and Statistics: Data published by government agencies
on various sectors like economics, health, education, and more.
o Example: U.S. Census data or national health surveys.
•Academic Research Papers: Published studies and academic articles that
provide insights, data, and findings on a particular topic.
o Example: Research articles in medical journals or social science papers.
•Industry Reports: Data collected and published by research firms or industry
experts.
o Example: Annual market research reports from firms like Nielsen or
McKinsey.
•Public Databases: Data repositories maintained by organizations, universities,
or governments that provide access to research datasets or public information.
o Example: World Bank Data, Google Scholar, or clinical trial databases.
•Books and Textbooks: Published books that may contain valuable data, case
studies, or historical data relevant to a field of study.
o Example: Textbooks with aggregated educational data, research
summaries, or historical data analyses.
•Media Sources: News articles, reports, and online publications that provide
secondary data on current events, trends, and societal changes.
o Example: News reports or magazine articles analyzing the impacts of
climate change.
3. Tertiary Data Sources
Tertiary sources are those that summarize, index, or compile data from primary
and secondary sources. These sources offer a high-level overview and are useful
for getting general information or an introduction to a topic. They do not contain
raw data themselves but point to other sources.
Characteristics of Tertiary Data:
•Summarizes or compiles information from primary and secondary sources.
•Provides high-level overviews or references rather than detailed analysis.
•Often used for general information or quick reference.
Examples of Tertiary Data Sources:
•Encyclopedias: Provide summaries of topics, often with references to primary
and secondary sources for more in-depth information.
o Example: Britannica Online or Wikipedia.
•Dictionaries and Thesauruses: Provide definitions or synonyms for terms and
concepts.
o Example: Merriam-Webster Dictionary.
•Bibliographies and Indexes: Lists or indexes of other sources (books, journals,
articles) that are relevant to a particular subject or field.
o Example: Indexes in academic journals that list relevant articles by subject.
•Almanacs: Provide factual data, statistics, or summaries of information for a
specific year, often compiled from other sources.
o Example: World Almanac, which provides data on global statistics and events.
4. Open Data Sources
Open data refers to data that is freely available to the public and can be used,
modified, and shared without restriction. Open data sources are usually
government, non-profit, or academic repositories that provide data for research,
policy-making, and transparency.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=nX_Xp2hVc0s&pp=ygUkQWNxdWlzaXRpb24gb2YgRGF0YSBpbiBlbmdsaXNoIG5wdGVs
1. Challenges in Data Storage and Retrieval
1.2 Retrieval Challenges
•Query Performance:
o Slow query execution due to inefficient indexing or large data volumes.
•Complex Queries:
o Handling complex queries involving multiple joins or aggregations.
•Concurrency:
o Managing simultaneous data access requests without conflicts or
bottlenecks.
•Heterogeneous Data:
o Retrieving and integrating data from different formats and sources.
Solutions:
•Applications:
o Real-Time Systems: Used in industries where milliseconds matter, such
as financial trading platforms, where orders need to be executed instantly.
o Data Analytics: Enables real-time dashboards and analytics by providing
immediate access to large datasets.
o Gaming: Powers multiplayer online games where player actions and
updates must be synchronized instantly.
•Examples:
o Redis: A widely used key-value store database that excels in speed and
supports complex data structures like lists, sets, and sorted sets.
o SAP HANA: Integrates in-memory storage with advanced analytics
capabilities for enterprise applications.
2. Retrieval and Query Languages
Retrieving data efficiently is a core function of database systems. Query
languages, particularly SQL, play a vital role in accessing, manipulating, and
analyzing data. The performance of data retrieval depends significantly on how
queries are written, optimized, and executed.
2.1 SQL (Structured Query Language)
SQL is the most commonly used query language for relational databases. It
provides a standardized way to communicate with databases.
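As a minimal illustration of issuing SQL from a program, the sketch below uses Python's built-in sqlite3 module; the employees table and its columns are hypothetical examples, not part of the course material.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# A hypothetical employees table used only for illustration.
cur.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "HR"), (3, "Meera", "Sales")],
)

# A standard SQL query: select and filter rows.
cur.execute("SELECT name FROM employees WHERE dept = ?", ("Sales",))
print(cur.fetchall())  # -> [('Asha',), ('Meera',)]
conn.close()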
•Indexing:
o Speeds up data retrieval by maintaining a data structure that maps column
values to the rows that contain them.
Example: Indexing the "employee_id" column in an employee table allows
quick lookups (see the sketch after this list).
o Types of Indexes:
B-Tree Index: Efficient for range queries and equality searches.
Hash Index: Optimized for exact matches.
•Partitioning:
o Divides large datasets into smaller, manageable parts based on specific
criteria (e.g., date, region).
o Enables faster retrieval by narrowing down the search scope to a specific
partition.
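The sketch below, again using SQLite for self-containment, indexes the "employee_id" column mentioned above and inspects the query plan before and after; SQLite builds a B-tree index by default.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (employee_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [(i, f"emp{i}") for i in range(10_000)])

def plan(sql):
    # Ask SQLite how it would execute the query.
    return cur.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

query = "SELECT name FROM employee WHERE employee_id = 4321"
print(plan(query))  # full table SCAN before the index exists

cur.execute("CREATE INDEX idx_emp_id ON employee(employee_id)")  # B-tree index
print(plan(query))  # now a SEARCH using idx_emp_id
conn.close()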
2.2 Advanced Querying Mechanisms
•Query Optimization:
o Query optimizers in DBMS automatically choose the most efficient way to
execute a query, considering factors like table size, indexes, and join
strategies.
o Example: Rewriting a query to use indexed columns instead of scanning the
entire table.
•NoSQL Query Languages:
o Designed for unstructured and semi-structured data stored in NoSQL
databases.
•Graph Query Languages:
o Used in graph databases like Neo4j, which store data as nodes and edges.
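As a hedged illustration of a NoSQL query language, the sketch below runs a document-store filter with the pymongo driver; it assumes a local MongoDB server, and the collection and field names are invented for the example.

from pymongo import MongoClient

# Assumes a MongoDB server on localhost (an assumption made for this sketch).
client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]

orders.insert_many([
    {"customer": "alice", "total": 120, "region": "north"},
    {"customer": "bob", "total": 45, "region": "south"},
])

# Query by example: JSON-like filters instead of SQL.
for doc in orders.find({"total": {"$gt": 100}}):
    print(doc["customer"], doc["total"])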
3. Integration of Solutions
Combining in-memory storage with advanced retrieval methods ensures that
modern applications can handle both high performance and complex querying
needs:
•Hybrid Systems:
o Many DBMSs allow for hybrid setups where frequently accessed data is
stored in memory while less critical data resides on disk.
o Example: MySQL's InnoDB engine uses a buffer pool to store frequently
accessed pages in memory.
•Caching:
o Frequently queried data is cached in memory to reduce retrieval times.
Tools like Redis and Memcached act as in-memory layers above
traditional databases.
o Example: Storing results of a product catalog query in memory for quicker
responses to user searches.
•Streamlining Analytics:
o Combining in-memory storage with SQL-based analytical tools allows real-
time insights, crucial for decision-making in businesses.
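The caching point above can be sketched as the classic cache-aside pattern. In the minimal example below a plain Python dict stands in for Redis or Memcached so the code runs without an external server; the product data is invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'keyboard'), (2, 'monitor')")

cache = {}  # in production this would be Redis or Memcached

def get_product(product_id):
    if product_id in cache:      # cache hit: no database access at all
        return cache[product_id]
    row = conn.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    cache[product_id] = row[0]   # populate the cache for next time
    return row[0]

print(get_product(1))  # miss -> reads from the database
print(get_product(1))  # hit  -> served from memory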
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=8suRVB5h5-w&pp=ygUaRGF0YWJhc2UgSXNzdWVzIGluIGVuZ2xpc2g%3D
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
Linear regression models the relationship between the dependent variable Y and
one or more independent variables X. It assumes a linear relationship:
Y = β0 + β1X + ϵ
o Y: Dependent variable (output).
o X: Independent variable (input).
o β0: Intercept of the line.
o β1: Slope of the line.
o ϵ: Error term (captures the noise).
•Applications:
o Predicting housing prices based on square footage, location, and amenities.
o Estimating annual revenue for a company.
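A small worked sketch of the model above, using NumPy's least-squares fit; the square-footage and price figures are invented purely for illustration.

import numpy as np

X = np.array([800, 1000, 1200, 1500, 1800])  # square footage
Y = np.array([120, 150, 170, 210, 240])      # price in thousands

# Fit Y = b0 + b1*X by ordinary least squares (polyfit returns [slope, intercept]).
b1, b0 = np.polyfit(X, Y, deg=1)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.3f}")

# Predict the price of a hypothetical 1,300 sq-ft house.
print("predicted:", b0 + b1 * 1300)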
1.2 Polynomial Regression
Polynomial regression is an extension of linear regression for capturing non-linear
relationships.
The model fits a polynomial equation:
Y = β0 + β1X + β2X² + … + βnXⁿ + ϵ
•Applications:
o Modelling growth trends, such as population over time.
o Predicting temperature changes in weather forecasting.
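A corresponding sketch for a degree-2 polynomial fit; the population figures are illustrative, not real data.

import numpy as np

years = np.array([0, 1, 2, 3, 4, 5])
population = np.array([10.0, 10.8, 12.1, 14.0, 16.6, 19.8])  # in millions

coeffs = np.polyfit(years, population, deg=2)  # returns [b2, b1, b0]
model = np.poly1d(coeffs)

print("fitted polynomial:\n", model)
print("forecast for year 6:", model(6))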
2. Discontinuous Variables
Discontinuous variables, also known as categorical variables, are variables that
take on distinct, separate values. They may represent categories, groups, or
binary states. Examples include gender, job type, or product categories.
Prediction of Discontinuous Variables
Predicting discontinuous variables is referred to as classification when the output
is purely categorical, and as ordinal regression when the categories have a natural order.
2.1 Logistic Regression
Logistic regression predicts the probability of a binary outcome.
The model transforms a linear equation using the sigmoid function:
P(Y = 1 | X) = 1 / (1 + e^−(β0 + β1X))
o Outputs probabilities that are mapped to categories (e.g., 0 or 1).
•Applications:
o Predicting customer churn (Yes/No).
o Classifying tumours as benign or malignant.
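A minimal sketch of logistic regression for the churn example, using scikit-learn; the toy feature (months since last purchase) and the labels are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature: months since last purchase; label: 1 = churned, 0 = stayed.
X = np.array([[1], [2], [3], [6], [8], [10], [12], [15]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The sigmoid maps the linear score β0 + β1X to a probability in (0, 1).
print(model.predict_proba([[4]]))  # [P(stay), P(churn)] for 4 months
print(model.predict([[14]]))       # predicted class label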
2.2 Decision Trees
Decision trees predict a categorical outcome by repeatedly splitting the data on feature values:
o Nodes represent feature-based decisions.
o Leaves represent class labels.
o Splitting is based on measures like Gini Index or Information Gain.
•Applications:
o Predicting loan approval.
o Diagnosing diseases based on symptoms.
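A minimal decision-tree sketch for the loan-approval example, using scikit-learn; the features, labels, and threshold values are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [annual income, number of existing loans]; label: 1 = approve.
X = [[12, 0], [5, 2], [20, 1], [3, 3], [15, 0], [7, 1]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

print(export_text(tree, feature_names=["income", "loans"]))  # the learned splits
print(tree.predict([[10, 1]]))  # predicted class for a new applicant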
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=fWOV6n9nv7c
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
One of the key benefits of data visualization is its ability to simplify large
datasets. Instead of trying to interpret rows upon rows of numbers or text,
visualizations provide a clear and concise summary that can be understood at a
glance. This not only saves time but also reduces the risk of errors or
misunderstandings.
1. Bar Charts: A classic technique, bar charts are effective for comparing
different categories or displaying changes over time. They use vertical or
horizontal bars to represent data values.
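A minimal bar-chart sketch with matplotlib; the categories and sales figures are invented.

import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
sales = [240, 180, 310, 205]

plt.bar(categories, sales)  # vertical bars; plt.barh() gives horizontal ones
plt.ylabel("Sales (units)")
plt.title("Sales by region")
plt.show()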
2. A Box and Whisker Plot
A box and whisker plot, or box plot, provides a visual summary of data through its
quartiles. First, a box is drawn from the first quartile to the third of the data set. A
line within the box represents the median. “Whiskers,” or lines, are then drawn
extending from the box to the minimum (lower extreme) and maximum (upper
extreme). Outliers are represented by individual points that are in-line with the
whiskers.
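A minimal box-and-whisker sketch with matplotlib, using randomly generated data for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = [rng.normal(50, 10, 200), rng.normal(60, 15, 200)]

plt.boxplot(samples)  # box = quartiles, line = median, points = outliers
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Value")
plt.title("Box and whisker plot")
plt.show()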
3. Waterfall Chart
A waterfall chart is a visual representation that illustrates how a value changes as
it’s influenced by different factors, such as time. The main goal of this chart is to
show the viewer how a value has grown or declined over a defined period. For
example, waterfall charts are popular for showing spending or earnings over time.
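Matplotlib has no built-in waterfall type, but one can be sketched from bar offsets, as below; the monthly cash-flow figures are invented.

import numpy as np
import matplotlib.pyplot as plt

labels = ["Start", "Jan", "Feb", "Mar", "Apr"]
changes = np.array([100, 30, -20, 45, -10])  # first value is the opening balance

bottoms = np.concatenate(([0], np.cumsum(changes)[:-1]))  # where each bar starts
colors = ["grey"] + ["green" if c >= 0 else "red" for c in changes[1:]]

plt.bar(labels, changes, bottom=bottoms, color=colors)
plt.ylabel("Cash flow")
plt.title("Waterfall chart of monthly changes")
plt.show()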
4. Area Chart
An area chart, or area graph, is a variation on a basic line graph in which the area
underneath the line is shaded to represent the total value of each data point.
When several data series must be compared on the same graph, stacked area
charts are used. This method of data visualization is useful for showing changes in
one or more quantities over time, as well as showing how each quantity combines
to make up the whole. Stacked area charts are effective in showing part-to-whole
comparisons.
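A minimal stacked-area sketch with matplotlib; the quarterly revenue figures are invented.

import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
product_a = [10, 14, 18, 22]
product_b = [5, 7, 9, 12]

# Each series is stacked on the previous one, showing part-to-whole over time.
plt.stackplot(quarters, product_a, product_b, labels=["Product A", "Product B"])
plt.xlabel("Quarter")
plt.ylabel("Revenue")
plt.legend(loc="upper left")
plt.title("Stacked area chart")
plt.show()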
5. Pictogram Chart
Pictogram charts, or pictograph charts, are particularly useful for presenting
simple data in a more visual and engaging way. These charts use icons to
visualize data, with each icon representing a different value or category. For
example, data about time might be represented by icons of clocks or watches.
Each icon can correspond to either a single unit or a set number of units (for
example, each icon represents 100 units).
6. Highlight Table
A highlight table is a more engaging alternative to traditional tables. By
highlighting cells in the table with color, you can make it easier for viewers to
quickly spot trends and patterns in the data. These visualizations are useful for
comparing categorical data. Depending on the data visualization tool you’re
using, you may be able to add conditional formatting rules to the table that
automatically color cells that meet specified conditions. For instance, when using
a highlight table to visualize a company’s sales data, you may color cells red if
the sales data is below the goal, or green if sales were above the goal. Unlike a
heat map, the colors in a highlight table are discrete and represent a single
meaning or value.
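One way to build such a highlight table in code is with the pandas Styler, sketched below; the goal value and sales figures are invented, and the red/green rule mirrors the example in the text.

import pandas as pd

goal = 100
sales = pd.DataFrame(
    {"Q1": [120, 80, 95], "Q2": [110, 130, 70]},
    index=["North", "South", "East"],
)

def highlight(value):
    # Discrete colours with a single meaning each: green = goal met, red = below goal.
    return "background-color: #c6efce" if value >= goal else "background-color: #ffc7ce"

styled = sales.style.applymap(highlight)
styled.to_html("highlight_table.html")  # open in a browser or render in a notebook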
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=4lxA7lo9GLU&pp=ygUnVGVjaG5pcXVlcyBmb3IgcGxvdHRpbmcgZGF0YSBpbiBlbmdsaXNo
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
•Numerical Data:
o Suitability: Ideal for performing mathematical operations, statistical
analysis, and creating models.
o Examples: Financial data analysis, scientific experiments, engineering
measurements.
o Visualization Techniques:
Line Charts: Effective for showing trends over time or continuous data.
Scatter Plots: Useful for showing relationships between two numerical
variables.
Histograms: Great for displaying the distribution of a dataset.
•Categorical Data:
o Suitability: Best for grouping and categorizing information for comparison.
o Examples: Market segmentation, customer feedback analysis,
demographic studies.
o Visualization Techniques:
Bar Charts: Excellent for comparing different categories.
Pie Charts: Useful for showing proportions and parts of a whole.
•Time-Series Data:
o Suitability: Crucial for analysing and forecasting trends and patterns over
time.
o Examples: Economic indicators, climate data, sales performance.
o Visualization Techniques:
Time-Series Plots: Ideal for showing data points over time.
Line Graphs: Great for visualizing trends and changes over periods.
Area Charts: Useful for showing cumulative data over time.
•Spatial Data:
o Suitability: Best for analysing information tied to geographic locations.
o Examples: Regional sales, store locations, delivery routes.
o Visualization Techniques:
Maps (choropleth or point maps): Ideal for revealing geographic patterns.
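The pairings above can be sketched in one figure: a histogram for numerical data, a pie chart for categorical proportions, and a line plot for a time series (all values below are invented; spatial data would normally call for map-based plots and extra libraries).

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(rng.normal(50, 10, 500), bins=20)      # numerical: distribution
axes[0].set_title("Histogram (numerical)")

axes[1].pie([45, 30, 25], labels=["A", "B", "C"])   # categorical: proportions
axes[1].set_title("Pie chart (categorical)")

axes[2].plot(range(12), rng.integers(80, 120, 12))  # time series: monthly values
axes[2].set_title("Line plot (time series)")

plt.tight_layout()
plt.show()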
Key factors to consider when choosing how to store, process, and visualize these data types include:
•Data Quality: Ensuring the accuracy, completeness, and reliability of the data.
•Data Format: The structure and format of data, affecting ease of processing and
analysis.
•Analysis Requirements: Specific needs of the analysis, such as level of detail
and precision.
•Visualization Needs: Choosing the best way to visually represent data to
communicate insights.
•Scalability: The ability to handle and process large volumes of data efficiently.
4. Choosing the Right Tool:
Video Link:
https://www.youtube.com/watch?v=4lxA7lo9GLU&pp=ygUnVGVjaG5pcXVlcyBmb3IgcGxvdHRpbmcgZGF0YSBpbiBlbmdsaXNo
THANK YOU
For queries
Email: [email protected]