Chapter 3
Data Visualization
Sagar Kamarthi
Uses of Data Visualization
• Data visualization can be used in preprocessing data:
Data cleaning by finding incorrect values, missing values, duplicate
rows
Variable derivation and selection
Determining appropriate bin sizes
Combining categories
Determining useful variables
Uses of Data Visualization
• We focus on the purpose of Data Exploration.
• Data Exploration is a mandatory initial step whether or not
more formal analysis follows.
• Graphical exploration can be used in:
Understanding the data structure
Cleaning data
Identifying outliers
Discovering initial patterns
Generating interesting questions
Graphs for Data Exploration
• Basic plots:
Bar charts
Line graphs
Scatterplots
• Distribution plots:
Boxplots
Histograms
Variables in Boston Housing Dataset
• CRIM Crime rate
• ZN Percentage of residential land zoned for lots over 25,000 ft²
• INDUS Percentage of land occupied by nonretail business
• CHAS Does tract bound Charles River (=1 if tract bounds river, =0 otherwise)
• NOX Nitric oxide concentration (parts per 10 million)
• RM Average number of rooms per dwelling
• AGE Percentage of owner-occupied units built prior to 1940
• DIS Weighted distances to five Boston employment centers
• RAD Index of accessibility to radial highways
• TAX Full-value property tax rate per $10,000
• PTRATIO Pupil-to-teacher ratio by town
• LSTAT Percentage of lower status of the population
• MEDV Median value of owner-occupied homes in $1000s
• CAT.MEDV Is median value of owner-occupied homes in tract above $30,000 (CAT.MEDV = 1) or not (CAT.MEDV = 0)
First Nine Records in Boston Housing Data
CRIM     ZN    INDUS  CHAS  NOX    RM     AGE    DIS     RAD  TAX  PTRATIO  LSTAT  MEDV  CAT.MEDV
0.00632  18.0  2.31   0     0.538  6.575  65.2   4.0900  1    296  15.3     4.98   24.0  0
0.02731  0.0   7.07   0     0.469  6.421  78.9   4.9671  2    242  17.8     9.14   21.6  0
0.02729  0.0   7.07   0     0.469  7.185  61.1   4.9671  2    242  17.8     4.03   34.7  1
0.03237  0.0   2.18   0     0.458  6.998  45.8   6.0622  3    222  18.7     2.94   33.4  1
0.06905  0.0   2.18   0     0.458  7.147  54.2   6.0622  3    222  18.7     5.33   36.2  1
0.02985  0.0   2.18   0     0.458  6.430  58.7   6.0622  3    222  18.7     5.21   28.7  0
0.08829  12.5  7.87   0     0.524  6.012  66.6   5.5605  5    311  15.2     12.43  22.9  0
0.14455  12.5  7.87   0     0.524  6.172  96.1   5.9505  5    311  15.2     19.15  27.1  0
0.21124  12.5  7.87   0     0.524  5.631  100.0  6.0821  5    311  15.2     29.93  16.5  0
Basic Charts
• Support data exploration by displaying one or two columns of
data (variables) at a time.
• Useful in the early stages of getting familiar with the data
structure, the amount and types of variables, the volume and
type of missing values…
Bar Chart for Categorical Variable
• Comparing a single statistic across
groups
• The height of the bar represents the
value of the statistic
• The different bars correspond to
different groups
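As a minimal sketch of what a bar chart encodes, the snippet below computes one statistic (the mean of MEDV) per group and prints a text bar for each. Using RAD as the grouping variable and the helper name `group_means` are assumptions made for illustration; the values come from the first nine records shown earlier.

```python
from collections import defaultdict
from statistics import mean

# (RAD, MEDV) pairs copied from the first nine Boston Housing records.
# Treating RAD as the grouping variable is an assumption for illustration.
records = [(1, 24.0), (2, 21.6), (2, 34.7), (3, 33.4), (3, 36.2),
           (3, 28.7), (5, 22.9), (5, 27.1), (5, 16.5)]

def group_means(pairs):
    """Return {group: mean value}, i.e. one bar height per group."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {g: mean(vals) for g, vals in sorted(groups.items())}

bar_heights = group_means(records)
for group, height in bar_heights.items():
    # One text "bar" per group; its length tracks the statistic.
    print(f"RAD={group}: {'#' * round(height)} {height:.1f}")
```

A plotting library would draw the same dictionary as bars; only the statistic and the grouping need to be chosen.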
Line Graph for Time Series
• Showing time series
Raw Series
Zoom-in: First 2 Years
Aggregation: Monthly Average
Aggregation: Yearly Average
Different Aggregations of Data
Scatterplot
• Displays relationship
between two numerical
variables
Distribution Plots
• Boxplots and histograms
• Display the entire distribution of a numerical variable
• Useful in supervised learning for determining potential data
mining methods and variable transformations
Histograms
• A histogram represents the
frequencies of all x values
with a series of vertical
connected bars
• In this example, histogram
shows the distribution of the
outcome variable (median
house value)
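The counting behind a histogram can be sketched as follows, assuming a bin width of 5 and a starting edge of 15 (both chosen here for illustration). The MEDV values are those of the first nine records only, so the shape is not representative of the full dataset.

```python
# MEDV values of the first nine Boston Housing records shown earlier.
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]

def histogram_counts(values, bin_width, start):
    """Count the values falling in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin holding v
        left = start + k * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

counts = histogram_counts(medv, bin_width=5, start=15)
for left, n in counts.items():
    # Each row is one bar; the number of '#' marks is the frequency.
    print(f"[{left}, {left + 5}): {'#' * n}")
```

Changing `bin_width` is a quick way to see how the choice of bin size alters the apparent shape of the distribution.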
Boxplots
• Top outliers are points above Q3 + 1.5(Q3 - Q1)
• Bottom outliers are points below Q1 - 1.5(Q3 - Q1)
• Uw (upper whisker) = maximum of the non-outliers
• Uw = min{max(X), Q3 + 1.5(Q3 - Q1)}
• Lw (lower whisker) = minimum of the non-outliers
• Lw = max{min(X), Q1 - 1.5(Q3 - Q1)}
• Details may differ across software
[Figure: boxplot with outliers, Uw, Quartile 3, mean, median, Quartile 1, and Lw labeled]
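The boxplot quantities can be computed directly. This is a sketch under stated assumptions: quartiles use Python's `statistics.quantiles` with `method="inclusive"`, which is only one of several conventions, and the whiskers follow the min/max formulas on this slide. The function name `boxplot_stats` is hypothetical.

```python
import statistics

# MEDV values of the first nine Boston Housing records shown earlier.
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]

def boxplot_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "Lw": max(min(values), low_fence),    # lower whisker
        "Uw": min(max(values), high_fence),   # upper whisker
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

stats = boxplot_stats(medv)
for name, value in stats.items():
    print(name, value)
```

With these nine values no point falls outside the fences, so the whiskers end at the smallest and largest observations.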
Side-by-side Boxplots
• Side-by-side boxplots are useful
for comparing subgroups
• Boston Housing Example:
Display distribution of outcome variable
(MEDV) for neighborhoods on Charles
river (1) and not on Charles river (0)
Boxplots and Histograms (Cont.)
• Useful for prediction tasks
• Boxplots can also support unsupervised learning by displaying
relationships between a numerical variable and a categorical
variable
• Side-by-side boxplots are useful in classification tasks for
evaluating the potential of numerical predictors
• Cannot reveal high-dimensional information
Multidimensional Visualization
Matrix Plot
• Shows scatterplots for variable pairs
• Example: scatterplots for 3 Boston Housing variables
[Figure: scatterplot matrix of CRIM, ZN, and INDUS]
Scatterplot with Additional Dimension
• Boston Housing example
• A scatterplot can now be used to
study the relationship between a
categorical outcome and
predictors
• Two numerical predictors color
coded by the categorical outcome
variable
Line Graphs with Additional Dimension
• The number of
deaths reported by
age in the US in 2021
• A record number of
young people died
during the Delta
surge
• Older age groups
continued to make
up the majority of
deaths
Aggregation and Hierarchies
• For a temporal scale, we can aggregate by different granularity.
• A popular aggregation for time series is a moving average, where the
average of neighboring values within a given window size is plotted.
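A moving average can be sketched in a few lines. The version below is a centered moving average with an odd window, which is one common choice rather than the only one; the series values are hypothetical and the function name `moving_average` is chosen for illustration.

```python
def moving_average(series, window):
    """Centered moving average; positions without a full window are None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch expects an odd, positive window size")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        if i < half or i >= len(series) - half:
            smoothed.append(None)           # not enough neighbors here
        else:
            neighbors = series[i - half : i + half + 1]
            smoothed.append(sum(neighbors) / window)
    return smoothed

monthly = [10, 12, 11, 15, 14, 18, 17]      # hypothetical monthly values
smoothed = moving_average(monthly, window=3)
print(smoothed)
```

A larger window gives a smoother line that emphasizes the global trend, while a smaller window keeps more of the local variation.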
Scatter Plot with Labels
• Better exploration of
outliers and clusters
Scatter Plot with Labels
• Covid-19 deaths compared
with state vaccination rates
in September 2021
• The graphic shows deaths
from Covid-19 since June
16, 2021, the day the United
States reached 600,000
deaths according to a New
York Times database. Data
is as of Sept. 29.
• Sources: New York Times database of reports from state
and local health agencies, Centers for Disease Control
and Prevention, U.S. Census Bureau
Scaling Up
• For large datasets, some effective alternatives aside from
aggregated charts could be:
Sampling
Reducing marker size
Using more transparent marker colors and removing fill
Breaking down the data into subsets
Using aggregation
Using jittering (slightly moving each marker by adding a small amount
of noise)
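Jittering can be sketched as adding a small uniform offset to each plotted coordinate. The noise range of 0.2 and the fixed seed are assumptions chosen for illustration; the RAD values come from the first nine Boston Housing records, where several points share the same value and would overlap exactly.

```python
import random

def jitter(values, amount, seed=0):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    return [v + rng.uniform(-amount, amount) for v in values]

# RAD from the first nine records: many markers share the same position.
rad = [1, 2, 2, 3, 3, 3, 5, 5, 5]
jittered = jitter(rad, amount=0.2)

for original, moved in zip(rad, jittered):
    print(original, round(moved, 3))
```

Only the plotted positions change; the underlying data should be left untouched, and the noise should stay small relative to the spacing between distinct values.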
Example Without Jittering
Scaling: With Jittering
Scaling: With and Without Jittering
Multivariate Plot: Parallel Coordinates Plot
• A vertical axis is drawn for each variable
• Each observation is represented by drawing a line that connects
its values on the different axes
• Gives an indication of useful predictors and suggests possible
binning for some numerical predictors
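The geometry of a parallel coordinates plot can be sketched without a plotting library. The assumption here, common but not universal, is that each variable is rescaled to [0, 1] so that all axes share the same height; the function name `to_polylines` is hypothetical, and the values are the first three Boston Housing records.

```python
# RM, LSTAT and MEDV for the first three Boston Housing records.
rows = [
    {"RM": 6.575, "LSTAT": 4.98, "MEDV": 24.0},
    {"RM": 6.421, "LSTAT": 9.14, "MEDV": 21.6},
    {"RM": 7.185, "LSTAT": 4.03, "MEDV": 34.7},
]
variables = ["RM", "LSTAT", "MEDV"]

def to_polylines(rows, variables):
    """Return one list of (axis index, scaled height) points per observation."""
    lows = {v: min(r[v] for r in rows) for v in variables}
    highs = {v: max(r[v] for r in rows) for v in variables}
    lines = []
    for r in rows:
        points = []
        for axis, v in enumerate(variables):
            span = highs[v] - lows[v]
            # Min-max scaling; a constant variable is placed mid-axis.
            height = (r[v] - lows[v]) / span if span else 0.5
            points.append((axis, height))
        lines.append(points)
    return lines

lines = to_polylines(rows, variables)
for points in lines:
    print(points)
```

Each inner list is the line drawn for one observation; coloring the lines by an outcome such as CAT.MEDV is what makes useful predictors stand out.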
Transformation
• The preprocessing step in data mining includes variable transformation
and derivation of new variables to help models perform more
effectively
• Transformations include changing the numeric scale of a variable,
binning numerical variables, condensing categories in categorical
variables…
Rescaling to Log Scale
Crowded Observable
Rescaling to Log Scale
Compressed Observable
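The effect of rescaling to a log scale can be illustrated numerically. The sketch below places the CRIM values of the first nine records on an axis of length 1, once on the original scale and once after a base-10 log; the helper `axis_positions` is hypothetical.

```python
import math

# CRIM values of the first nine Boston Housing records shown earlier.
crim = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
        0.02985, 0.08829, 0.14455, 0.21124]

def axis_positions(values, transform=lambda v: v):
    """Map values to [0, 1] along an axis after applying transform."""
    t = [transform(v) for v in values]
    lo, hi = min(t), max(t)
    return [(x - lo) / (hi - lo) for x in t]

linear = axis_positions(crim)
logged = axis_positions(crim, math.log10)

# Index 4 holds 0.06905, the sixth-smallest value: the share of the axis
# occupied by the six smallest values under each scale.
print(round(linear[4], 2), round(logged[4], 2))
```

On the original scale the six smallest values are squeezed into under a third of the axis; on the log scale they spread over roughly two thirds, which is why crowded points near zero become easier to tell apart.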
Transforming Data
• Raw data is not always in a convenient form, and modifying it can be
advantageous before analysis.
• Example: variable Y is nonlinearly related to variable X1
[Figure: Y plotted against X1]
• However, taking the reciprocal X2 = 1/X1 gives a linear relationship
between Y and X2
[Figure: Y plotted against X2]
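A small numerical check of the reciprocal transformation, assuming a hypothetical relationship Y = 2 + 3/X1 (the coefficients and the X1 values are invented for illustration). The Pearson correlation is computed by hand so the block has no dependencies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x1 = [1, 2, 4, 5, 10]                 # hypothetical predictor values
y = [2 + 3 / x for x in x1]           # hypothetical nonlinear relationship
x2 = [1 / x for x in x1]              # reciprocal transformation

r_raw = pearson(x1, y)                # linear fit is imperfect on the raw scale
r_recip = pearson(x2, y)              # exactly linear after the transformation
print(round(r_raw, 3), round(r_recip, 3))
```

After the transformation the correlation reaches 1 up to rounding error, which is the numerical counterpart of the straight line in the second plot.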
Heat Maps
• Color conveys information
• In data mining, heat maps are used to visualize
Correlations
Missing data
Data magnitude
Statistical distributions
Heatmap to Highlight Correlations
            CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX PTRATIO       B   LSTAT    MEDV
CRIM        1.00
ZN         -0.20    1.00
INDUS       0.41   -0.53    1.00
CHAS       -0.06   -0.04    0.06    1.00
NOX         0.42   -0.52    0.76    0.09    1.00
RM         -0.22    0.31   -0.39    0.09   -0.30    1.00
AGE         0.35   -0.57    0.64    0.09    0.73   -0.24    1.00
DIS        -0.38    0.66   -0.71   -0.10   -0.77    0.21   -0.75    1.00
RAD         0.63   -0.31    0.60   -0.01    0.61   -0.21    0.46   -0.49    1.00
TAX         0.58   -0.31    0.72   -0.04    0.67   -0.29    0.51   -0.53    0.91    1.00
PTRATIO     0.29   -0.39    0.38   -0.12    0.19   -0.36    0.26   -0.23    0.46    0.46    1.00
B          -0.39    0.18   -0.36    0.05   -0.38    0.13   -0.27    0.29   -0.44   -0.44   -0.18    1.00
LSTAT       0.46   -0.41    0.60   -0.05    0.59   -0.61    0.60   -0.50    0.49    0.54    0.37   -0.37    1.00
MEDV       -0.39    0.36   -0.48    0.18   -0.43    0.70   -0.38    0.25   -0.38   -0.47   -0.51    0.33   -0.74    1.00
• In Excel, use conditional formatting to produce a heat map
• Darker shades correspond to stronger correlation
• Correlation is measured by the correlation coefficient
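The shading rule of a correlation heat map can be sketched as a mapping from the absolute value of a coefficient to a shade. The five-character text palette and the function name `shade` are assumptions for illustration; the coefficients are taken from the MEDV row of the table.

```python
# Selected coefficients from the MEDV row of the correlation table.
corr_with_medv = {"RM": 0.70, "LSTAT": -0.74, "PTRATIO": -0.51,
                  "CHAS": 0.18, "MEDV": 1.00}

SHADES = " .:*#"   # lightest to darkest; a stand-in for a color gradient

def shade(r):
    """Pick a darker character for a stronger correlation (larger |r|)."""
    level = min(int(abs(r) * len(SHADES)), len(SHADES) - 1)
    return SHADES[level]

for name, r in corr_with_medv.items():
    print(f"{name:<8}{r:>6.2f}  [{shade(r)}]")
```

Because the shade depends on |r|, strong negative and strong positive correlations look equally dark; a diverging palette would be needed to show the sign as well.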
Heatmap to Highlight Correlations
https://towardsdatascience.com/annotated-heatmaps-in-5-simple-steps-cc2a0660a27d
Heatmap of Seasonal Temperature for Albany
https://www.originlab.com/doc/Quick-Help/create-heat-map
Heatmap of Incidence of Measles in the US
https://www.royfrancis.com/a-guide-to-elegant-tiled-heatmaps-in-r-2019/
Heatmap of Unemployment Rate in Selected Countries
Mapping America’s hospitalization and vaccination divide
Heatmap of Missing Values in a Database
• Heatmap of missing
values in a dataset
• Black denotes missing
value
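The grid behind a missing-values heat map is a 0/1 mask with one cell per record and column. In the sketch below the `None` entries are hypothetical gaps inserted for illustration (the records shown earlier have no missing values), and the function name `missing_mask` is chosen for this example.

```python
# Three records with a few Boston Housing columns. The None entries are
# hypothetical gaps inserted for illustration; they are not in the real data.
rows = [
    {"CRIM": 0.00632, "ZN": 18.0, "RM": 6.575},
    {"CRIM": None,    "ZN": 0.0,  "RM": 6.421},
    {"CRIM": 0.02729, "ZN": 0.0,  "RM": None},
]
columns = ["CRIM", "ZN", "RM"]

def missing_mask(rows, columns):
    """Return a 0/1 grid: 1 where a value is missing, 0 where it is present."""
    return [[1 if row.get(col) is None else 0 for col in columns]
            for row in rows]

mask = missing_mask(rows, columns)
for line in mask:
    # '#' plays the role of a black (missing) cell, '.' a present one.
    print("".join("#" if cell else "." for cell in line))
```

Drawn for a full dataset, dark vertical stripes point to columns with many gaps and dark horizontal stripes to incomplete records.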
Heatmap of Statistical Distributions
Statistical distribution of the values in a dataset column over time.
https://docs.honeycomb.io/working-with-your-data/heatmaps/
Multidimensional Visualization
• Two ways to convey richer information using basic plots:
Adding variables: color, size, shape, multiple panels, and animation…
Manipulations: rescaling, aggregation and hierarchies, zooming, panning, and
filtering…
• For additional categorical information, use hue, shape, or multiple
panels
• For additional numerical information, use color intensity or size
Parallel Coordinate Plot
CAT.MEDV = 1
CAT.MEDV = 0
Interactive Visualization
• Making changes to a plot is easy, rapid, and reversible.
• Multiple concurrent plots can be easily combined and displayed
on a single screen.
• A set of visualizations can be linked such that operations in one
display are reflected in the other displays
Linked Plots
Specialized Visualizations
Network Graph – eBay Auctions
• eBay sellers on left, buyers
on right
• Circle size = # of
transactions for the node
• Line width = # of auctions
for the buyer-seller pair
• Arrows point from buyer
to seller
Network Graph – eBay Auctions
• Network of character co-occurrence in Les Miserables,
by Mike Bostock
https://www.d3-graph-gallery.com/network
Treemap – eBay Auctions
• Hierarchical eBay data:
Category > Sub-category > Brand
• Rectangle size =
average closing price
(=item value)
• Color = % sellers with
negative feedback
(darker=more)
Tree Maps of a Student’s Day Activities
© AHEAD Analytics Head Start - LLC
Tree Maps of Daily Food Sales at a Company Cafeteria
Tree Maps of 2009 Benin Exports
Tree Maps of Total Checkouts
Map Charts of GDP and Well-Being Score
● Association between
countries’ well-being
and GDP
○ Darker shade indicates
higher value
○ Lighter shade indicates
lower value
Map Charts of Yale’s Community
● Map of Yale’s
international
alumni, research,
and education
programs
Map of USA with Pie Charts
● This map shows
the sales of garden
and home products
in different states
in the USA.
Map of Asia with Pie Charts
● This map shows
the total
population and the
age distribution in
each country of the
world in 2005.
Map of USA with Bubble Chart of 2014 US City Population
● Bubble chart of
2014 US city
population shown
on map of USA
Map of Europe with Column Charts
● Column charts of
projects of
different scale
shown on the map
of Europe
Summary: Prediction
1. Plot outcome on the y axis of vertical boxplots, vertical bar charts, and
scatterplots.
2. Study relation of outcome to categorical predictors via side-by-side
boxplots, bar charts, and multiple panels.
3. Study relation of outcome to numerical predictors via scatterplots.
4. Use distribution plots (boxplot, histogram) for determining needed
transformations of the outcome variable (and/or numerical predictors).
5. Examine scatterplots with added color/panels/size to determine the
need for interaction terms.
6. Use various aggregation levels and zooming to determine areas of the
data with different behavior, and to evaluate the level of global versus
local patterns.
Summary: Classification
1. Study relation of outcome to categorical predictors using bar charts with
the outcome on the y axis.
2. Study relation of outcome to pairs of numerical predictors via color-
coded scatterplots (color denotes the outcome).
3. Study relation of outcome to numerical predictors via side-by-side
boxplots.
4. Use color to represent the outcome variable on a parallel coordinate
plot.
5. Use distribution plots (boxplot, histogram) for determining needed
transformations of numerical predictor variables.
6. Examine scatterplots with added color/panels/size to determine the
need for interaction terms.
7. Use various aggregation levels and zooming to determine areas of the
data with different behavior, and to evaluate the level of global versus
local patterns.
Summary: Time Series Forecasting
1. Create line graphs at different temporal aggregations to determine types
of patterns.
2. Use zooming and panning to examine various shorter periods of the
series to determine areas of the data with different behavior.
3. Use various aggregation levels to identify global and local patterns.
4. Identify missing values in the series.
5. Overlay trend lines of different types to determine adequate modeling
choices.
Summary: Unsupervised Learning
1. Create scatterplot matrices to identify pairwise relationships and
clustering of observations
2. Use heat maps to examine the correlation table
3. Use various aggregation levels and zooming to determine areas of the
data with different behavior
4. Generate a parallel coordinate plot to identify clusters of observations
Correlation Coefficient
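\[
\rho(X,Y) = \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}
                 {\left[ \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^{2} \, \sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^{2} \right]^{1/2}}
\]
\[
\rho(X,Y) = \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{n \, \sigma_X \, \sigma_Y}
\]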
• The correlation coefficient varies between -1 and 1.
If it is close to 1, X and Y have a strong positive linear relationship
If it is close to 0, X and Y have little or no linear relationship
If it is close to -1, X and Y have a strong negative linear relationship
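Both forms of the correlation coefficient can be checked numerically. The sketch below uses RM and MEDV from the first nine Boston Housing records and assumes the standard deviations in the second form are population standard deviations (dividing by n), which is what makes the two forms agree. The function names are hypothetical.

```python
import math
import statistics

# RM and MEDV for the first nine Boston Housing records shown earlier.
rm = [6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]

def cross_deviation_sum(xs, ys):
    """Sum of (x(i) - mean x) * (y(i) - mean y) over all observations."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def rho_sums(xs, ys):
    """First form: numerator over the square root of the squared-deviation sums."""
    sxx = cross_deviation_sum(xs, xs)
    syy = cross_deviation_sum(ys, ys)
    return cross_deviation_sum(xs, ys) / math.sqrt(sxx * syy)

def rho_sigma(xs, ys):
    """Second form: numerator over n * sigma_X * sigma_Y (population sigmas)."""
    n = len(xs)
    return cross_deviation_sum(xs, ys) / (
        n * statistics.pstdev(xs) * statistics.pstdev(ys))

r1 = rho_sums(rm, medv)
r2 = rho_sigma(rm, medv)
print(round(r1, 3), round(r2, 3))
```

For these nine records the coefficient is strongly positive, in line with the 0.70 reported for RM and MEDV on the full dataset, although nine rows are far too few to estimate it reliably.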
Correlation Coefficient
[Figure: four scatterplots of Y against X, with ρ(X,Y) = 0, 1, -1, and 0]