MODULE II

Complete Plots
The most obvious feature of statistical graphics software is that the user should be able to produce graphical output.
In other words, the user should be able to draw something.
In most cases, the user will want to draw some sort of plot consisting of axes, labels and data symbols or lines to represent the data.
The figure below shows an example consisting of a basic scatterplot.
This is one distinction between statistical graphics software and a more general graphics language such as PostScript.
The user does not just want to be able to draw lines and rectangles (though they may also want to do that).
• The user wants to be able to create an entire plot.
• To be even more explicit, the user wants to be able to draw an entire plot with a single command (or via a single menu selection).
• All statistical software packages provide this feature in one way or another (though they may differ in terms of the range of different sorts of plots that can be produced).

Sensible Defaults
• Sensible defaults in data visualization are predefined settings and configurations that are likely to work well for a wide range of situations. These defaults aim to create clear, effective, and aesthetically pleasing visualizations without requiring extensive customization.
• The basic plot consists of a standard set of components: axes, labels and data symbols.
• But there are other important aspects to this plot.
• For a start, these components are all in sensible locations; the title is at the top and, very importantly, the data symbols are at the correct locations relative to the axes (and the scales on the axes ensure that there is sufficient room for all of the data points).
• Some of these aspects are inevitable; no one would use a program that drew data symbols in the wrong locations or created axis scales so that none of the data could be seen.

Trellis Plots
• Trellis plots, also known as lattice or small-multiple plots, are a type of data visualization technique used in statistics and data analysis. The main idea behind trellis plots is to display multiple smaller plots, or panels, arranged in a grid, with each panel showing a subset of the data. This approach allows for a more detailed exploration of the data across different dimensions.
• A good example of a graphics system that provides sensible defaults is the Trellis system.
• The choice of default values in this system has been guided by the results of studies in human perception so that the information within a plot will be conveyed quickly and correctly to the viewer.
• The subtitle is also more heavily emphasized by using a bold font face.
• The Trellis defaults extend to selections of plotting symbols and colors in plots of multiple data series, which are chosen so that different data series can be easily distinguished by the viewer.
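
As a rough illustration of the Trellis idea described above, the sketch below splits a small data set into one panel per level of a conditioning variable, assigns each panel a cell in a grid, and computes axis limits shared by all panels with a little padding so that every point fits. It is a minimal sketch in plain Python, not how any particular Trellis implementation works; the names `trellis_panels` and `shared_limits`, the `site` variable, the sample data and the 4% padding are all assumptions made for the example.

```python
from collections import defaultdict
from math import ceil


def trellis_panels(rows, by, ncol=2):
    """Split rows into one panel per level of `by` and place panels on a grid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    levels = sorted(groups)              # fixed panel order: sorted level names
    nrow = ceil(len(levels) / ncol)      # rows needed to hold every panel
    panels = []
    for i, level in enumerate(levels):
        panels.append({
            "level": level,
            "row": i // ncol,            # grid row, counted from the top
            "col": i % ncol,             # grid column, counted from the left
            "data": groups[level],
        })
    return nrow, panels


def shared_limits(rows, var, pad=0.04):
    """Axis limits common to all panels, padded so no point sits on the border."""
    values = [row[var] for row in rows]
    lo, hi = min(values), max(values)
    margin = (hi - lo) * pad             # hypothetical default: 4% of the range
    return lo - margin, hi + margin


# Hypothetical data: four observations from three sites.
data = [
    {"x": 1.0, "y": 2.0, "site": "north"},
    {"x": 2.0, "y": 3.5, "site": "south"},
    {"x": 3.0, "y": 1.5, "site": "east"},
    {"x": 5.0, "y": 4.0, "site": "north"},
]

nrow, panels = trellis_panels(data, by="site", ncol=2)
xlim = shared_limits(data, "x")
```

Using the same limits in every panel is one possible default; it keeps the panels directly comparable, which is the point of showing subsets side by side.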

Customization
• Let us assume that your statistical software allows you to produce a complete plot from a single command and that it provides sensible defaults for the positioning and appearance of the plot.
• It is still quite unlikely that the plot you end up with will be exactly what you want.
• For example, you may want a different scale on the axes, or the tick marks in different positions, or no axes at all.
• After being able to draw something, the next most important feature of statistical graphics software is the ability to control what gets drawn and how it gets drawn.

Setting Parameters
• For any particular piece of output, there will be a number of free parameters that must be specified.
• As a very basic example, it is not sufficient to say something like ‘I want to draw a line’; you must also specify where the line should start and where it should end.
• In order to fully specify the drawing of a single straight line, it is necessary to provide not only a start and end point, but also a color, a line style (perhaps dashed rather than solid), how thick to draw the line and even a method for how to deal with the ends of the line (should they be rounded or square?).
• When producing plots, you deal with more complex graphical output than just a single line, and more complex graphical components have their own sets of parameters.
• For example, when drawing an axis, one parameter might control the number of tick marks on the axis and another might control the text for the axis label.

Graphical Parameters
• There is a common set of ‘graphical’ parameters that can be applied to almost any graphical output to affect the appearance of the output.
• This set includes such things as line color, fill color, line width, line style (e.g. dashed or solid) and so on.
• To give complete control over the appearance of graphical output, it is important that statistical graphics software provides a complete set of graphical parameters.
• Examples of parameters that may sometimes be overlooked are semitransparent colors, line joins and endings (Fig.) and full access to a variety of fonts.

Arranging Plots
• Where several plots are produced together on a page, a new set of free parameters becomes available, corresponding to the location and size of each complete plot.
• It is important that statistical graphics software provides some way to specify an arrangement of several plots.
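
Where several plots share a page, the location and size of each plot can be computed rather than set by hand. The sketch below is one hypothetical way to do this: it divides the page into a regular grid and returns, for each cell, its left edge, bottom edge, width and height as fractions of the page (0 at the left or bottom edge, 1 at the right or top edge). The function name `grid_layout` and the `gap` parameter are assumptions made for the example, not part of any particular package.

```python
def grid_layout(nrow, ncol, gap=0.02):
    """Return (left, bottom, width, height) for each cell of an nrow x ncol grid.

    All values are fractions of the page. Cells are listed row by row,
    starting from the top-left cell. `gap` is the space left between
    neighbouring cells and around the page border.
    """
    width = (1.0 - gap * (ncol + 1)) / ncol
    height = (1.0 - gap * (nrow + 1)) / nrow
    cells = []
    for r in range(nrow):
        for c in range(ncol):
            left = gap + c * (width + gap)
            bottom = 1.0 - (r + 1) * (height + gap)   # row 0 is at the top
            cells.append((left, bottom, width, height))
    return cells


# Two rows of three plots on one page.
cells = grid_layout(2, 3)
```

Each tuple could then be handed to whatever drawing routine places a complete plot in a rectangle of the page.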

Annotation
• A more complex sort of customization involves the addition of further graphical output to a plot.
• For example, it can be useful to add an informative label to one or more data symbols in a plot.
• The first requirement for producing annotations is the ability to produce very basic graphical output, such as simple text labels.
• In this way, statistical graphics software needs to be able to act like a generic drawing program, allowing the user to draw lines, rectangles, text and so on.
• In other words, it is not good enough if the software can ‘only’ draw complete plots.
• The following code demonstrates how rectangles, lines, polygons and text can be added to a basic plot:

Coordinate Systems
One of the most important and distinctive features of statistical graphics software is that it is not only capable of producing many pieces of graphical output at once (lots of lines, text, and symbols that together make up a plot), but that it is also capable of positioning the graphical output within more than one coordinate system.
• The title of a plot might be positioned halfway across a page. That is, the title is positioned relative to a ‘normalized’ coordinate system that covers the entire page, where the location 0 corresponds to the left edge of the page and the location 1 corresponds to the right edge.
• The data symbols in a scatterplot are positioned relative to a coordinate system corresponding to the range of the data that only covers the area of the page bounded by the plot axes.
• The axis labels might be positioned halfway along an axis. That is, the axis labels are positioned relative to a ‘normalized’ coordinate system that only covers the area of the page bounded by the plot axes.
• Many users of statistical graphics software produce a plot and then export it to a format that can be easily edited using third-party software (e.g. export to WMF and edit using Microsoft Office products).
• This has the disadvantage that the coordinate systems used to produce the plot are lost and cannot be used to locate or size annotations.
• Furthermore, it makes it much harder to automate or programmatically control the annotation, which is essential if a large number of plots are being produced.

Extensibility
• The ability to produce complete plots, control all aspects of their appearance and add additional output represents a minimum standard for what statistical graphics software should provide.
• A more advanced feature is the ability to extend the system to add new capabilities, such as new types of plots.
• In some respects, creating a new sort of plot is just an extreme version of customization, but there are two distinguishing features: you are starting from a blank slate rather than building on an existing plot as a starting point (i.e. it is not just annotation) and, more importantly, extensibility means that the new plot that you create is made available for others to use in exactly the same way as existing plots.
• To be more explicit, in an extensible system you can create a new menu item or function that other users can access.
• For a start, the system must allow new functions or menu items to be added, and these must be able to be added by the user.
• The next most important features are that low-level building blocks must be available and there must be support for combining those building blocks into larger, coherent graphical elements (plots).

Building Blocks
• What are the fundamental building blocks from which plots are made?
• At the lowest level, a plot is simply basic graphical shapes and text, so these must be available.
• In addition, there must be some way to define coordinate systems so that graphical elements can be conveniently positioned in sensible locations to make up a plot.

Transformations in Statistical Graphics
• An important difference between transformations in a general graphics language and transformations in statistical software is that statistical software does not apply transformations to all output.
• This arises from the difference between statistical graphics and general graphics images (art).
• A good example is that in PostScript or SVG the current transformation applies to text as well as all other shapes.
• In particular, if the current transformation scales output, all text is scaled.
• This is not desirable when drawing a statistical plot because we would like the text to be readable, so in statistical graphics, transformations apply to the locations of output and the size of shapes such as rectangles and lines, but text is sized separately.

Combining Graphical Elements
• In addition to allowing the user to compose basic graphics shapes and position them flexibly, a statistical graphics system should allow the user to ‘record’ a composition of graphics shapes.
• For example, the user should be able to write a function that encapsulates a series of drawing operations.
• This does two things: the complete set of operations becomes easily available for other people to use, and the function represents a higher-level graphical element that can be used as part of further compositions.

Other Issues
This section draws together a number of issues that overlap with the production of static graphics.

3-D Plots
Static 3-D plots have limited usefulness because 3-D structures are often difficult to perceive without motion. Nevertheless, it is important to be able to produce 3-D images for some purposes. For example, a 3-D plot can be very effective for visualizing a prediction surface from a model. R provides only simple functionality for drawing 3-D surfaces via the persp() function, but the rgl add-on package provides an interface to the powerful OpenGL 3-D graphics system.

Speed
• In dynamic and interactive statistical graphics, speed is essential. Drawing must be as fast as possible in order to allow the user to change settings and have the graphics update in real time.
• In static graphics, speed is less of an issue; achievability of a particular result is more important than how long it takes to achieve it. It is acceptable for a plot to take on the order of seconds to draw rather than milliseconds.
• This speed allowance is particularly important in terms of the user interface. For example, in R a lot of graphics code is written in interpreted R code (which is much slower than C code). This makes it easier for users to see the code behind graphics functions, to possibly modify the code, and even to write their own code for graphics.
• Nevertheless, a limit is still required because the time taken to draw a single plot can be multiplied many times when producing plots of a large number of observations and when running batch jobs involving a large number of plots. In R, complex plots, such as Trellis plots produced by the lattice package, can be slow enough to see individual panels being drawn, but most users find this acceptable. The entire suite of figures for a medium-sized book can still be generated in much less than a minute.

Output Formats
• When producing plots for reports, it is necessary to produce different formats depending on the format of the report.
• For example, reports for printing are best produced using PostScript or PDF versions of plots, but for publication on the World Wide Web, it is still easiest to produce some sort of raster format such as PNG.
• There are many excellent pieces of software for converting between graphics formats, which reduces the need for statistical graphics software to produce output in many formats; simply produce whatever format the statistical graphics software supports and then convert it externally.
• Nevertheless, there are still some reasons for statistical graphics software to support multiple formats.
• Some formats, especially modern ones, provide features that are unavailable in other formats, such as transparency, hyperlinks and animation.
• It is not possible to convert a more basic format into a more sophisticated format without adding information.
• When producing plots with R, it is advisable to record the R code that was used to produce the plot in addition to saving the plot in any ‘traditional’ formats such as PDF or PostScript.
• One important advantage of retaining such a high-level format is that it is then possible to modify the image using high-level statistical graphics concepts.
• For example, an extra text label can be positioned relative to the scales on a plot by modifying the original R code, but this sort of manipulation would be inconvenient, inaccurate and hard to automate if you had to edit a PDF or PostScript version of the plot.

Data Handling
• Statistical graphics software has largely ignored the issue of where the data come from.
• By separating data from graphics there is a greater flexibility to present any data using any sort of graphic.
• However, we should acknowledge the importance of functionality for generating, importing, transforming and analyzing data.
• Without data, there is nothing interesting to plot.
• In an ideal situation, statistical graphics facilities are provided as part of a larger system with data-handling features, as is the case with R.
• Statistical graphics software should provide a straightforward way to produce complete plots.

Data and Graphs
• Graphs are useful entities since they can represent relationships between sets of objects.
• They are used to model complex systems (e.g., computer and transportation networks, VLSI and Web site layouts, molecules, etc.) and to visualize relationships (e.g., social networks, entity-relationship diagrams in database systems, etc.).
• In statistics and data analysis, we usually encounter them as dendrograms in cluster analysis, as trees in classification and regression, and as path diagrams in structural equation models and Bayesian belief diagrams.
• Graphs are also very interesting mathematical objects, and a lot of attention has been paid to their properties.
• The various ways of visualizing a graph provide different insights, and often hidden relationships and interesting patterns are revealed.
• An increasing body of literature is considering the problem of how to draw a graph.
• However, graphs are also capable of capturing the structure of data commonly encountered in statistics.

Graph Layout Techniques
• The problem of graph drawing/layout has received a lot of attention from various scientific communities.
• It is defined as follows: given a set of nodes connected by a set of edges, identify the positions of the nodes in some space and calculate the curves that connect them.
• Hence, in order to draw a graph, one has to make the following two choices: (i) selection of the space and (ii) selection of the curves.
• Most graph drawing techniques use straight lines between connected nodes, but some use curves of a certain degree.
• Different types of graphs require different algorithms for clean layouts.
• The inputs to the graph-drawing (or graph-layout) problem consist of an unordered list of vertices (node labels) and an unordered list of edges (pairs of node labels).
• If a graph is connected, then we may receive only a list of edges.
• If we have a weighted graph, the edge weights may be used in the loss function used to define the layout.

Hierarchical Trees
• Suppose we are given a recursive list of single parents and their children.
• In this list, each child has one parent and each parent has one or more children.
• One node, the root, has no parent.
• This tree is a directed graph because the edge relation is asymmetric.
• We can encapsulate such a list in a node class:
    class Node {
        Node parent;        // null for the root
        NodeList children;  // empty for a leaf
    }
• A display for such a list is called a tree browser.
• Creating such a display is easy.
• We simply walk the tree, beginning at the root, and indent children in relation to their parents.
• Suppose now we are given only a list of edges and told to lay out a rooted tree.
• To lay out a tree using only an edge list, we need to inventory the parent–child relationships.
• First, we identify leaves by locating nodes appearing only once in the edge list.
• We then assign a layer value to each node by finding the longest path to any leaf from that node.

Spanning Trees
• It makes sense that we might be able to lay out a spanning tree nicely if we approximate graph-theoretic distance with Euclidean distance.
• This should tend to place adjacent vertices (parents and children) close together and push vertices separated by many edges far apart.
• The most popular algorithm for doing this is a variant of multidimensional scaling called the springs algorithm.
• It uses a physical analogy (springs under tension represent edges) to derive a loss function representing total energy in the system (similar to MDS stress).
• Iterations employ steepest descent to reduce that energy.

Laying Out a Simple Tree
The figure below shows an example using data from a small website.
• Each node is a page and the branches represent the links between pages; their thickness represents traffic between pages (this website has no cross-links).
• It happens that the root is located near the center of the display.
• This is a consequence of the force-directed algorithm.
• Adjacent nodes are attracted and nonadjacent nodes are repelled.
• The springs algorithm brings to mind a simple model of a plant growing on a surface.
• This model assumes branches should have a short length so as to maximize water distribution to the leaves and assumes leaves should be separated as much as possible so as to maximize exposure to sunlight.

Networks
Networks are, in general, cyclic graphs. Force-directed layout methods often work well on networks. There is nothing in the springs algorithm that requires a graph to be a tree. As an example, the figure below shows an associative network of animal names from an experiment. Subjects were asked to produce a list of animal names. Names found to be adjacent in subjects’ lists were considered adjacent in a graph.
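
The springs idea described above can be sketched in a few lines. The example below is a minimal, unoptimized version under stated assumptions, not the algorithm used to produce the figures: the graph is connected and unweighted, graph-theoretic distance is the number of edges on the shortest path, the loss is the sum over all node pairs of the squared difference between Euclidean and graph-theoretic distance, and steepest descent uses a fixed step size. The function names, the circular starting positions, the step size and the small example tree are all hypothetical.

```python
import math
from collections import deque


def graph_distances(nodes, edges):
    """Shortest-path length (in edges) between every pair of nodes, by BFS."""
    adjacent = {n: [] for n in nodes}
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)
    dist = {}
    for start in nodes:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacent[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[start] = seen
    return dist


def stress(pos, dist):
    """Loss: squared mismatch between layout distance and graph distance."""
    names = list(pos)
    total = 0.0
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            total += (math.dist(pos[u], pos[v]) - dist[u][v]) ** 2
    return total


def springs_layout(nodes, dist, steps=500, rate=0.02):
    """Steepest descent on the stress, starting from points on a unit circle."""
    n = len(nodes)
    pos = {u: [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i, u in enumerate(nodes)}
    for _ in range(steps):
        grad = {u: [0.0, 0.0] for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                length = math.hypot(dx, dy) or 1e-9   # avoid division by zero
                pull = 2.0 * (length - dist[u][v]) / length
                grad[u][0] += pull * dx
                grad[u][1] += pull * dy
        for u in nodes:                                # move against the gradient
            pos[u][0] -= rate * grad[u][0]
            pos[u][1] -= rate * grad[u][1]
    return pos


# Hypothetical example: a small rooted tree given only as an edge list.
nodes = ["r", "a", "b", "c", "d"]
edges = [("r", "a"), ("r", "b"), ("a", "c"), ("a", "d")]
dist = graph_distances(nodes, edges)
start = springs_layout(nodes, dist, steps=0)     # starting positions only
final = springs_layout(nodes, dist, steps=500)
```

Nothing in the code depends on the graph being a tree, so the same functions can be applied to a cyclic network. Pairs that are too far apart are pulled together and pairs that are too close are pushed apart, which corresponds to the attraction and repulsion described above.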

Directed Graphs
• Directed graphs are usually arranged in a vertical (horizontal) partial ordering with source node(s) at top (left) and sink node(s) at bottom (right).
• Nicely laying out a directed graph requires a topological sort.
• We temporarily invert cyclical edges to convert the graph to a directed acyclic graph (DAG) so that the paths-to-sink can be identified.
• Then we do a topological sort to produce a linear ordering of the DAG such that for each edge (u, v), vertex u is above vertex v.
• After sorting, we iteratively arrange vertices with tied sort order so as to minimize the number of edge crossings.
• The figure below shows a graph encapsulating the evolution of the UNIX operating system.
• It was computed by the AT&T system of graph layout programs.

Treemaps
• Treemaps are recursive partitions of a space.
• The simplest form is a nested rectangular partitioning of the plane.
• To transform a binary tree into a rectangular treemap, for example, we start at the root of the tree.
• We partition a rectangle vertically; each block (tile) represents one of the two children of the root.
• We then partition each of the two blocks horizontally so that the resulting nested blocks represent the children of the children.
• We apply this algorithm recursively until all the tree nodes are covered.
• The recursive splits alternate between vertical and horizontal.
• If we wish, we may color the rectangles using a list of additive node weights.
• Otherwise, we may use the popular device of resizing the rectangles according to the node weights.
• The figure below shows an example that combines color (to represent politics, sports, technology, etc.) and size (to represent number of news sources) in a visualization of the Google news site.
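
The recursive partitioning described for treemaps can be sketched as follows. This is a minimal illustration under stated assumptions rather than a production layout: the tree is a nested dictionary with hypothetical keys `name`, `weight` and `children`, a node's weight is the sum of its leaves' weights (the weights are additive), each rectangle is divided in proportion to those weights, and the split direction alternates between vertical and horizontal at each level.

```python
def node_weight(node):
    """Additive weight: a leaf's own weight, or the sum over its children."""
    children = node.get("children", [])
    if not children:
        return node["weight"]
    return sum(node_weight(child) for child in children)


def treemap(node, x, y, w, h, vertical=True, tiles=None):
    """Return a list of (name, x, y, width, height) tiles, one per leaf."""
    if tiles is None:
        tiles = []
    children = node.get("children", [])
    if not children:
        tiles.append((node["name"], x, y, w, h))
        return tiles
    total = node_weight(node)
    offset = 0.0                       # fraction of the rectangle already used
    for child in children:
        share = node_weight(child) / total
        if vertical:                   # vertical cuts: children sit side by side
            treemap(child, x + offset * w, y, w * share, h, False, tiles)
        else:                          # horizontal cuts: children are stacked
            treemap(child, x, y + offset * h, w, h * share, True, tiles)
        offset += share
    return tiles


# Hypothetical binary tree: A is a leaf, B has two leaf children.
tree = {
    "name": "root",
    "children": [
        {"name": "A", "weight": 2},
        {"name": "B", "children": [
            {"name": "C", "weight": 1},
            {"name": "D", "weight": 1},
        ]},
    ],
}

tiles = treemap(tree, 0.0, 0.0, 1.0, 1.0)
```

Because each child receives a share of its parent's rectangle proportional to its weight, tile areas are proportional to leaf weights; a color per tile could be attached in the same pass.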