1
TM351
Data management and analysis
2
Caveat
• These slides DO NOT replace the course learning
materials
• Exams WILL BE derived from the full set of the course
learning materials
3
SESSION 3
parts 5 and 6
4
PART 5
Presentation: telling the story
5
2 Data investigation: telling the story (or
stories)
• This part focuses on the final phase of the data pipeline:
the presentation of our investigation and its findings.
Figure 5.1 A data analysis pipeline
6
3 Finding the story
• In this module, a data investigation is a broader activity
than simple data analysis.
• It involves working with one or more datasets to:
• explore a particular question, or
• identify one or more themes:
• to identify a trend
• to find correlations, or
• to discover anomalies.
• It may also include an interpretation or explanation of
what has been uncovered.
7
3 Finding the story
• It may require data to be acquired and prepared, thus taking
in the entire data-processing pipeline.
• It may comprise several analyses, visualizations,
interpretations and reports.
• This story-based engagement with data can be seen as
having two main components:
• exploratory – finding the story within the data
• explanatory – presenting and illustrating the final results of the
exploration in a meaningful way.
• We will now briefly consider each of these.
8
3.1 Exploratory investigations
• The exploration of a dataset is essentially a conversation with it: we ask it a
question, consider the outcome and then identify further questions to clarify the
outcome. There are two kinds of conversation, however:
• The closed conversation. This formal style of questioning conversation is useful for
checking facts, looking for evidence, or identifying correlations. It will start with a set
of salient questions, frequently determined in advance, and remain focused on
getting the required answers, with little or no deviation, and possibly drawing
together the answers from different datasets. This closed-interview style is often
used with routinely prepared datasets, and hard-coded into report generators,
summaries and dashboards.
• The journalistic interview. Alternatively, a much more free-flowing conversation,
leading to novel findings, is possible. Subtly different ways of putting the same
questions may show subtle variations in outcome, and generate new questions.
Instead of looking for the facts to fit a particular story, the questioning uncovers
the story. Such an approach is hard to automate and requires experience in a
range of data manipulation and visualisation tools and techniques.
• Languages like Python and the IPython Notebook environment encourage the
conversational exploration of datasets, allowing data to be queried easily, and with
immediate results.
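• For instance, a minimal sketch of this conversational style using pandas (the dataset below is hypothetical, standing in for a real data file):

import pandas as pd

# A tiny hypothetical accidents dataset, standing in for a real file
accidents = pd.DataFrame({
    "severity": ["slight", "serious", "slight", "slight", "fatal"],
    "month":    [1, 1, 2, 3, 3],
})

# Each question to the data is one short expression with an immediate answer
print(accidents.head())                      # what does the data look like?
print(accidents["severity"].value_counts())  # how serious are the incidents?
print(accidents.groupby("month").size())     # is there a seasonal pattern?
# ... each answer suggests the next question to ask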
9
10
3.2 Explanatory investigations
• The explanatory story communicates the findings of the exploration.
Traditionally, this took the form of a written report, supported by tables and
graphics. Today a whole range of communication devices is available:
presentation slide decks, blog entries, interactive visualisations, etc.
However, whatever story telling devices are chosen, they must be capable
of delivering the story.
• But what does it mean to be able to tell stories with data?
• Kosara and Mackinlay (2013, p. 44) write
• At its essence, a story is an ordered sequence of steps, each of which can
contain words, images, visualizations, video, or any combination thereof. …
• In traditional stories, order roughly corresponds with time, which is crucial
for understanding causality: events that happen earlier can influence later
events, but not the other way around. Stories often are not told in a linear
fashion …. However, within each segment, the order must be
consistent – and the order of different segments clear – for the story to be
comprehensible.
11
3.2 Explanatory investigations
• When a chart forms the basis of the story, we are faced with an immediate
problem: how do we read it so as to be able to decode the ‘linear’ story it is
trying to tell us?
• On 24 June 1812, with an army of nearly half a million men, Napoleon
Bonaparte crossed the river Neman into Russia. His goal was Moscow.
Marching nearly 700 miles through trackless forests, across dirt roads, mires
and fens, the French entered Moscow on 14 September 1812 to find the city
deserted and stripped of all useful supplies. With no other option, Napoleon’s
army began a long retreat through the gathering Russian winter. Temperatures
dropped to around −40 °C. Frozen and starving, the French struggled back,
plagued by disease, desertions and suicides, and harried by Russian troops all
the way. By 22 December, the force had been ejected from Russia completely.
Fewer than 15,000 men remained.
• In 1869, Charles Joseph Minard, a French engineer, chose to present the story
of this calamitous campaign in a chart, which has been described as possibly
‘the best statistical graphic ever drawn’ (Tufte, 2001, p. 40). Figure 5.2
reproduces Minard’s map. Spend a few minutes trying to make sense of it.
12
Minard's map
13
3.3 Author-driven and reader-driven stories
• Edward Segel and Jeffrey Heer identify a continuum of
visualisation stories, from ‘author-driven’ at one extreme
to ‘reader-driven’ at the other. They describe author-driven
visualisations as:
• … a strict linear path through the visualization, [that] relies
heavily on messaging, and includes no interactivity.
Examples include film and non-interactive slideshows.
[This] approach works best when the goal is storytelling or
efficient communication … in comics, art, cinema,
commercials, business presentations, educational videos,
and training materials.
• Segel and Heer (2010, p. 1146)
14
Reader-driven approach
• With the huge increases in data processing and graphical
power harnessed in tools like IPython and Tableau, there
has been a growth in reader-driven stories in which the
reader is free to explore and re-explore the datasets in
different ways.
• Segel and Heer describe these as having:
• … no prescribed ordering of images, no messaging, and a
high degree of interactivity. … A reader-driven approach
supports tasks such as data diagnostics, pattern
discovery, and hypothesis formation.
• Segel and Heer (2010, p. 1146)
15
Warning against reader-driven approach
• Kosara and Mackinlay warn against an overabundance of
interactive features and free flow in story telling:
• There is a tradeoff between interaction and focus: the
former tends to distract from the story. Stories that
respond to and change based on interaction, such as by
selecting a particular part of the data or asking questions
that the user is interested in, are conceivable, but it is
unclear how to do this without the interaction causing
some form of interference.
• Kosara and Mackinlay (2013, p. 49)
16
Controlling reader-driven stories
• Sometimes the range of interventions available in an
interactive data visualization may destroy the sense of
the story itself:
• Which menu to select?
• Which radio button controls this?
• Why are multiple check boxes ticked?
• Means have to be found to control reader-driven stories.
17
Three approaches to story-telling
• Segel and Heer (2010, p. 1146) describe three different
approaches:
1. The Martini glass structure: begin with a tight narrative
exposition (the stem of the glass) and then open out into
free exploration (the body of the glass). The exposition is
author-driven, the free exploration reader-driven.
2. The interactive slideshow: the author leads with their
story on each slide but then allows the reader the
opportunity to explore the content of each in more detail.
3. The drill-down story: an interactive visualization
containing multiple different stories, with the reader
choosing which one(s) to explore.
18
The Martini glass structure as a
preferred approach
• Philip Man (2011) suggests that the Martini glass structure
is the preferred narrative style of many newsrooms:
• there is a strong opening lead summarizing the essential
elements of the story, followed by background information,
then more specific details, a chronology of events and a
final, summarizing paragraph.
• Although we will not be covering the matter of how to
develop interactive data stories any further in this module,
it is worth noting that it is possible to create some simple
interactive visualizations within IPython Notebooks using
IPython widgets (Vanderplas, 2013).
19
3.4 Automating story telling
• Some publishers now offer ‘your child in a story’ interactives and printed
books.
• A customer supplies certain details about their child (name, age, gender,
some friends’ names, even a few photographs) and they get back a
‘personalised’ story book – for example ‘John and the Three Bears’.
• Of course, this is simple template filling, but provided the publishers don’t
mess up the template data (‘Smith, John and the Three Bears’), and provided
you don’t read too many copies of the same base template, the product can
be acceptable.
• Doing the same with data elements is a fairly trivial task, and writing rule-
based triggers to link these with particular phrases or paragraphs can easily
produce some of the regular news articles that fill news pages on quiet news
days.
"Today's stock market <rose|fell> by <xx>% following a day
of <heavy|light> trading with major loses in
<sector_down> and gains shown in <sector_up>."
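• A minimal sketch of this kind of template filling in Python (the field names and the volume threshold are hypothetical, not taken from the module materials):

# A toy rule-based story generator: data values trigger pre-written
# phrases that fill slots in a standard template.
def market_story(change_pct, volume, sector_down, sector_up):
    direction = "rose" if change_pct >= 0 else "fell"
    trading = "heavy" if volume > 1_000_000 else "light"  # arbitrary threshold
    return (f"Today's stock market {direction} by {abs(change_pct):.1f}% "
            f"following a day of {trading} trading with major losses in "
            f"{sector_down} and gains shown in {sector_up}.")

print(market_story(-1.2, 2_500_000, "mining", "retail"))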
• Activity 5.5 offers you a fuller example.
20
3.5 Not telling the story: data rhetoric
• The potential for data stories to be used to persuade, as
rhetorical devices, means that data stories must be treated
with care – the facts presented must be
accurate, and the story should (as far as possible) be
without known flaws or weaknesses.
21
4 Visualising the story
• Data visualizations make data accessible. They have long
been of paramount value in making relationships between
data items meaningful.
• But today, with the availability of fast processors and large
amounts of memory, users can create graphics in real
time, from data views they have created interactively.
Visualisation is now a dynamic research area.
• The phrase ‘data visualization’ automatically calls to mind
images of the pie charts, bar charts, line graphs and so
on, contained in the chart palette of many spreadsheet
applications.
22
4 Visualising the story – by tables
• But remember that a data table itself is a form of
visualization, especially as it is in the analyst’s power to
decide on the row and column order, and to select the
rows and columns to display.
• You have already met a number of tools, such as
aggregation and reshaping, for manipulating the
appearance of tables. And tables can also be elaborated
with meaningful text, coloring, and other forms of
typographical emphasis.
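• A brief sketch of the idea in pandas (the sales figures are hypothetical): reshaping and styling turn a raw table into a small visualization:

import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year":   [2014, 2015, 2014, 2015],
    "sales":  [120, 150, 90, 130],
})

# Reshape: choose rows and columns to suit the comparison we care about
table = sales.pivot(index="region", columns="year", values="sales")

# Typographical emphasis: highlight the largest value in each row
styled = table.style.highlight_max(axis=1)
styled  # renders with the emphasis applied when shown in a notebook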
23
4.1 Why visualise data?
• We use visualization to handle complex datasets, for:
• clarity
• immediacy
• focused and summary representations of large datasets.
• Visualisation can quickly draw attention to the unusual or
unexpected.
24
4.2 Visualizations: exploratory and
explanatory
• We’re used to the idea that graphs and charts and other visualizations are
used to communicate a story, to explain features of a dataset (an
explanatory visualization);
• but in Part 4 we’ve also seen their use to support the exploration
of a dataset (an exploratory visualization).
• A similar distinction is offered by Tom Steinberg (2012), who suggests
categorizing visualizations as follows:
• Story visualizations are produced to tell a story to an audience. Steinberg
gives examples of a newspaper graph showing deaths during a war, or a
map showing where within the country unemployment is highest.
• Answer visualizations are produced to supply an answer to a single
question posed by a particular person: for example, a manager asking for a
graph of product sales, or an engineer constructing a graph of stresses on a
bridge.
• Steinberg points out that in the internet age, answer visualization has
become so quick and common – for example the maps that pop up in search
engine results – that we sometimes don’t even notice them.
25
Four examples
• Of the four particular applications we’ll consider here,
namely
• to realize representations of datasets
• to communicate facts and information
• to persuade and convince
• to support analysis activities
• the first three are variations on the theme of explanatory
visualization; the fourth is a reconsideration of how
exploratory visualization contrasts with these.
26
Visualisation as representation
• Here, the aim is to give the content of the datasets
tangibility (to use Tukey’s term). However, there
is no particular intention of telling a story.
• Scientific visualizations maintain principled relationships
between the data values and the visual variables
(position, scale, colour, etc.).
• Visual properties – such as where each point is placed
on a chart or positioned relative to other points, how it is
colored, etc. – are determined in a well-defined and
regular way from a particular data value.
• Although these are often simple, sometimes the results
can be quite beautiful.
27
Figure 5.3 Plan of part of the solar system, showing the
main asteroid belt and the Greek and Trojan asteroids
28
Visualisation as representation
• Many simple graphics without descriptive text adornments
are also representational:
• point maps and bar charts often do little more than offer a
representation of the relative positions or sizes of data points.
• Labelling and ordering data for specific purposes, by drawing out
the detail to be explained, increases the explanatory power of the
visualization.
29
Visualizations to communicate facts and
information
• These visualizations are produced to illustrate or support a
case being made in the surrounding presentation or research
report: for example, as in Figure 5.4.
• Figure 5.4 Top six EU industry divisions in terms of level of
business R&D, and the three largest countries contributing to
each (2012)
• Source: [Link]on-research-and-development/2013/[Link]
• These explanatory graphics are used to communicate
something very specific that has already been discovered, with
data and visualisation techniques carefully selected to ensure
accuracy and clear communication of the facts.
30
Figure 5.4 Top six EU industry divisions in terms of level of
business R&D, and the three largest countries contributing to
each (2012)
31
Visualizations to persuade
• The aim is to ensure both that the immediate message of the
visualization supports the argument, and that it remains
valid and appropriate under additional scrutiny.
• This requires care both in selecting the elements and
datasets visualized and in using the visual cues that
help us read the graphic.
• The chart in Figure 5.5 was intended to show that
opponents of the US Medicare program had wildly
overestimated its actual future costs.
• Figure 5.5 Projected costs of Medicare at various time
points
32
Figure 5.5 Projected costs of Medicare at various time
points
33
Tricks and distortions
• Visualizations that aim to persuade are open to all kinds of
tricks and distortions.
Figure 5.6 A distorted pie chart
• The actual ratio of blue to purple to green is quite different
from what the chart suggests (see [Link]).
• The 3-D perspective tilt makes the purple
segment appear larger than the blue.
34
The infographic
• The purposes of infographics are:
• to attract interest
• to communicate some key message, or
• to entertain.
• In extreme cases, the graphic attracts attention but
simply communicates a false picture.
35
36
Visualisation in support of analysis
• Exploratory data visualization can help examine a dataset
in a wide variety of ways.
• Visualizations are just sketches of the underlying data –
disposable, perhaps after prompting further questions.
37
4.3 Fundamentals of data visualization
• Effective and truthful visualizations require a principled
mapping from data variables to visual variables.
• This requires some grasp of the theories concerning the ways
in which humans interpret visual variables to construct an
understanding.
• Three powerful ideas:
• Bertin’s visual dimensions
• the Gestalt theory of perception
• pre-attentive attributes.
38
Bertin’s visual dimensions
• The cartographer Jacques Bertin described eight visual
dimensions that apply to two-dimensional representations
(Bertin, 1983). The eight dimensions are:
• the two dimensions of space, x and y
• size
• colour
• value (intensity of colour)
• texture
• orientation
• shape.
• These are captured neatly in one of Bertin’s own graphics,
reproduced in Figure 5.7.
39
Figure 5.7 Bertin’s eight visual dimensions
40
The Gestalt theory of perception
• Gestalt theory concerns the human ability to perceive things
as wholes, even when presented with only partial views.
• When we perceive the whole, the individual parts take on
secondary importance.
• The Gestalt principles can account for the classic
face/vase dissonance (see Figure 5.8): at a given moment
we flip between seeing either a vase, or the two faces.
41
Figure 5.8 The face/vase dissonance
42
The Gestalt principles
• These can help in constructing graphics to support exploratory
visualization, by emphasizing patterns within a scene, and
in influencing a reader to see the big picture rather than the
parts. They are:
• proximity
• similarity
• closure
• continuity
• symmetry
• common cause
• figure and ground.
• The first five of these are illustrated in Figure 5.9.
43
Figure 5.9 Five Gestalt principles
44
Gestalt principles
• We perceive objects as grouped on the basis of their
proximity or similarity,
• and we see two crossed lines – rather than (say) four
short lines meeting in the middle – on the basis of an
imagined continuity.
• Symmetry leads us to see a boundary, distinguishing
between the ‘inside’ and ‘outside’ of a shape.
• Closure, however, is perhaps the most interesting of the
five. Examine the circle closely and you will see that it is not
a circle: it has a gap in its circumference, yet we still see a
circle. The human visual system tends to discount gaps in
favour of perceiving whole ‘closed’ objects.
45
Gestalt closure
Figure 5.10 shows two other demonstrations of closure:
• how many triangles do you see?
• do you see the rectangle that’s not really there?
46
Gestalt common cause
• The sixth principle, common cause (sometimes referred
to as common fate), can be seen when some elements
of an image collectively move in the same way and
appear to form a group distinct from the other elements.
This can also be demonstrated in the image in
Figure 5.11.
47
Figure 5.11 Common fate
48
Gestalt figure and ground
• Finally, the use (and misuse) of the principle of figure and
ground can be illustrated beautifully by Figure 5.12,
which we pass by with no other comment than to ask,
• ‘Which side are you on?’
Figure 5.12 Figure and ground
49
Summing up Gestalt
• Perhaps Gestalt itself is summed up in the nice graphic of
Figure 5.13.
Figure 5.13 Gestalt!
50
Pre-attentive attributes
• Attentive attributes are visual features that we perceive
only by processing the scene – this takes time and effort.
For example, how many 3s are there in the following
string?
• In fact there are ten, but you needed to look at each digit
in turn to decide if it is a 3 – this requires your attention.
How many 3s are there in the following string?
51
Pre-attentive attributes
• How many 3s are there in the following string?
• Or this one?
• The answer is still ten, in both cases, but they can be located and
counted much more quickly because they are perceived to stand out.
• This pre-attentive perception means that every digit in the sequence
no longer needs focused attention.
• Pre-attentive attributes allow us to establish focal points, and to draw
attention to specific elements in a sea of similar elements – a powerful
tool when using graphics for explanatory purposes.
• Look at how easily the pre-attentive attributes draw the attention in
Figure 5.14.
52
Pre-attentive attributes
• Form
• Colour
• Spatial position
• Movement
53
Figure 5.14 Some pre-attentive attributes
54
• Figure 5.15 illustrates how the pre-attentive attributes of
colour and form (size), together with the Gestalt principles of
continuity and proximity, have been combined to tell a story.
55
Natural visualizations: a top-down approach
• Based on a knowledge of the visual dimensions, Gestalt
principles and pre-attentive processing, we can apply one
of two approaches to visualization:
• a bottom-up approach works from the data towards the
graphic
• a top-down strategy starts with some common graphic
forms, and then tweaks them.
56
Natural visualizations: a top-down approach
• The top-down strategy is usually quicker, but may not permit the
control and veracity that bottom-up approaches afford.
• Nevertheless, it may be useful when the data has a natural
visualization. There are at least three ways to interpret the term
‘natural’ here:
1. The data might ‘naturally’ be thought of, or experienced, in a
particular visual way: ‘location data’ would probably lead one to
think of a map.
2. A particular data variable might ‘naturally’ be represented on a
particular type of visual scale. For example, a continuously varying
quantity might naturally be represented as a line.
3. The dataset might have a particular shape. There are many visualization
tools capable of straightforwardly plotting particular sorts of shape:
many visualization tools can easily generate a wide range of
graphics from tabular data, for instance.
57
Natural visualizations: a top-down approach
• Many natural visualizations can be generated using tools
and libraries accessible from within an IPython Notebook,
as well as from bespoke data-visualization tools such as
Tableau.
• Activity 5.11 will help you to get an appreciation for how
visualizations can be constructed to reveal a dataset’s
stories, and a feeling for how particular data shapes might
be construed graphically.
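• As a minimal illustration (a sketch, not one of the module activities), the ‘natural’ map visualization of location data can be produced with the folium library; the coordinates below are hypothetical:

import folium

# Hypothetical location data: (name, latitude, longitude)
places = [
    ("Site A", 52.0406, -0.7594),  # near Milton Keynes
    ("Site B", 52.0500, -0.7800),
]

# Location data 'naturally' suggests a map, so render it as one
m = folium.Map(location=[52.0406, -0.7594], zoom_start=12)
for name, lat, lon in places:
    folium.Marker([lat, lon], popup=name).add_to(m)

m.save("sites_map.html")  # open in a browser to explore interactively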
58
Bottom up: The Grammar of Graphics
• The Grammar of Graphics (GoG) is a fresh approach to
building a meaningful graphic from the bottom up.
• A top-down approach selects a graphic from a palette, and
then tweaks the graphic to illustrate the data.
• In Leland Wilkinson’s Grammar of Graphics, by contrast, the
analyst starts with the data elements and then generates an
appropriate visualization of them.
• GoG permits building graphics up from data values.
59
GoG
Seven components:
1. Data, and operators for reshaping it.
2. A coordinate system, for computing positions on the 2-D plane of
the plotting surface – usually the Cartesian coordinate system.
3. Geometric objects used to display the data values, e.g. points,
lines, bars.
4. Scales that map variables to aesthetic properties (such as colour
and size) of geometric objects.
5. Plot annotations that make it possible to read data values from
the graph.
6. Transformations that create new variables from functions of
existing variables, e.g. log-transforming a variable.
7. Statistics that optionally summarise the data. Statistics are critical
parts of certain graphics (e.g. the bar chart and histogram).
60
GoG Example
• Consider the following trivial dataset.
• Assume that we are interested in visualizing the relations
between the Value 1s and Value 3s of each of the two
Types by means of a scatterplot in which the Value 1s will
be plotted on the horizontal, x-axis and the Value 3s on
the vertical, y-axis. The shape of the points would
represent the type, ‘p’ or ‘q’, of each pair of values.
61
GoG example (cont.)
• The first step would require an operation to reshape the
data to remove the unnecessary Value 2s and re-label the
column headers as shown in Table 5.2.
62
GoG example
• Other elements of the grammar can then be used as
follows:
• Using the scales component, a linear scale and a
Cartesian coordinate system can be specified.
• The coordinate component maps the data values in
Table 5.2 to coordinates on the drawing area.
• The geometric objects representing the two Types (in this
case we’ll choose a blue circle and a red square) are
selected.
• Plot annotations are then specified to decorate the chart.
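• A minimal sketch of these steps in Python with matplotlib (Table 5.1 is not reproduced here, so the values below are hypothetical):

import matplotlib.pyplot as plt

# Hypothetical stand-in for the reshaped data of Table 5.2:
# (x, y, type) triples, where x is Value 1 and y is Value 3
data = [
    (1.0, 2.1, "p"), (2.0, 2.9, "p"), (3.0, 4.2, "p"),
    (1.5, 1.2, "q"), (2.5, 2.0, "q"), (3.5, 3.1, "q"),
]

# Scales: map each Type to a geometric object (marker) and a colour
styles = {"p": ("o", "blue"), "q": ("s", "red")}  # circle / square

fig, ax = plt.subplots()  # Cartesian coordinates on the 2-D plotting surface
for t, (marker, colour) in styles.items():
    xs = [x for x, y, typ in data if typ == t]
    ys = [y for x, y, typ in data if typ == t]
    ax.scatter(xs, ys, marker=marker, color=colour, label=f"Type {t}")

# Plot annotations make it possible to read data values from the graph
ax.set_xlabel("Value 1")
ax.set_ylabel("Value 3")
ax.legend()
plt.show()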
63
GoG example
• This gives Table 5.3 and three sets of elements,
generated by the grammar, and illustrated in Figure 5.16
(a), (b) and (c).
Figure 5.16 Elements generated by the grammar
64
Figure 5.17 The final graphic
65
Implementation of GoG
• The best-known implementation is the ggplot2 library,
developed by Hadley Wickham for the R programming language.
• There is no stable implementation of ggplot2 for Python.
• The Seaborn library borrows ideas from it.
• If you are interested, and have time, you might like to try
optional activity 5.13.
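• For comparison, a sketch of the same mapping expressed declaratively with Seaborn (using the current seaborn.scatterplot API and the same hypothetical data as the matplotlib sketch above):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The same hypothetical data as before, as a DataFrame
df = pd.DataFrame({
    "Value 1": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    "Value 3": [2.1, 2.9, 4.2, 1.2, 2.0, 3.1],
    "Type":    ["p", "p", "p", "q", "q", "q"],
})

# GoG-style declaration: variables mapped to x, y, colour and shape
sns.scatterplot(data=df, x="Value 1", y="Value 3", hue="Type", style="Type")
plt.show()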
66
4.4 Non-visual ‘visualizations’
• Tactile graphics: graphical data is conveyed by Braille-like raised
surfaces.
• Tactile maps have their own versions of Bertin’s visual dimensions,
discussed earlier. These may include size, shape, texture, height and
vibration.
• Sonification is the process of mapping data to (non-speech)
sounds, using changes in tempo, amplitude, and frequency.
• Example: the relation between the rate of clicking of a Geiger
counter and the level of radiation.
• Both tactile graphics and sonification can work well with the GoG
approach. The structure of the data is expressed in the same
abstract GoG form as in the example given above, but,
depending on the intended audience, the construction and
rendering builds to sounds or raised contours, rather than to visual
objects.
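• As a rough sketch of sonification (an illustration, not from the module materials), each hypothetical data value below is mapped to the frequency of a short tone and the result written to a WAV file:

import math
import struct
import wave

RATE = 44100               # samples per second
values = [3, 5, 2, 8, 6]   # hypothetical data series

frames = []
for v in values:
    freq = 220 + 60 * v          # map the data value to a pitch in Hz
    for i in range(RATE // 4):   # a quarter-second tone per value
        sample = math.sin(2 * math.pi * freq * i / RATE)
        frames.append(struct.pack("<h", int(sample * 32767)))

with wave.open("sonified.wav", "wb") as w:
    w.setnchannels(1)    # mono
    w.setsampwidth(2)    # 16-bit samples
    w.setframerate(RATE)
    w.writeframes(b"".join(frames))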
67
68
4.5 Working with visualizations
• Requires experience, trial and error, and exploration.
• Outline workflow for producing an adequate graphic:
• Step 0: employ an experienced data engineer and a data
visualization designer – otherwise move straight to Step 1
• Step 1: know your data and your story
• Step 2: choose which basic graphic forms to consider
• Step 3: try out the basic forms
• Step 4: adjust the elements
• Step 5: let others critique your choices – and learn from
the feedback.
Figure 5.18 A decision tree for several basic chart types
69
70
Summary of the 5-step approach
• This five-step approach will allow you to develop an
understanding of what works well, what doesn’t, and what
will work with particular data and stories.
• Many internet sites, data visualization communities and
consortia online also offer general advice on what charts
work in particular situations, and offer ‘common pitfalls’
lists.
71
5 Reporting the story
• A data investigation must include a record of activities, not
just of findings. Consequently, the typical investigation will
generate two documents, with different requirements:
• the lab notebook
• the data investigation report.
72
5.1 The lab notebook
• A diary-like record of data analysts’ activities as they
moved through the data-analysis pipeline.
• It records the data considered, techniques applied and
results obtained, as well as their reflections, observations
and partial conclusions.
• It will be the principal source of information for the final
report.
• It will contain a much greater level of detail, and will record
aspects of the investigation, such as dead-end
explorations, that would not appear in a final report.
73
5.1 The lab notebook
• The investigation may generate many other outputs
• summaries
• news articles
• blog posts
• assignment reports
• recommendations
• posters
• graphics
• opinion pieces
• slideshows
• infographics
• tweets, etc.
• These are all communications which can be used as
background to the final report, or which emerge from it.
74
5.2 The report
• The final report on a data investigation tells two stories:
• the story of the investigation itself
• the story discovered in the data.
• The data investigation report is an account, in an idealized
form, presented after the investigation is complete.
• It will allow others to verify your work and build on it, to
dispute your findings, and to make decisions on the basis
of them.
• You may be challenged to defend your findings, and the
quality, or otherwise, of your report will go a long way
towards lending them credibility.
75
An outline report structure
• A few key issues concerning the report:
• The report is a communication tool; as such, it needs to
be clear, focused and understandable.
• The focus of the report is to present the story of the
investigation. If it is also required to recommend,
persuade or educate its readership, then additional
sections or outputs specifically for those purposes may be
necessary.
• A successful outcome is not the only one worth reporting:
a failure to identify the desired information, or an
investigation too ambitious for the data available, or other
problems, are reportable as well.
76
General points about writing style:
• Keep it simple
• Keep it focused
• Consider the intended readership
• Begin with a summary
• Support the reader through long reports
77
5.3 Using IPython Notebooks for data
investigations
• IPython Notebooks can be used to meet the requirements
of both:
• the lab notebook and
• the data investigation report.
IPython Notebooks as lab notebooks
• For a short or informal investigation or analysis task, the
whole report could be completed in a single IPython
Notebook, by extending the lab notebook structure.
78
Structuring the IPython lab notebook
The Notebook can be structured as follows:
• a starting section recording your initial aims, data sources,
scope etc.
• followed by a linear record of the activities followed, along
with notes, observations and reflections, interim findings,
records of additional datasets, etc. This main section will
probably reflect some workflow or data-analysis pipeline
model, or repeated application of fragments of the
pipeline.
• You can finish with a concluding section outlining the
findings and the story you managed to find and any notes
for future work.
79
IPython Notebooks for the data investigation
report
• The same ‘complete’ and ‘executable’ features that make
the IPython Notebooks useful as lab notebooks also make
them useful vehicles for a final report.
• The Notebook can contain the same structure and content
as a traditional report, but it can also bind actual data into
the central workflow and make it executable.
• This allows a reader to follow the main story of the
investigation by:
• replaying the executable cells (an author-driven story), or
• cloning the report and then pursuing the same investigation with
different datasets, visualisations or processing (a reader-driven
story).
80
6 Summary
• In this part, we considered the broad activity of data
investigation as a means to find a story that a dataset might be
telling. Two activities were associated with the data story:
• exploratory, in which analysts elucidate the story to themselves
• explanatory, in which analysts tell this story to stakeholders.
• In this context we discussed at length how visualisations could
be utilised in both of these activities, and considered various
principles, methods and tools by means of which effective
graphical illustrations could be generated.
• The explanatory component of a data investigation really
constitutes the final phase of the data pipeline, so we ended
with a discussion of structures for effective reporting of the
findings.
81
ACTIVITIES
For part 5
82
Activity 5.1 Exploratory
• 10 minutes
• Read through Derek Willis’ blog post ‘
Take an Interviewing Approach to Find Stories in Data’.
• How does Willis view a dataset? What steps does he suggest
taking in order to get stories from the dataset?
• Discussion
• Willis treats the dataset in the same way a journalist would treat a
person they were interviewing for a story. This starts with the
background work to probe the data, and other stories around it,
and then proceeds to a ‘getting to know you’ style of interview
(probably using a descriptive analysis), including identifying the
‘dirty’ data for cleaning. Exploratory conversations can then take
place, in which the unexpected or interesting can be identified.
83
Activity 5.2 Exploratory
• 20 minutes
• Read quickly through the ‘
Quick investigation – road traffic accidents in Milton Keynes’
Notebook.
• How do you think the conversation with the featured data could
progress?
84
Activity 5.3 Exploratory
• 10 minutes
• Now watch the following narration of the story.
• The Greatest Ever Infographic (James Grime,
Numberphile)
85
Activity 5.4 Exploratory (optional)
• 30 minutes
• If you have time, try some of these examples of data-
driven interactives. They should give a feel for a range of
types of interaction. Some are standalone items, others
would be used as illustrations in, or a complement to, a
longer explanatory article.
• The Guardian
Afghanistan war logs: our selection of significant incidents
• The Guardian Mapping the riots with poverty
• The New York Times
Dissecting a Trailer: The Parts of the Film That Make the
Cut
86
Activity 5.5 Exploratory
• 10 minutes
• Read the article ‘The First News Report on the L.A. Earthquake
[17 March 2014] Was Written by a Robot’.
• What type of conversation with the data do you think was used to generate the
data for the stories in this example?
• In what ways do you think that automatically generated charts and text can be
used together?
• Discussion
• This is a very closed, interview-style conversation. A fixed set of standard
comparisons and processing generates values that trigger pre-written text
passages and fill slots in standard templates.
• Many written news reports present key, headline information and then show the
wider context and background to the headline. A rule-based process could be
developed to select appropriate visualisations to accompany standard story
templates: a graphic of the comparative sizes of the past LA earthquakes, or of
other quakes worldwide that year.
87
Activity 5.6 Exploratory
• 20 minutes
• Read through Sections 5.3 (‘In defence of statistics’) and 5.4 (‘Tactics of
persuasion’) of Ray Corrigan’s Open University OpenLearn Works unit Law, the
internet and society.
• In defence of statistics
• Tactics of persuasion
• How does Corrigan suggest that statistics can be both denigrated and defended?
What rhetorical techniques does he suggest you guard against when an author is
arguing a cause?
• Discussion
• Statistics can be misapplied, or misinterpreted or applied only to carefully selected
segments of datasets, resulting in inaccurate or inappropriate results.
• Corrigan supplies a long list of rhetorical tricks used to persuade an audience:
emotional appeal, sarcasm and innuendo, obfuscation, appealing to unverified
expert witnesses, reducing counter arguments to their absurd extremes, choosing
to present only supporting evidence, quoting out of context, drawing illogical
conclusions from sound evidence, and repetition.
88
Activity 5.7 Exploratory
• 10 minutes
• Now read the OpenLearn post ‘
When the data you don’t collect affects the data you do’. What
statistical error, or trap for the unwary, is described? Can you
identify other examples where such an error may have occurred?
Please share them in the forums.
• Discussion
• This is a good example of sample bias. The available sample told
one story, that aircraft were being hit in specific locations. However,
this was a distorted sample: some aircraft were unavailable due to
accident, or enemy action.
• The non-returning aircraft were being hit in places that prevented
their return – in other words, the very thing that the analysis was
intended to identify distorted the available sample.
89
Activity 5.8 Exploratory
• 20 minutes
• The ‘standard’ suite of statistical charts – pie charts, line charts, etc. –
represents only a few of the possible chart types that are available through
powerful charting tools. Glance through the visualisation types shown at the
following sites, to get a feel for
• the types of visualisation
• the way in which they are described.
• The Data Visualisation Catalogue
• A Periodic Table of Visualization Methods
• Don’t spend more than 20 minutes looking at these.
• Discussion
• Some of these are highly specialist representations, found in particular areas of
business or data analytics, science or engineering. Others are quite common,
although you may have noticed variations in their names, or descriptions; as in
all complex, evolving areas there can be a degree of variation in terminology.
90
Activity 5.9 Exploratory
• 5 minutes
• To get an idea of what we mean here, experiment with the
following example of this kind of macroscopic and
microscopic interactive visualisation, in which data
artist Eric Fischer worked with the Gnip team to create a
fully browsable worldwide map of local Twitter
connections and loyalty.
• [Link]
415/-0.7400
• On opening, the visualisation is centred on Milton Keynes,
but you can zoom and shift focus across a worldwide
map.
91
Activity 5.10 Notebook
• 10 minutes
• Work through Notebook 05.1 Anscombe’s Quartet -
visualising data.
92
Activity 5.11 Notebook
• 30 minutes
• Work through Notebook 05.2 Getting started with maps -
folium.
93
Activity 5.12 Social (on-going)
• If you know examples that illustrate natural
representations, or that are particularly inappropriate for
visualising a particular data type (perhaps because they
are misleading, perhaps because they are just
nonsensical in that form), please post them to the
‘“Natural” visualisations’ forum thread.
94
Activity 5.13 Exploratory (optional)
• 20 minutes
• Read through the blog post ‘A Simple Introduction to the
Graphing Philosophy of ggplot2’, which describes some key
properties of Hadley Wickham’s ggplot2 R implementation of
Wilkinson’s The Grammar of Graphics.
95
Activity 5.14 Notebook
• 40 minutes
• Work through Notebook 05.3 Getting started with
matplotlib to see how this powerful charting package can
put you in control of your visualisations.
96
Activity 5.15 Self-assessment
• 5 minutes
• Re-read Philip Guo’s blog post ‘First impressions of the IPython
Notebook’. You first met this blog posting when you worked through
the Module Guide. Remind yourself of the reasons why he says he
switched to using IPython Notebooks to record his data analysis
activities.
• Answer
• A Notebook can capture – immediately – the reasons why you planned
to do something, what you did, what happened, and any reflections on
what this told you. IPython Notebooks are self-contained and
executable, so it is possible to run and re-run the same processes with
different datasets, or clone and explore additional processing or
variations in the activities. Guo states ‘Everything related to my
analysis is located in one unified place.’ – which is probably as good as
it gets for something to work as a lab notebook!
97
EXERCISES
For part 5
98
Exercise 5.1 Exploratory
• 10 minutes
• Read the following quotes from John Tukey:
• The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop
your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why,
that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a
visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table …
• Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to
gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had
been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.
• Tukey (1990, p. 328)
• Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best
at is revealing the unanticipated; … making it easier to perceive and understand things that would otherwise be painfully complex.
…
• When we can summarize matters … simply in numbers, we hardly need a picture and often lose by going to it. When the simplest
useful summary involves many more numbers, a picture can be very helpful. …
• The main tasks of pictures are then:
• to reveal the unexpected,
• to make the complex easier to perceive.
• … The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph,
or is a list of a few numbers is not important. Tangibility is important – what we strive for most.
• Tukey (1975, p. 524–5)
• Based on these quotes, why does Tukey think we should make use of data visualisations?
• Discussion
• Tukey says graphical visualisations can offer clarity, an ability to focus the attention, and (when properly done) an immediacy that
may not appear in the base numbers. He identifies pictures as best at ‘revealing the unanticipated’ as part of a process of
simplifying the complex.
99
Exercise 5.2 Exploratory
• 5 minutes
• Consider the Twitter loyalty map in the previous activity. Is this being used as an
explanatory visualisation? In what ways could it be used as an exploratory map?
• Discussion
• The visualisation certainly shows vividly the mix of ‘local’ and ‘visitor’ tweets in
locations world-wide. And there is an attempt to show location data based on the
map view – although when I tried it the map and reported location were often out
of sync. However, none of this really explains why particular locations have more
visitor than local activity. The identification of features such as airports, major
cities and main transport routes might make it possible to make that link.
• As an exploratory visualisation, the graphic allows users to distinguish between
different areas and features on the maps, and shows up visitor hotspots which
would prompt one to zoom in and investigate. For example, one of these turns
out to be Wembley Stadium, and of course motorway services tend to show up
as visitor hotspots.
100
Exercise 5.3 Self-assessment
• 5 minutes
• What role do you think visualisation might play at each stage of the
data analysis pipeline?
• Discussion
• Here are some suggestions:
• Acquisition. To check the completeness and scope of the data; to
ascertain its overall shape.
• Preparation. To find and identify outliers (which may require cleaning);
to locate areas where the data is sparse or dense; to examine binning
and segmentation strategies.
• Analysis. To examine the results of analysis activities such as
clustering, classification and correlation.
• Presentation. To generate explanatory graphics displaying final
findings.
102
Exercise 5.4 Self-assessment
• 5 minutes
• How do you think the Gestalt principles might be useful
when engaging in data visualisation, and why?
• Discussion
• Proximity effects might help find groups or clusters;
similarity may be useful when using shape to identify
members of different groups; continuity can be used for
highlighting trends and outliers from trends; symmetry
may be useful (in association with continuity) for
distributions. Closure allows us to visualise incomplete
patterns, where perhaps data points may be missing.
103
Exercise 5.5 Self-assessment
• 5 minutes
• What story (or stories) do you perceive in Figure 5.15? How are
they shown?
• Discussion
• At least two (perfectly compatible) stories are possible here. The
first thing to notice, perhaps, is the cluster (Gestalt proximity) of
small (pre-attentive size and colour) bubbles near the bottom of the
chart, indicating the relative poverty of sub-Saharan countries, as
opposed to the relative wealth of the economies of the Americas,
Europe, East Asia and the Middle East. Note that one or two
East Asian economies seem really to stand out.
• However, a quite different story might be that a linear trend (Gestalt
continuity and common cause) can easily be detected, indicating an
apparent relationship between national wealth and life-expectancy.
104
Exercise 5.6 Exploratory
• 5 minutes
• Take a few minutes to think about what you would put in a general investigation report. The
content of your report should address the following questions:
• What did I set out to do?
• What (data) did I use?
• What did I do?
• Why did I do it?
• What effect did it have?
• What did I find out?
• (What would I do differently?)
• What conclusions/recommendations can be drawn?
• Now sketch the headings of a possible investigation report, based on these questions and
on the discussions above. For each heading, add a few notes on what would be included
under that heading.
• Discussion
• The file TM351_Report_Outline.docx contains one possible general-purpose reporting
structure. It might need to be adapted for specific requirements, or for different readerships,
but it does cover the key aspects of a data investigation report.
105
PART 6
With data comes responsibility
106
Overview of Part 6
Contents
1 Introduction
Aims
Workload
With data comes responsibility
2 The legal framework
2.1 UK Data Protection and Freedom of Information Acts
2.2 Related UK legislation
2.3 Data protection law in other countries
2.4 The consequences of differing perspectives on data protection
3 Managing personal data
3.1 Primary responsibilities
3.2 Preventing disclosure
3.3 When personal data must be published
4 Identity and (re-)identification
4.1 The notion of identity
4.2 Identification
4.3 Anonymisation
4.4 The fallibility of anonymisation: mosaics, rainbows and triangulation
4.5 Responsibilities following re-identification
5 Case study: the challenge of anonymisation
5.1 Scenario
5.2 Data release
5.3 Other data releases
5.4 Publicly available data
Note: Red-marked sections are optional
5.5 A sting in the tail
5.6 Conclusions
6 Summary
107
Aims
• After studying this part of the module you will be able to:
• explain the scope and applicability of legislation relevant to
data management
• navigate and apply the guidance available from sources
such as the Information Commissioner’s Office (ICO)
• evaluate data management practice against UK data
protection requirements
• discuss the ethical issues surrounding the management of
personal data
• describe the need to protect identity when sharing or
disclosing data
• explain the principles and challenges of effective data
anonymisation.
108
With data comes responsibility
• There are fundamental legal, professional and ethical issues
to be considered when managing and manipulating data,
especially if the data refers to identifiable people or represents
commercially or otherwise sensitive information.
• In this part, we explore:
• the UK legal context for processing data
• the professional and ethical measures that mitigate the potential
impact of the limitations of the legal requirements.
• The area is complex, and is best navigated by a lawyer
specializing in IT law.
• However, you need to be aware of:
• the relevant legislation
• the underlying principles
• the various guidance notes that interpret the legal framework.
109
2 The legal framework
• We refer primarily to:
• UK laws relating to data protection and freedom of information
• European Union (EU) principles and directives that those laws
enact
• Aspects of data management covered by legislation
include:
• Data protection (covered)
• Freedom of information (covered)
• Computer misuse
• Copyright
• Fair use
• Licensing
• Collection, retention and use of personal data for investigative
purposes.
110
2.1 UK Data Protection and Freedom of
Information Acts
Data Protection Act key aspects (Guide to Data Protection (GTDP) (
ICO, 2015a)):
• the definition of ‘personal data’
• the definition of ‘sensitive personal data’
• the definitions of ‘data subject’, ‘data controller’, ‘data processor’
and ‘third party’
• the data protection principles
• the restrictions on disclosure
• key exemptions from the disclosure provisions.
Freedom of Information Act key aspects (Guide to Freedom of
Information (GTFOI) (ICO, 2015b)):
• the scope of the act
• how it interacts with the Data Protection Act.
111
2.2 Related UK legislation
• UK law often evolves as a result of multiple small
amendments to the relevant Act that are enacted as:
• part of completely separate pieces of legislation, or
• by means of secondary legislation (by the executive branch) known
as statutory instruments.
• Once enacted, legislation and its application and
interpretation may be the subject of consultation over a
protracted period.
• Documents relating to specific consultations are normally
published on the ICO website.
112
2.3 Data protection law in other countries
• The UK legislation was updated to take account of the
1995 European Data Protection Directive and
subsequent regulations (Further reading).
• One consequence is the addition of the principle:
• Personal data shall not be transferred to a country or
territory outside the European Economic Area unless that
country or territory ensures an adequate level of
protection for the rights and freedoms of data subjects in
relation to the processing of personal data. GTDP (
ICO, 2015a, p. 15)
• US laws are considered "very different" from UK laws.
• Brexit will have some impact on UK legislation.
113
2.4 The consequences of differing
perspectives on data protection
• Personal data may be transferred from an EU country to
the US, but only to organisations that have signed up to a
set of principles that looks quite similar to those specified
in EU law.
• But, is there any need to be worried by the differences in
approach?
• Activity 6.7 will clarify this question
114
3 Managing personal data
3.1 Primary responsibilities
• Primary responsibilities for processing personal data:
1. Identity
• Ensuring that the identity of the data subject is
unambiguous and correct.
2. Accuracy
• Ensuring that the data about the data subject is – as far
as is reasonably possible – correct, current and – where
completeness is relevant for the declared processing
‘purposes’ – complete.
3. Security
• Ensuring that the personal data is kept securely, that it is
not available to either casual or malevolent snooping.
115
3 Managing personal data
3.1 Primary responsibilities
• Primary responsibilities for processing personal data:
4. Disclosure
• Ensuring that the personal data is not disclosed, in a form
in which the data subject(s) may be identified, to anyone
outside the organization responsible for the data, or, in
particular, to anyone in a country that does not enjoy a
similar level of statutory data protection.
• Exception: the absolute right for any data subject to
request a copy of the data that any organization holds
about them (a ‘data subject access request’).
• Any system for holding personal data must be capable of
responding to a data subject access request.
116
3 Managing personal data
3.1 Primary responsibilities
• Primary responsibilities for processing personal data:
5. Ethical/legal use/collection/retention
• Ensuring that:
• the personal data is used only for the declared purposes, or
variants thereof,
• that those uses are both legal and ethical, and
• that only data necessary for those purposes is collected and
retained, and
• is collected in an ethical manner.
117
The ethical requirements
• These are the most challenging.
• In its simplest form, ‘ethical use’ requires that no harm is
occasioned to the data subject as a result of processing
their personal data.
• We look at the BCS and ACM codes of conduct.
118
The ethical requirements: BCS
• See, for example, the BCS Code of Conduct (BCS, 2011),
which requires, among other things, that BCS members
shall:
1. Public Interest
a. have due regard for public health, privacy, security and wellbeing
of others …
b. have due regard for the legitimate rights of Third Parties
2. Professional Competence and Integrity
f. avoid injuring others, their property, reputation, or employment by
false or malicious or negligent action or inaction.
BCS, 2011
119
The ethical requirements: ACM
• The ACM Code of Ethics and Professional Conduct stipulates similar
requirements (ACM, 1992).
• Consider an automated loan-application decision. If the decision is not to
grant the loan, on the basis of some affordability criteria
designed for an ‘average’ family, would that refusal be ethical?
• What if the affordability criteria were not correct for the applicant?
• Or the consequence of the refusal were that the applicant were to be driven to
taking out a more expensive – and less affordable – loan?
• Or perhaps even if the decision were influenced by information obtained from,
say, social media postings from several years previously, would that still be
ethical?
• Indeed, is trawling social media to find possibly incriminating lifestyle evidence
(which amounts to obtaining personal data) ethical?
• And one final question: just because you can find data online, does that mean
you can use it? Might it be copyright?
• Or might the data subjects have had a reasonable expectation of privacy?
• Or might the means you have adopted to obtain the data constitute computer
misuse?
120
The ethical requirements: ACM
• The Computer Misuse Act 1990 (CMA90) defines several offences, the
first of which is unauthorised access to computer material. CMA90
states that:
• A person is guilty of an offence if—
• he causes a computer to perform any function with intent to secure access to any
program or data held in any computer, or to enable any such access to be
secured;
• the access he intends to secure, or to enable to be secured, is unauthorised; and
• he knows at the time when he causes the computer to perform the function that
that is the case.
• Great Britain (1990), Section 1
• So, if you knowingly do something in order to get access to information
which you have no right to see, even if it is apparently readily
accessible, it is likely that you are committing an offence under
CMA90. If the information you access is also personal data, you may
also be breaching DPA98.
121
legal and ethical questions
• The boundaries of accepted norms for legal and ethical
questions such as these can become somewhat murky.
The data protection principles might appear to be simple
and comprehensive; but simplicity and ethical standards
may not always make comfortable bedfellows.
• It follows that professionals responsible for managing – or
obtaining – personal data need to have a clear
understanding of their responsibilities, and also of the
limitations of their own (legal) expertise that frames their
discharge of those responsibilities.
122
3.2 Preventing disclosure
• The seventh data protection principle requires that:
• Appropriate technical and organisational measures shall
be taken against unauthorised or unlawful processing of
personal data and against accidental loss or destruction
of, or damage to, personal data.
• GTDP (ICO, 2015a, p. 15)
123
3.3 When personal data must be published
• We are not considering responses to data subject access
requests here.
• When data for two or more individuals are held together, it
may be necessary to redact sufficient data so that the
identity of the third party is not disclosed.
• There can be a fundamental tension between FOIA2000
and DPA98.
124
4 Identity and (re-)identification
• A primary duty of a data controller – for personal data – is
not to permit disclosure of any such data, in an identifiable
form, except in very specific circumstances, to anyone
other than the data subject.
• In this section, we shall explore:
• what constitutes 'disclosure' and 'identification'
• the approaches available to conceal the identity of data subjects.
125
4 Identity and (re-)identification
• In Part 23, we shall explore some of the technical
measures that may be adopted in data management
systems to protect data from unauthorised access and
modification.
126
4.1 The notion of identity
• two aspects of identity: identification and verification (or
authentication).
• A name on its own may be insufficient to identify a person.
• A combination of values (e.g. name and address, or name
and date of birth) is needed; but there is still a risk of
duplication.
• Even driver number and National Insurance number do
not solve the problem completely:
• if somebody were to address you by your National Insurance
number, would you know to whom they were talking?
127
4.2 Identification
• If the subject can be identified, then the personal data has been disclosed.
• Whether or not a data subject can be identified depends both on the data attributes that are published and
on other data that is already in the public domain.
• Given that unambiguous identification is likely to depend on several attributes, publishing only a subset of
the attributes required for identification should not, in itself, allow a data subject to be identified. For example,
publishing data about individuals including only their given names is unlikely to lead to their identification; but
publication of, say, their given names and partial addresses (say, road name and house number) may well be sufficient.
• A crucial consideration here is the scope of the published data. If it relates only to a restricted area, such
as a single village or town, then ‘Jim’ who lives at ‘43, High Street’ may constitute sufficient information to
identify a particular individual. However, if there is no such geographical restriction, there may be several
– or even many – individuals called Jim with that address, but who live in different towns.
• Even though this may be the case for an address like ‘xx High Street’, where the road name is common,
this would not apply for less common street names. For example, according to Google Maps, there is only
one Montpelier Square in the UK, which happens to be in Knightsbridge, London. So ‘Jim’, living at ‘67,
Montpelier Square’ (if such an address were to exist), could well be identifiable.
• It follows that care is needed to ensure that, if data is to be published for a set of individuals, then the
attributes released should not permit the identification of any data subjects. It is not sufficient to check that
the attributes released do not allow some particular individuals to be identified: that criterion must be
satisfied for all the data subjects involved.
• Furthermore, as noted in Activity 6.10, the requirement is that the data subject should not be identifiable
from data released in combination with other information already available to the recipient of the data.
This may include data such as telephone directories, electoral registers or even membership data for local
residents’ associations.
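• The effect of combining attributes can be illustrated with a minimal pandas sketch; the data, names and addresses below are entirely synthetic and invented for illustration:

```python
# A minimal sketch (synthetic data; all names, addresses and column
# names are invented) of how combining quasi-identifiers narrows the
# set of matching individuals.
import pandas as pd

people = pd.DataFrame([
    {'given_name': 'Jim', 'road': 'High Street',       'house_no': 43, 'town': 'Ambridge'},
    {'given_name': 'Jim', 'road': 'High Street',       'house_no': 43, 'town': 'Borchester'},
    {'given_name': 'Jim', 'road': 'Montpelier Square', 'house_no': 67, 'town': 'London'},
])

# Given name alone: several candidates remain.
print(len(people[people['given_name'] == 'Jim']))    # 3

# Given name plus a common road name and house number: still ambiguous.
common = people[(people['given_name'] == 'Jim') &
                (people['road'] == 'High Street') &
                (people['house_no'] == 43)]
print(len(common))                                   # 2

# Given name plus a rare road name: a single individual is picked out.
rare = people[(people['given_name'] == 'Jim') &
              (people['road'] == 'Montpelier Square')]
print(len(rare))                                     # 1
```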
128
4.3 Anonymisation
• Three approaches to hiding the identity of a data subject:
• Anonymisation
• Pseudonymisation
• Aggregation
• These techniques can also be applied where it is desired
that data, in a non-identifiable form, should be made
available for statistical research and data analysis.
• Check the ICO’s Anonymisation: managing data
protection risk code of practice (2012).
• It describes when to anonymise and publish personal data.
• The Appendices and Annexes provide an overview of the key
anonymisation techniques.
129
4.3 Anonymisation
• Requires suppression of all data attributes that could be
used, either alone or in combination, to identify the data
subject.
• The challenge is ensuring that there are no other attributes
already in the public domain that could be combined,
somehow, with the disclosed data to re-identify the data
subjects.
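• A minimal sketch of suppression in pandas, assuming an invented DataFrame; deciding which attributes are identifying (alone or in combination) is the hard part in practice:

```python
# A minimal sketch (synthetic data; column names are invented) of
# anonymisation by suppressing directly identifying attributes.
import pandas as pd

records = pd.DataFrame([
    {'name': 'Alice Smith', 'postcode': 'MK7 6AA', 'age': 34, 'condition': 'asthma'},
    {'name': 'Bob Jones',   'postcode': 'MK7 6AB', 'age': 57, 'condition': 'diabetes'},
])

# Suppress the attributes judged capable of identifying the subjects.
anonymised = records.drop(columns=['name', 'postcode'])
print(anonymised)
#    age condition
# 0   34    asthma
# 1   57  diabetes
```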
130
4.3 Anonymisation
• Pseudonymisation replaces all identifying attributes with
a pseudo-identifier.
• Pseudo-identifiers are either allocated to a particular
individual or calculated from one or more identifying
attributes.
• Allows data records referring to particular individuals to be
linked, and patterns extracted, without identifying
subjects.
• Risk: linking the records of an (unknown) individual builds
up more detail about that individual than any single record
would reveal.
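• A minimal sketch of a calculated pseudo-identifier, using a keyed (HMAC) hash; the secret key, column names and identifier values are all invented for illustration:

```python
# A minimal sketch of pseudonymisation: a keyed hash gives the same
# pseudo-identifier for the same input, so one individual's records
# can still be linked without revealing who they are.
import hashlib
import hmac
import pandas as pd

SECRET = b'known-only-to-the-data-controller'   # invented; never published

def pseudonymise(ni_number: str) -> str:
    return hmac.new(SECRET, ni_number.encode(), hashlib.sha256).hexdigest()[:12]

records = pd.DataFrame([
    {'ni_number': 'QQ123456A', 'visit': 1},
    {'ni_number': 'QQ123456A', 'visit': 2},
    {'ni_number': 'QQ654321B', 'visit': 1},
])

records['pseudo_id'] = records['ni_number'].map(pseudonymise)
published = records.drop(columns=['ni_number'])   # release only pseudonymised data
print(published)
```

• Keying the hash matters: an unkeyed hash of a guessable identifier is vulnerable to the rainbow-table attack discussed in Section 4.4.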
131
4.3 Anonymisation
• Aggregation of data – to sum, average, or perform some
other aggregate function on the data for several
individuals.
• Depending on the number of individuals, it can then be
very difficult to re-identify the data for a particular
individual.
• A small group size (a low count) can still carry an increased risk of re-
identification.
• Techniques to reduce this risk:
• suppression, where aggregate values (typically counts) below a
certain threshold are not published
• blurring, such as barnardisation, where small aggregate values
have random noise added to them.
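• A minimal sketch (synthetic data; the threshold is an assumed policy value) combining aggregation with both techniques:

```python
# A minimal sketch of aggregation with two disclosure controls:
# suppression of small counts, and barnardisation (random +/-1 noise),
# here applied to all surviving counts for simplicity.
import random
import pandas as pd

THRESHOLD = 5   # assumed publication policy: counts below this are suppressed

cases = pd.DataFrame({'town': ['Ambridge'] * 12 + ['Borchester'] * 2})

counts = cases.groupby('town').size().rename('count').reset_index()

# Suppression: counts below the threshold are not published (become NaN).
counts.loc[counts['count'] < THRESHOLD, 'count'] = float('nan')

# Barnardisation: add random noise of -1, 0 or +1 to the surviving counts.
counts['count'] = counts['count'].map(
    lambda n: n if pd.isna(n) else max(0, int(n) + random.choice([-1, 0, 1])))
print(counts)
```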
132
4.4 The fallibility of anonymisation: mosaics,
rainbows and triangulation
• The ICO Anonymisation Code of Practice asserts,
explicitly, in Section 3 that:
• It can be impossible to assess re-identification risk with
absolute certainty. (ICO, 2012, p. 18)
• It may be possible to re-identify data – that is, to identify the
corresponding data subject – by:
• piecing together data in different, but related, datasets (mosaics)
• combining independent sources (triangulation)
• directly attacking a particular pseudonymisation process
(rainbows).
133
4.4 The fallibility of anonymisation: mosaics,
rainbows and triangulation
• The problem with pseudonymisation:
• It is, in principle, possible to build a lookup (‘rainbow’)
table to re-identify the data (see the sketch below).
• Key issue: are there clues that can help an attacker guess
the format of the real identifier and hence reduce the
computational complexity of building a rainbow table?
• It should not be ‘reasonably likely’ that an intruder would
be able to re-identify the data subjects.
• It is necessary to consider what other data may already be
in the public domain.
• This will include official data such as telephone directories
and the electoral register.
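• A minimal sketch of the rainbow-table idea, assuming (purely for illustration) an unkeyed hash over a tiny, guessable identifier space:

```python
# A minimal sketch: if pseudo-identifiers are an unkeyed hash of an
# identifier with a guessable format, an attacker can enumerate the
# whole identifier space and build a reverse lookup ('rainbow') table.
import hashlib

def weak_pseudonymise(identifier: str) -> str:
    # Unkeyed: anyone can recompute this for any guessed identifier.
    return hashlib.sha256(identifier.encode()).hexdigest()

# Assume the attacker knows identifiers are 4-digit codes (a deliberately
# tiny, invented space to keep the sketch fast).
rainbow = {weak_pseudonymise(f'{n:04d}'): f'{n:04d}' for n in range(10_000)}

leaked = weak_pseudonymise('0042')   # a pseudo-identifier seen in published data
print(rainbow[leaked])               # '0042' -- the data subject is re-identified
```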
134
4.5 Responsibilities following re-identification
• Should an individual successfully re-identify any data, they
themselves become the data controller for that re-identified
data, and are subject to all of the obligations specified in
DPA98 – including:
• not disclosing that re-identified data, and
• not processing it in an unfair or unlawful manner.
• Hence, they could also be liable to prosecution should
they fail in these obligations.
135
6 Summary
• In this part, we have explored the legal framework for managing data and, in particular, personal
data.
• Building on the introduction to the UK Data Protection Act (1998), and the Freedom of Information
Act (2000), we have explored the obligations placed on data controllers to both protect and, in
certain circumstances, publish data. We have related the legal obligations to the ethical
requirements of the codes of conduct for IT professionals produced by ACM and BCS. The
broader context, represented by European Union data directives, and the differences between
national approaches to protection for personally identifiable data, have also been explored.
• It has been emphasised that the legal context for data management is changing, and that changes
can be fast, far-reaching and unexpected. Hence, the principles explored in this part are precisely
that – principles – and reflect the legal position in 2015; current advice should be sought from
organisations such as the Information Commissioner’s Office before making any major decisions
affecting data protection or freedom of information.
• Accepting that it will sometimes be necessary – perhaps in response to a Freedom of Information
request – to publish personal data, we have exposed some of the issues and challenges
associated with anonymising personal data.
• The principal references for this part have been the many guides produced by the UK’s
Information Commissioner’s Office; you should now be familiar with the general content and scope
of these guides, and their relevance to any data analysis work you might undertake in the future.
• Finally, by means of a small case study based on synthetic data, we have demonstrated the
vulnerability to re-identification of even properly anonymised data.
136
ACTIVITIES
For part 6
137
Activity 6.1 Exploratory
• Up to 30 minutes
• Locate the relevant sections of the Guide to Data
Protection and the Guide to Freedom of Information to
expand the preceding bullet points.
• Note that, although you should make yourself familiar with
these concepts (not just where to look them up), you
should not, for this activity, need to read the whole of
either of the ICO Guides.
• Contents are accessible directly from the front page of the
web Guides above. Page numbers in the discussion
below refer to the PDF downloads of the complete guides.
(Refer to the TM351 learning materials for a complete discussion)
138
Activity 6.2 Exploratory
• 5 minutes
• Read the summary of this set of changes at:
• Protection of Freedoms Bill: Fact Sheet – Part 6 (Home Office, 2011).
139
Activity 6.3 Exploratory
• 5 minutes
• Read the overview of the general approach to data
protection, agreed in June 2015:
• Data Protection: Council agrees on a general approach (European Council, 2015).
140
Activity 6.4 Exploratory
• 10 minutes
• Read the summary of the ‘safe harbor scheme’ at:
• The US Safe Harbor Scheme (Pinsent Masons, 2015a).
• It is particularly telling that the EU considers that only a handful
of countries, listed in GTDP (ICO, 2015a, pp. 87–8), have an
adequate level of protection for personal data; and, even with
the ‘safe harbor’ agreement, the US is not included in this list.
Nor, for that matter, are countries to which operational data
processing is routinely outsourced, such as India.
• Furthermore, in a recent (October 2015) judgment, the Court of
Justice of the European Union ruled that data transfers
between EU countries and the US, under the safe harbor
scheme, are ‘invalid’.
141
Activity 6.5 Exploratory
• 5 minutes
• Read the summary of the Court of Justice of the EU’s
ruling:
• EU–US data transfers ‘invalid’ if made under ‘safe harbor’
regime, rules EU court
(Pinsent Masons, 2015b).
• In effect, the Court ruled that the original recognition of
safe harbor by the European Commission was illegal,
and, therefore, that safe harbor is now ‘suspended’.
Although discussions on a new version of safe harbor are
apparently progressing, the advice from the ICO is
summarised in the ICO blog.
142
Activity 6.6 Exploratory
• 10 minutes
• Read David Smith’s article from the ICO blog, which gives
the ICO’s advice regarding ‘safe harbor’:
• The US Safe Harbor – breached but perhaps not destroyed!
(Smith, 2015).
• Note these key points in the ICO advice:
• ‘Don’t panic and don’t rush to other transfer mechanisms
that may turn out to be less than ideal.’
• ‘The first thing for businesses to do is take stock.’
• ‘The next few months will be critical.’
• One might be tempted to summarise by noting that IT law
can change even more rapidly – and unpredictably – than
computing technology!
143
Activity 6.7 Self-assessment
• Up to 40 minutes
• Read the following account of how personal data about children is
collected, legally according to the ‘patchwork quilt’ of US
legislation, and repurposed in various ways:
• Data mining your children (Simon, 2014).
• Is what is happening, according to this article, consistent with the
data protection principles enshrined in UK data protection law?
What discrepancies are there? Would those discrepancies have
been covered by exemptions stipulated in the Data Protection Act
1998?
• Putting the question more directly, could this have happened
under UK law? Should it have happened, had UK law applied?
• And, the final question is, would it have happened?
(Refer to the TM351 learning materials for a complete discussion)
144
Activity 6.8 Self-assessment
• Up to 15 minutes
• Read the following account of a security breach on a well-used website at:
• Booking site [Link] in ‘appalling’ data leak (Lee, 2014b).
• Would anyone accessing and exploiting data in the manner described in the report be likely
to be contravening CMA90, DPA98, or both?
• Apart from questions of legality, would such an action – perhaps by a bank considering
whether or not to offer such a person a loan – be ethical?
• Discussion
• The only safe answer is – consult a legal adviser!
• However, it would seem that the action of deliberately altering a (personal) identifying code
in a web page address to find personal data about somebody else must constitute an
offence under CMA90.
• It would seem also to violate the first data protection principle – because data that is
obtained illegally cannot be processed legally. (See GTDP ‘Principle 1 – fair and lawful:
What is meant by “lawful”?’ (ICO, 2015a, p. 22).)
• As for the question of ethics, seeking information which an individual has a reasonable
expectation should remain private seems to fall far short of any reasonable standards for
acceptable professional conduct, and should that information then be used to harm or
disadvantage the data subject, it would surely also be unethical.
145
Activity 6.9 Exploratory
• 5 minutes
• To understand the scope of the tensions involved, read
the ‘boxed’ overview on pages 2–4 of S40PI (ICO, 2014).
146
Activity 6.10 Self-assessment
• 10 minutes
• How does DPA98 actually define personal data?
• Therefore, is simple anonymisation of personal data prior to publication always sufficient to
render it non-personal?
• Discussion
• The strict definition of personal data, in DPA98, is:
• ‘personal data’ means data which relate to a living individual who can be identified –
• from those data, or
• from those data and other information which is in the possession of, or is likely to come into
the possession of, the data controller,
• and includes any expression of opinion about the individual and any indication of the
intentions of the data controller or any other person in respect of the individual.
• GTDP (ICO, 2015b, p. 6)
• The key phrase in this definition is, ‘and other information which is in the possession of, or is
likely to come into the possession of’.
• It follows that it may not be sufficient simply to anonymise data to prevent identification, as
there may be other information, already in the public domain or readily obtainable by the
recipient of the data, that would enable re-identification of the ‘anonymised’ data.
147
Activity 6.11 Self-assessment
• 10 minutes
• How many code-like identifiers are you aware of having been given? Are they unique globally, or only within
the scope of the corresponding system (such as UK driving licences, or the UK benefits system)? Might any of
them change over time? Could it make sense to use any of them, on their own, in real-world communications?
• Discussion
• Individuals may have several surrogate identifiers, including, for example:
• National Insurance (NI) number
• National Health Service (NHS) Number
• payroll number
• OU personal identifier (PI)
• driver number (from DVLA)
• passport number
• mobile phone number.
• While most of these are identifiers, in the sense that each is unique for a particular individual, only the NI
number, NHS Number and driver number are intended never to change over time. In fact, even an NHS Number can
change – the UK replaced everyone’s NHS Number in the late 1970s, when the capacity of the previous coding
system was exhausted. There is no real guarantee that the same fate might not befall other identifiers.
• As to whether the identifiers are actually unique globally, rather than within the scope of the UK, that seems rather
unlikely. For example, a UK passport number is only nine digits – but there are more than 10⁹ people in the world.
Furthermore, the passport number changes whenever a passport is renewed, at intervals of no more than
10 years.
• And as for using these numbers in communications, apart from confirming identity on formal documents, it is
unlikely that anyone would be happy to be addressed by a nine-digit number rather than by their name.
148
Activity 6.12 Self-assessment
• 10 minutes
• Imagine that you want to label your car key, so that it can be
given to some authorised person, such as your garage, but
identified as that for your car. Then, imagine that you lose that
key, complete with its label. Depending on how you have
labelled it, how easy might it be for the finder (a) to identify you
to return the key or (b) to identify the car … to steal it?
• You might have used:
• name
• address
• car registration
• telephone number
• car make, model and colour.
149
Activity 6.13 Exploratory
• Up to 20 minutes
• Download the ICO’s
Anonymisation: managing data protection risk code of practice (
ICO, 2012).
• Familiarise yourself with the content and scope of the code of practice
by reading the ‘Key points’ for each of Sections 1 to 8. You should not
need to read the whole of the guide in order to understand what it
covers and where to find particular guidance.
• Note, in particular, the ‘Key points’ for Sections 2, 3 and 7 of the code
of practice. You may find it useful to make notes from Section 5, which
offers useful background for the case study later in this part. Note also
the description of the ‘motivated intruder’ test in Section 3 of the code
(pp. 22–4): this is the key benchmark against which it is (currently)
necessary to judge the anonymisation of personal data.
• Read Appendix 2 (pp. 51–3) which summarises the key anonymisation
techniques.
150
EXERCISES
For part 6
151
• There are no exercises for this part