
Statistical Data Analysis Explained

Statistical Data Analysis Explained
Applied Environmental Statistics with R

Clemens Reimann
Geological Survey of Norway
Peter Filzmoser
Vienna University of Technology
Robert G. Garrett
Geological Survey of Canada
Rudolf Dutter
Vienna University of Technology
Copyright © 2008 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under
the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright
Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the
Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The
Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed
to (+44) 1243 770620.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and
product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective
owners. The Publisher is not associated with any product or vendor mentioned in this book.
This publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional
advice or other expert assistance is required, the services of a competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, L5R 4J3
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN 978-0-470-98581-6

Typeset in 10/12 pt Times by Thomson Digital, Noida, India


Printed and bound in Great Britain by Antony Rowe Ltd., Chippenham, Wilts
This book is printed on acid-free paper
Contents

Preface xiii

Acknowledgements xv

About the authors xvii

1 Introduction 1
1.1 The Kola Ecogeochemistry Project 5
1.1.1 Short description of the Kola Project survey area 6
1.1.2 Sampling and characteristics of the different sample materials 9
1.1.3 Sample preparation and chemical analysis 11

2 Preparing the Data for Use in R and DAS+R 13


2.1 Required data format for import into R and DAS+R 14
2.2 The detection limit problem 17
2.3 Missing values 20
2.4 Some "typical" problems encountered when editing a laboratory data report
file to a DAS+R file 21
2.4.1 Sample identification 22
2.4.2 Reporting units 22
2.4.3 Variable names 23
2.4.4 Results below the detection limit 23
2.4.5 Handling of missing values 24
2.4.6 File structure 24
2.4.7 Quality control samples 25
2.4.8 Geographical coordinates, further editing and some unpleasant
limitations of spreadsheet programs 25
2.5 Appending and linking data files 25
2.6 Requirements for a geochemical database 27
2.7 Summary 28

3 Graphics to Display the Data Distribution 29


3.1 The one-dimensional scatterplot 29
3.2 The histogram 31
3.3 The density trace 34
3.4 Plots of the distribution function 35
3.4.1 Plot of the cumulative distribution function (CDF-plot) 35
3.4.2 Plot of the empirical cumulative distribution function
(ECDF-plot) 36
3.4.3 The quantile-quantile plot (QQ-plot) 36
3.4.4 The cumulative probability plot (CP-plot) 39
3.4.5 The probability-probability plot (PP-plot) 40
3.4.6 Discussion of the distribution function plots 41
3.5 Boxplots 41
3.5.1 The Tukey boxplot 42
3.5.2 The log-boxplot 44
3.5.3 The percentile-based boxplot and the box-and-whisker plot 46
3.5.4 The notched boxplot 47
3.6 Combination of histogram, density trace, one-dimensional scatterplot,
boxplot, and ECDF-plot 48
3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot,
and CP-plot 49
3.8 Summary 50

4 Statistical Distribution Measures 51

4.1 Central value 51


4.1.1 The arithmetic mean 51
4.1.2 The geometric mean 52
4.1.3 The mode 52
4.1.4 The median 52
4.1.5 Trimmed mean and other robust measures of the central
value 53
4.1.6 Influence of the shape of the data distribution 53
4.2 Measures of spread 56
4.2.1 The range 56
4.2.2 The interquartile range (IQR) 56
4.2.3 The standard deviation 57
4.2.4 The median absolute deviation (MAD) 57
4.2.5 Variance 58
4.2.6 The coefficient of variation (CV) 58
4.2.7 The robust coefficient of variation (CVR) 59
4.3 Quartiles, quantiles and percentiles 59
4.4 Skewness 59

4.5 Kurtosis 59
4.6 Summary table of statistical distribution measures 60
4.7 Summary 60

5 Mapping Spatial Data 63


5.1 Map coordinate systems (map projection) 64
5.2 Map scale 65
5.3 Choice of the base map for geochemical mapping 66
5.4 Mapping geochemical data with proportional dots 68
5.5 Mapping geochemical data using classes 69
5.5.1 Choice of symbols for geochemical mapping 70
5.5.2 Percentile classes 71
5.5.3 Boxplot classes 71
5.5.4 Use of ECDF- and CP-plot to select classes for mapping 74
5.6 Surface maps constructed with smoothing techniques 74
5.7 Surface maps constructed with kriging 76
5.7.1 Construction of the (semi)variogram 76
5.7.2 Quality criteria for semivariograms 79
5.7.3 Mapping based on the semivariogram (kriging) 79
5.7.4 Possible problems with semivariogram estimation and kriging 80
5.8 Colour maps 82
5.9 Some common mistakes in geochemical mapping 84
5.9.1 Map scale 84
5.9.2 Base map 84
5.9.3 Symbol set 84
5.9.4 Scaling of symbol size 84
5.9.5 Class selection 86
5.10 Summary 88

6 Further Graphics for Exploratory Data Analysis 91


6.1 Scatterplots (xy-plots) 91
6.1.1 Scatterplots with user-defined lines or fields 92
6.2 Linear regression lines 93
6.3 Time trends 95
6.4 Spatial trends 97
6.5 Spatial distance plot 99
6.6 Spiderplots (normalised multi-element diagrams) 101
6.7 Scatterplot matrix 102
6.8 Ternary plots 103
6.9 Summary 106

7 Defining Background and Threshold, Identification of Data Outliers and Element Sources 107

7.1 Statistical methods to identify extreme values and data outliers 108

7.1.1 Classical statistics 108


7.1.2 The boxplot 109
7.1.3 Robust statistics 110
7.1.4 Percentiles 111
7.1.5 Can the range of background be calculated? 112
7.2 Detecting outliers and extreme values in the ECDF- or CP-plot 112
7.3 Including the spatial distribution in the definition of background 114
7.3.1 Using geochemical maps to identify a reasonable threshold 114
7.3.2 The concentration-area plot 115
7.3.3 Spatial trend analysis 118
7.3.4 Multiple background populations in one data set 119
7.4 Methods to distinguish geogenic from anthropogenic element sources 120
7.4.1 The TOP/BOT-ratio 120
7.4.2 Enrichment factors (EFs) 121
7.4.3 Mineralogical versus chemical methods 128
7.5 Summary 128

8 Comparing Data in Tables and Graphics 129


8.1 Comparing data in tables 129
8.2 Graphical comparison of the data distributions of several data sets 133
8.3 Comparing the spatial data structure 136
8.4 Subset creation – a mighty tool in graphical data analysis 138
8.5 Data subsets in scatterplots 141
8.6 Data subsets in time and spatial trend diagrams 142
8.7 Data subsets in ternary plots 144
8.8 Data subsets in the scatterplot matrix 146
8.9 Data subsets in maps 147
8.10 Summary 148

9 Comparing Data Using Statistical Tests 149


9.1 Tests for distribution (Kolmogorov–Smirnov and Shapiro–Wilk tests) 150
9.1.1 The Kola data set and the normal or lognormal distribution 151
9.2 The one-sample t-test (test for the central value) 154
9.3 Wilcoxon signed-rank test 156
9.4 Comparing two central values of the distributions of independent data groups 157
9.4.1 The two-sample t-test 157
9.4.2 The Wilcoxon rank sum test 158
9.5 Comparing two central values of matched pairs of data 158
9.5.1 The paired t-test 158
9.5.2 The Wilcoxon test 160
9.6 Comparing the variance of two data sets 160
9.6.1 The F-test 160
9.6.2 The Ansari–Bradley test 160

9.7 Comparing several central values 161


9.7.1 One-way analysis of variance (ANOVA) 161
9.7.2 Kruskal-Wallis test 161
9.8 Comparing the variance of several data groups 161
9.8.1 Bartlett test 161
9.8.2 Levene test 162
9.8.3 Fligner test 162
9.9 Comparing several central values of dependent groups 163
9.9.1 ANOVA with blocking (two-way) 163
9.9.2 Friedman test 163
9.10 Summary 164

10 Improving Data Behaviour for Statistical Analysis: Ranking and Transformations 167

10.1 Ranking/sorting 168
10.2 Non-linear transformations 169
10.2.1 Square root transformation 169
10.2.2 Power transformation 169
10.2.3 Log(arithmic)-transformation 169
10.2.4 Box–Cox transformation 171
10.2.5 Logit transformation 171
10.3 Linear transformations 172
10.3.1 Addition/subtraction 172
10.3.2 Multiplication/division 173
10.3.3 Range transformation 174
10.4 Preparing a data set for multivariate data analysis 174
10.4.1 Centring 174
10.4.2 Scaling 174
10.5 Transformations for closed number systems 176
10.5.1 Additive logratio transformation 177
10.5.2 Centred logratio transformation 178
10.5.3 Isometric logratio transformation 178
10.6 Summary 179

11 Correlation 181
11.1 Pearson correlation 182
11.2 Spearman rank correlation 183
11.3 Kendall-tau correlation 184
11.4 Robust correlation coefficients 184
11.5 When is a correlation coefficient significant? 185
11.6 Working with many variables 185

11.7 Correlation analysis and inhomogeneous data 187


11.8 Correlation results following additive logratio or centred logratio
transformations 189
11.9 Summary 191

12 Multivariate Graphics 193


12.1 Profiles 193
12.2 Stars 194
12.3 Segments 196
12.4 Boxes 197
12.5 Castles and trees 198
12.6 Parallel coordinates plot 198
12.7 Summary 200

13 Multivariate Outlier Detection 201


13.1 Univariate versus multivariate outlier detection 201
13.2 Robust versus non-robust outlier detection 204
13.3 The chi-square plot 205
13.4 Automated multivariate outlier detection and visualisation 205
13.5 Other graphical approaches for identifying outliers and groups 208
13.6 Summary 210

14 Principal Component Analysis (PCA) and Factor Analysis (FA) 211


14.1 Conditioning the data for PCA and FA 212
14.1.1 Different data ranges and variability, skewness 212
14.1.2 Normal distribution 213
14.1.3 Data outliers 213
14.1.4 Closed data 214
14.1.5 Censored data 215
14.1.6 Inhomogeneous data sets 215
14.1.7 Spatial dependence 215
14.1.8 Dimensionality 216
14.2 Principal component analysis (PCA) 216
14.2.1 The scree plot 217
14.2.2 The biplot 219
14.2.3 Mapping the principal components 220
14.2.4 Robust versus classical PCA 221
14.3 Factor analysis 222
14.3.1 Choice of factor analysis method 224
14.3.2 Choice of rotation method 224
14.3.3 Number of factors extracted 224
14.3.4 Selection of elements for factor analysis 225
14.3.5 Graphical representation of the results of factor analysis 225

14.3.6 Robust versus classical factor analysis 229


14.4 Summary 231

15 Cluster Analysis 233


15.1 Possible data problems in the context of cluster analysis 234
15.1.1 Mixing major, minor and trace elements 234
15.1.2 Data outliers 234
15.1.3 Censored data 235
15.1.4 Data transformation and standardisation 235
15.1.5 Closed data 235
15.2 Distance measures 236
15.3 Clustering samples 236
15.3.1 Hierarchical methods 236
15.3.2 Partitioning methods 239
15.3.3 Model-based methods 240
15.3.4 Fuzzy methods 242
15.4 Clustering variables 242
15.5 Evaluation of cluster validity 244
15.6 Selection of variables for cluster analysis 246
15.7 Summary 247

16 Regression Analysis (RA) 249


16.1 Data requirements for regression analysis 251
16.1.1 Homogeneity of variance and normality 251
16.1.2 Data outliers, extreme values 253
16.1.3 Other considerations 253
16.2 Multiple regression 254
16.3 Classical least squares (LS) regression 255
16.3.1 Fitting a regression model 255
16.3.2 Inferences from the regression model 256
16.3.3 Regression diagnostics 259
16.3.4 Regression with opened data 259
16.4 Robust regression 260
16.4.1 Fitting a robust regression model 261
16.4.2 Robust regression diagnostics 262
16.5 Model selection in regression analysis 264
16.6 Other regression methods 266
16.7 Summary 268

17 Discriminant Analysis (DA) and Other Knowledge-Based Classification Methods 269

17.1 Methods for discriminant analysis 269
17.2 Data requirements for discriminant analysis 270

17.3 Visualisation of the discriminant function 271


17.4 Prediction with discriminant analysis 272
17.5 Exploring for similar data structures 275
17.6 Other knowledge-based classification methods 276
17.6.1 Allocation 276
17.6.2 Weighted sums 278
17.7 Summary 280

18 Quality Control (QC) 281


18.1 Randomised samples 282
18.2 Trueness 282
18.3 Accuracy 284
18.4 Precision 286
18.4.1 Analytical duplicates 287
18.4.2 Field duplicates 289
18.5 Analysis of variance (ANOVA) 290
18.6 Using maps to assess data quality 293
18.7 Variables analysed by two different analytical techniques 294
18.8 Working with censored data – a practical example 296
18.9 Summary 299

19 Introduction to R and Structure of the DAS+R Graphical User Interface 301


19.1 R 301
19.1.1 Installing R 301
19.1.2 Getting started 302
19.1.3 Loading data 302
19.1.4 Generating and saving plots in R 303
19.1.5 Scatterplots 305
19.2 R-scripts 307
19.3 A brief overview of relevant R commands 311
19.4 DAS+R 315
19.4.1 Loading data into DAS+R 316
19.4.2 Plotting diagrams 316
19.4.3 Tables 317
19.4.4 Working with “worksheets” 317
19.4.5 Groups and subsets 317
19.4.6 Mapping 318
19.5 Summary 318

References 321

Index 337
Preface

Although several books already exist on statistical data analysis in the natural sciences, there
are few books written at a level that a non-statistician will easily understand. In our experience
many colleagues in earth and environmental sciences are not sufficiently trained in mathematics
or statistics to easily comprehend the necessary formalism. This is a book written in colloquial
language, avoiding mathematical formulae as much as possible (some may argue too much)
trying to explain the methods using examples and graphics instead. To use the book efficiently,
readers should have some computer experience and some basic understanding of statistical
methods. We start with the simplest of statistical concepts and carry readers forward to a deeper
and more extensive understanding of the use of statistics in the natural sciences. Importantly,
users of the book, rather than readers, will require a sound knowledge of their own branch of
natural science.
In the book we try to demonstrate, based on practical examples, how data analysis in envi-
ronmental sciences should be approached, outline advantages and disadvantages of methods
and show and discuss the do’s and don’ts. We do not use "simple toy examples" to demonstrate
how well certain statistical techniques function. The book rather uses a single, large, real world
example data set, which is investigated in more and more depth throughout the book. We feel
that this makes it an interesting read from beginning to end, without preventing the use of single
chapters as a reference for certain statistical techniques. This approach also clearly demon-
strates the limits of classical statistical data analysis with environmental (geochemical) data.
The special properties of environmental data (e.g., spatial dependencies, outliers, skewed dis-
tributions, closure) do not agree well with the assumptions of "classical" (Gaussian) statistics.
These are, however, the statistical methods taught in all basic statistics courses at universities
because they are the most fundamental statistical methods. As a consequence, up to this day,
techniques that are far from ideal for the data at hand are widely applied by earth and envi-
ronmental scientists in data analysis. Applied earth science data call for the use of robust and
non-parametric statistical methods. These techniques are extensively used and demonstrated
in the book. The focus of the book is on the exploratory use of statistical methods extensively
applying graphical data analysis techniques.
The book concerns the application of statistical and other computer methods to the manage-
ment, analysis and display of spatial data. These data are characterised by including locations
(geographic coordinates), which leads to the necessity of using maps to display the data and the
results of the statistical methods. Although the book uses examples from applied geochemistry,
the principles and ideas equally well apply to other natural sciences, e.g., environmental sci-
ences, pedology, hydrology, geography, forestry, ecology, and health sciences/epidemiology.
That is, to anybody using spatially dependent data. The book will be useful to postgraduate
students, possibly final year students with dissertation projects, students and others interested
in the application of modern statistical methods (and not so much in theory), and natural sci-
entists and other applied statistical professionals. The book can be used as a textbook, full of
practical examples or in a basic university course on exploratory data analysis for spatial data.
The book can also serve as a manual to many statistical methods and will help the reader to
better understand how different methods can be applied to their data – and what should not be
done with the data.
The book is unique because it supplies direct access to software solutions (based on R, the
Open Source version of the S-language for statistics) for applied environmental statistics.
For all graphics and tables presented in the book, the R-codes are provided in the form
of executable R-scripts. In addition, a graphical user interface for R, called DAS+R, was
developed by the last author for convenient, fast and interactive data analysis. Providing
powerful software for the combination of statistical data analysis and mapping is one of
the highlights of the software tools. This software may be used with the example data as a
teaching/learning tool, or with the reader’s own data for research.

Clemens Reimann, Geochemist, Trondheim
Peter Filzmoser, Statistician, Vienna
Robert G. Garrett, Geochemist, Ottawa
Rudolf Dutter, Statistician, Vienna

September 1, 2007.
Acknowledgements

This book is the result of a fruitful cooperation between statisticians and geochemists that has
spanned many years. We thank our institutions (the Geological Surveys of Norway (NGU) and
Canada (GSC) and Vienna University of Technology (VUT)) for providing us with the time
and opportunity to write the book. The Department for International Relations of VUT and
NGU supported some meetings of the authors.
We thank the Wiley staff for their very professional support and discussions.
Toril Haugland and Herbert Weilguni were our test readers; they critically read the whole
manuscript and made many corrections and valuable comments.
Many external reviewers read single chapters of the book and suggested important changes.
The software accompanying the book was developed with the help of many VUT students,
including Andreas Alfons, Moritz Gschwandner, Alexander Juschitz, Alexander Kowarik,
Johannes Löffler, Martin Riedler, Michael Schauerhuber, Stefan Schnabl, Christian Schwind,
Barbara Steiger, Stefan Wohlmuth and Andreas Zainzinger, together with the authors.
Friedrich Leisch of the R core team and John Fox and Matthias Templ were always available
for help with R and good advice concerning R-commander.
Friedrich Koller supplied lodging, many meals and stimulating discussions for Clemens
Reimann when working in Vienna. Similarly, the Filzmoser family generously hosted Robert
G. Garrett during a working visit to Austria.
NGU allowed us to use the Kola Project data; the whole Kola Project team is thanked for
many important discussions about the interpretation of the results through many years.
Arne Bjørlykke, Morten Smelror, and Rolf Tore Ottesen wholeheartedly backed the project
over several years.
Heidrun Filzmoser is thanked for translating the manuscript from Word into LaTeX. The fam-
ilies of the authors are thanked for their continued support, patience with us and understanding.
Many others who are not named above contributed to the outcome; we wish to express our
gratitude to all of them.
About the authors

Clemens REIMANN
Clemens Reimann (born 1952) holds an M.Sc. in Mineralogy and Petrology from the Uni-
versity of Hamburg (Germany), a Ph.D. in Geosciences from Leoben Mining University,
Austria, and a D.Sc. in Applied Geochemistry from the same university. He has worked
as a lecturer in Mineralogy and Petrology and Environmental Sciences at Leoben Mining
University, as an exploration geochemist in eastern Canada, in contract research in envi-
ronmental sciences in Austria and managed the laboratory of an Austrian cement company
before joining the Geological Survey of Norway in 1991 as a senior geochemist. From
March to October 2004 he was director and professor at the German Federal Environment
Agency (Umweltbundesamt, UBA), responsible for the Division II, Environmental Health
and Protection of Ecosystems. At present he is chairman of the EuroGeoSurveys geochem-
istry expert group, acting vice president of the International Association of GeoChemistry
(IAGC), and associate editor of both Applied Geochemistry and Geochemistry: Exploration,
Environment, Analysis.

Peter FILZMOSER
Peter Filzmoser (born 1968) studied Applied Mathematics at the Vienna University of
Technology, Austria, where he also wrote his doctoral thesis and habilitation devoted to the
field of multivariate statistics. His research led him to the area of robust statistics, resulting
in many international collaborations and various scientific papers in this area. His interest
in applications of robust methods resulted in the development of R software packages.
He was and is involved in the organisation of several scientific events devoted to robust
statistics. Since 2001 he has been dozent at the Statistics Department at Vienna University
of Technology. He was visiting professor at the Universities of Vienna, Toulouse and Minsk.

Robert G. GARRETT
Bob Garrett studied Mining Geology and Applied Geochemistry at Imperial College, Lon-
don, and joined the Geological Survey of Canada (GSC) in 1967 following post-doctoral
studies at Northwestern University, Evanston. For the next 25 years his activities focussed
on regional geochemical mapping in Canada, and overseas for the Canadian International
Development Agency, to support mineral exploration and resource appraisal. Throughout
his work there has been a use of computers and statistics to manage data, assess their quality,
and maximise the knowledge extracted from them. In the 1990s he commenced collabora-
tions with soil and agricultural scientists in Canada and the US concerning trace elements
in crops. Since then he has been involved in various Canadian Federal and university-based
research initiatives aimed at providing sound science to support Canadian regulatory and
international policy activities concerning risk assessments and risk management for metals.
He retired in March 2005 but remains active as an Emeritus Scientist.
Rudolf DUTTER
Rudolf Dutter is senior statistician and full professor at Vienna University of Technology,
Austria. He studied Applied Mathematics in Vienna (M.Sc.) and Statistics at Université de
Montréal, Canada (Ph.D.). He spent three years as a post-doctoral fellow at ETH, Zurich,
working on computational robust statistics. Research and teaching activities followed at the
Graz University of Technology, and as a full professor of statistics at Vienna University of
Technology, both in Austria. He also taught and consulted at Leoben Mining University,
Austria; currently he consults in many fields of applied statistics with main interests in
computational and robust statistics, development of statistical software, and geostatistics.
He is author and coauthor of many publications and several books, e.g., an early booklet in
German on geostatistics.
Figure 1.2 Geological map of the Kola Project survey area (modified from Reimann et al., 1998a)

Figure 5.1 Geological map (upper left, see Figure 1.2 for a legend) and simple topographical map
(upper right) of the Kola Project area used as base maps for geochemical mapping (lower maps). Lower
maps: distribution of Mg in O-horizon soils added to base maps.
Figure 5.12 Colour smoothed surface maps for As in Kola C-horizon soils. Different colour scales are
used. Upper left, percentile scale using rainbow colours; upper right, inverse rainbow colours; middle
left, terrain colours; middle right, topographic colours; lower left, continuous percentile scale with
rainbow colours; and lower right, truncated percentile scale

[Figure 8.2 panels: CP-plots of Al, K, Pb and S (mg/kg) against probability [%], with separate
symbols for moss, O-horizon, B-horizon and C-horizon samples.]

Figure 8.2 The same data as above (Figure 8.1) compared using CP-plots. These plots are more
impressive when different colours are added to the different symbols

[Figure 8.8 panels: Ni in moss (mg/kg) versus Cu in moss (mg/kg), with symbols distinguishing
samples from Norway, Finland and Russia.]

Figure 8.8 Scatterplot of Ni versus Cu in moss (compare Figure 6.1). The country of origin of the
samples is encoded into the symbols

[Figure 8.9 panels: left, map of sample locations (UTM east/north, m) with the Monchegorsk,
Nikel-Zapoljarnij and Other subsets; right, Ni versus Cu in moss (mg/kg) using the same symbols.]

Figure 8.9 Map identifying the location of the subsets (left) and a scatterplot of Cu versus Ni (right),
using the same symbols for the subsets

Figure 8.11 Cu in moss and the O- and C-horizon soils along the east–west transect through
Monchegorsk (compare Figure 6.6) and along the north–south transect at the western project boundary
(compare Figure 6.7)

[Figure 8.12 panels: ternary diagrams Ni-Cu-Pb (left) and Al-Fe-Mn (right) for the Kola Project
moss data, with the Monchegorsk, Nikel/Zapoljarnij and Other subsets distinguished by symbols.]

Figure 8.12 Two ternary diagrams plotted with data from the Kola moss data set

[Figure 12.9 variables: Ba, Ca, Cu, K, La, Na, Pb, Sr, Th, Ti, Y, Zn.]

Figure 12.9 Parallel coordinates plot using three lithology-related data subsets (lithology 9, Caledo-
nian sediments; lithology 82, alkaline rocks; and lithology 7, granites – see geological map Figure 1.2)
from the Kola C-horizon soil data set


Figure 13.5 Maps identifying multivariate (p = 7) outliers in the Kola Project moss data. The left map
shows the location of all samples identified as multivariate outliers, the right map includes information
on the relative distance to the multivariate data centre (different symbols) and the average element
concentration at the sample site (colour, here grey-scale, intensity)

Figure 17.3 Map showing the distribution of the vegetation zones in the Kola Project survey area
(upper left) and maps of the predicted distribution of the three vegetation zones in the survey area
using LDA with cross-validation. The grey scale corresponds to the probability of assigning a sample to
a specific group (boreal-forest, upper right; forest-tundra, lower left; tundra, lower right)

[Figure 17.6 panel titles: Original Groups; Classical 0.01, all variables; MCD 0.01, all variables;
MCD 0.01, selected variables. The legend includes a "no allocation" class.]

Figure 17.6 Map showing the distribution of the vegetation zones in the Kola Project survey area
(upper left) and maps of the predicted distribution of the three vegetation zones in the survey area
using allocation. Classical non-robust estimates were used in the upper right map, robust estimates
(MCD) were used for the map below (lower right), and in the lower left map the allocation was based
on a subset of 11 elements using robust estimates
1
Introduction

Statistical data analysis is about studying data – graphically or via more formal methods.
Exploratory Data Analysis (EDA) techniques (Tukey, 1977) provide many tools that transform
large and cumbersome data tabulations into easy-to-grasp graphical displays which are largely
independent of assumptions about the data. They are used to “visualise” the data. Graphical
data analysis is often criticised as non-scientific because of its apparent ease. This critique
probably stems from many scientists trained in formal statistics not being aware of the power
of graphical data analysis.
Occasionally, even in graphical data analysis mathematical data transformations are useful
to improve the visibility of certain parts of the data. A logarithmic transformation would be
a typical example of a transformation that is used to reduce the influence of unusually high
values that are far removed from the main body of data.
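As a minimal sketch (not one of the book's DAS+R scripts; the values are simulated, not Kola
Project data), such a log-transformation can be tried directly in R:

    # Simulated right-skewed "concentrations"; a log-scale pulls in the extreme
    # values so that the main body of the data becomes visible in the histogram.
    set.seed(1)
    cu <- rlnorm(500, meanlog = 3, sdlog = 1)      # hypothetical Cu values [mg/kg]
    par(mfrow = c(1, 2))
    hist(cu, main = "Original scale", xlab = "Cu [mg/kg]")
    hist(log10(cu), main = "Log-scale", xlab = "log10(Cu [mg/kg])")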
Graphical data analysis is a creative process; it is far from simple to produce informative
graphics. Among other factors, the choices of graphic, symbols, and data subsets are crucial
for gaining an understanding of the data. It is about iterative learning, from one graphic to
the next until an informative presentation is found, or as Tukey (1977) said “It is important to
understand what you can do before you learn to measure how well you seem to have done it”.
However, for a number of purposes graphics are not sufficient to describe a given data set.
Here the realms of descriptive statistics are entered. Descriptive statistics are based on model
assumptions about the data and thus more restrictive than EDA. A typical model assumption
used in descriptive statistics would be that the data follow a normal distribution. The normal
distribution is characterised by a typical bell shape (see Figure 4.1 upper left) and depends on
two parameters, mean and variance (Gauss, 1809). Many natural phenomena are described by
a normal distribution. Thus this distribution is often used as the basic assumption for statisti-
cal methods and estimators. Statisticians commonly assume that the data under investigation
are a random selection of many more possible observations that altogether follow a normal
distribution. Many formulae for statistical calculations, e.g., for mean, standard deviation and
correlation are based on a model. It is always possible to use the empirical data at hand and the
given statistical formula to calculate “values”, but only if the data follow the model will the val-
ues be representative, even if another random sample is taken. If the distribution of the samples
deviates from the shape of the model distribution, e.g., the bell shape of the normal distribu-
tion, statisticians will often try to use transformations that force the data to approach a normal
distribution. For environmental data a simple log-transformation of the data will often suffice
to approach a normal distribution. In such a case it is said that the data come from a lognormal
distribution.
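Whether a log-transformation brings data closer to the normal model can be checked quickly with
a normal QQ-plot of the raw and the log-transformed values; a small sketch in R, again with
simulated values rather than Kola Project data:

    set.seed(2)
    x <- rlnorm(200, meanlog = 2, sdlog = 0.8)   # lognormally distributed values
    par(mfrow = c(1, 2))
    qqnorm(x, main = "Raw values"); qqline(x)    # curved pattern: poor fit to the normal model
    qqnorm(log(x), main = "Log-transformed"); qqline(log(x))   # close to a straight line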
Environmental data are frequently characterised by exceptionally high values that deviate
widely from the main body of data. In such a case even a data transformation will not help to
approach a normal distribution. Here other statistical methods are needed that will still provide
reliable results. Robust statistical procedures have been developed for such data and are often
used throughout this book.
Inductive statistics is used to test hypotheses that are formulated by the investigator. Most
methods rely heavily on the normal distribution model. Other methods exist that are not based
on these model assumptions (non-parametric statistical tests) and these are often preferable
for environmental data.
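For example, the classical two-sample t-test has a rank-based counterpart that does not rely on
the normal model; a small sketch with simulated skewed data (hypothetical values, not taken from
the Kola data set):

    set.seed(3)
    x <- rlnorm(30, meanlog = 1.0)     # data group 1
    y <- rlnorm(30, meanlog = 1.5)     # data group 2
    t.test(x, y)                       # parametric: compares means, assumes approximate normality
    wilcox.test(x, y)                  # non-parametric: rank-based, no normality assumption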
Most data sets in applied earth sciences differ from data collected by other scientists (e.g.,
physicists) because they have a spatial component. They present data for individual specimens,
termed samples by earth scientists, which were taken somewhere on Earth. Thus, in addition to
the measured values, for example geochemical analyses, the samples have spatial coordinates.
During data analysis of environmental and earth science research results this spatial component
is often neglected, to the detriment of the investigation. At present there exist many computer
program systems either for data analysis, often based on classical statistics that were developed
for physical measurements, or for “mapping” of spatial data, e.g., geographical information
systems (GIS). For applied earth sciences a data analysis package that takes into account the
special properties of spatial data and permits the inclusion of “space” in data analysis is needed.
Due to their spatial component, earth science and environmental data have special properties
that need to be identified and understood prior to statistical data analysis in order to select the
“right” data analysis techniques. These properties include:

• The data are spatially dependent (the closer two sample sites are, the higher the probability
that the samples show comparable analytical results) – all classical statistical tests assume
independence of individuals.
• At each sample site a multitude of different processes can have had an influence on the
measured analytical value (e.g., for soil samples these include: parent material, topog-
raphy, vegetation, climate, Fe/Mn-oxyhydroxides, content of organic material, grain size
distribution, pH, mineralogy, presence of mineralisation or contamination). For most sta-
tistical tests, however, it is necessary that the samples come from the same distribution –
this is not possible if different processes influence different samples in different propor-
tions. A mixture of results caused by many different underlying processes may mimic a
lognormal data distribution – but the underlying “truth” is that the data originate from mul-
tiple distributions and should not be treated as if they were drawn from a single normal
distribution.
• Like much scientific data (consider, for example, data from psychological investigations),
applied earth sciences data are imprecise. They contain uncertainty. Uncertainty is unavoid-
ably introduced at the time of sampling, sample preparation and analysis (in psychological
investigations some people may simply lie). Classical statistical methods call for “precise”
data. They will often fail, or provide wrong answers and “certainties”, when applied to im-
precise data. In applied earth sciences the task is commonly to optimally “visualise” the
results. This book is all about “visualisation” of data behaviour.
• Last but not least, environmental data are most often compositional data. The individual vari-
ables are not independent of each other but are related by, for example, being expressed as a
percentage (or parts per million – ppm (mg/kg)). They sum up to a constant, e.g., 100 percent
or 1. To understand the problem of closed data, it just has to be remembered how percentages
are calculated. They are ratios that contain all variables that are investigated in their denom-
inator. Thus, single variables of percentage data are not free to vary independently. This has
serious consequences for data analysis. Possibilities of how to deal with compositional data
and the effect of data closure are discussed in Chapter 10 (for an in depth discussion see
Aitchison, 1986, 2003; Buccianti et al., 2006).

The properties described above do not agree well with the assumptions of “classical”
(Gaussian) statistics. These are, however, the statistical methods taught in all basic statistics
courses at universities because they are the most fundamental statistical methods. As a
consequence, up to this day techniques that are far from ideal for the data at hand are widely
applied by earth scientists in data analysis. Rather, applied earth science data call for the
use of robust and non-parametric statistical methods. Instead of “precise” statistics, so-called
“simple” exploratory data analysis methods, as introduced by Tukey in 1977, should always
be the first choice. To overcome the problem of “closed data” the data array may have to be
opened (see Section 10.5) prior to any data analysis (note that even graphics can be misleading
when working with closed data). However, working with the resulting ratios has other severe
shortcomings and at present there is no ideal solution to the problems posed by compositional
data. All results based on correlations should routinely be counterchecked with opened data.
Closure cannot be overcome by not analysing the major components in a sample or by not
using some elements during data analysis (e.g. by focussing on trace elements rather than
using the major elements). Even plotting a scatterplot of only two variables can be severely
misleading with compositional data (compare Figures 10.7 and 10.8).
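The closure effect is easy to reproduce. The following sketch uses three hypothetical,
independently simulated components (not real analyses) to show how closing the data to 100
percent induces correlation that is absent in the raw amounts, and how additive logratios
(Section 10.5) are formed to open the data:

    set.seed(4)
    a <- rlnorm(500, 3); b <- rlnorm(500, 3); c <- rlnorm(500, 3)   # independent raw amounts
    total <- a + b + c
    comp <- 100 * cbind(A = a, B = b, C = c) / total   # closed composition, rows sum to 100
    cor(a, b)                       # raw amounts: correlation near zero
    cor(comp[, "A"], comp[, "B"])   # closed parts: typically clearly negative, induced by closure
    alr <- log(comp[, c("A", "B")] / comp[, "C"])      # additive logratios with C as divisor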
Graphical data analyses were largely manual and labour intensive 30 or 40 years ago,
another reason why classical statistical methods were widely used. Interactive graphical data
analysis has become widely available in these days of the personal computer – provided the
software exists to easily prepare, modify and finally store the graphics. To make full use of
exploratory data analysis, the software should be so powerful and convenient that it becomes
fun to “play” with the data, to look at the data in all kinds of different graphics until one
intuitively starts to understand their message. This leads to a better understanding about what
kind of advanced statistical methods can and should (or rather should not) be applied to the data
at hand. This book aims at providing such a package, where it becomes easy and fun to “play”
with spatial data and look at them in many different ways in a truly exploratory data analysis
sense before attempting more formal statistical analyses of the data. Throughout the book it
is demonstrated how more and more information is extracted from spatial data using “simple”
graphical techniques instead of advanced statistical calculations and how the “spatial” nature
of the data can and should be included in data analysis.
Such a data analysis and mapping package for spatial data was developed 20 years ago
under the name “DAS” (Data Analysis System – Dutter et al., 1990). A quite similar system,
called IDEAS, was used at the Geological Survey of Canada (GSC) (Garrett, 1988). DAS was
only available for the DOS environment. This made it more and more difficult to run for people
used to the Microsoft Windows operating system. During recent years, R has developed into
a powerful and much used open source tool (see: http://www.r-project.org/) for
advanced statistical data analysis in the statistical community. R could actually be directly
used to produce all the tables and graphics shown in this book. However, R is a command-line
language, and as such it requires more training and experience than the average non-statistician
will usually have, or be willing to invest in gaining. The R-scripts for all the graphics and tables
shown in this book are provided and can be used to learn R and to produce these outputs with
the reader’s own data. The program package accompanying this book provides the link between
DAS and R and is called DAS+R. It uses the power of R and the experience of the authors in
practical data analysis of applied geoscience data. DAS+R allows the easy and fast production
of tables, graphics and maps, and the creation and storage of data subsets, through a graphical
user interface that most scientists will find intuitive and be able to use with very little training.
R provides the tools for producing and changing tables, graphics and maps and, if needed, the
link to some of the most modern developments in advanced statistical data analysis techniques.
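As a taste of that command-line workflow, a hypothetical sketch (the file name and the column
name are placeholders, not files shipped with the book): a data table exported from a spreadsheet
can be read and displayed with a few commands.

    dat <- read.csv("mydata.csv")    # one row per sample, one column per variable
    summary(dat$Cu)                  # quick numerical overview of a single element
    boxplot(log10(dat$Cu), ylab = "log10(Cu [mg/kg])")   # a first graphical display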
To demonstrate the methods, and to teach a user not trained to think “graphically” in data
analysis, an existing multidimensional data set from a large environmental geochemical map-
ping project in the European Arctic, the Kola Ecogeochemistry Project (Reimann et al., 1998a),
is used as an example. The Kola Project data include many different sample materials; more
than 60 chemical elements were determined, often by several different analytical techniques.
The book provides access to the data set and to the computer scripts used to produce the
example statistical tables, graphics and maps. It is assumed that the reader has a spreadsheet
package like Microsoft Excel™ or Quattro Pro™ installed and has the basic knowledge to
effectively use a spreadsheet program.
The Kola Project data set is used to demonstrate step by step how to use exploratory data
analysis to extract more and more information from the data. Advantages and disadvantages
of certain graphics and techniques are discussed. It is demonstrated which techniques may
be used to good effect, and which should better be avoided, when dealing with spatial data.
The book and the software can be used as a textbook for teaching exploratory data analysis
(and many aspects of applied geochemistry) or as a reference guide to certain techniques. The
program system can be used with the reader’s own data, and the book can then be used as
a “handbook” for graphical data analysis. Because the original R scripts are provided for all
tables, graphics, maps, statistical tests, and more advanced statistical procedures, the book can
also be used to become familiar with R programming procedures.
Because many readers will likely use the provided software to look at their own data rather
than to study the Kola Project data, the book starts out with a description of the file structure
that is needed for entering new data into R and DAS+R (Chapter 2). Some common problems
encountered when editing a spreadsheet file as received from a laboratory to the DAS+R
(or R) format are discussed in the same chapter. A selection of graphics for displaying data
distributions is introduced next (Chapter 3), before the more classical distribution measures
are introduced and discussed (Chapter 4). The spatial structure of applied earth science data
should be an integral part of data analysis, and thus the spatial display, or mapping, of the data
is discussed next (Chapter 5), before further graphics used in exploratory data analysis are
introduced (Chapter 6). A “classical” task in the analysis of applied geochemical data is the
definition of background and threshold, coupled with the identification of “outliers” and of
element sources – techniques used up to this time are discussed in their own chapter (7). A
key component of exploratory data analysis lies in comparing data. The use of data subsets is
an especially powerful tool. The definition of subsets of data and using a variety of graphics
for comparing them are introduced in Chapter 8, while Chapter 9 covers the more formal
statistical tests. Statistical tests often require that the data are drawn from a normal distribution.
Many problems can arise when using formal statistics with applied earth science data that
are not normally distributed or drawn from multiple statistical populations. Chapter 10 covers
techniques that may be used to improve the data behaviour for statistical analysis as a prepara-
tion for entering the realms of multivariate data analysis. The following chapters cover some of
the widely used multivariate techniques such as correlation analysis (Chapter 11), multivariate
graphics (Chapter 12), multivariate outlier detection (Chapter 13), principal component and
factor analysis (Chapter 14), cluster analysis (Chapter 15), regression analysis (Chapter 16)
and discriminant analysis (Chapter 17). In all chapters the advantages and disadvantages of the
methods as well as the data requirements are discussed in depth. Chapter 18 covers different
aspects of an integral part of collecting data in applied earth sciences: quality control. One
could argue that Chapter 18 should be at the front of this book, due to its importance. However,
quality control is based on graphics and statistics that needed to be introduced first and thus it
is treated in Chapter 18, notwithstanding the fact that it should be a very early consideration
when designing a new project. Chapter 19 provides an introduction to R and the R-scripts
used to produce all diagrams and tables in this book. The program system and graphical
user interface under development to make R easily accessible to the non-statistician is also
explained.
The following books can be suggested for further reading. Some cover “graphical statistics”,
some the interpretation of geoscience and environmental data, some give an introduction to the
computer language S (the base of R) and the commercial software package S-Plus (comparable
to R), and several provide the mathematical formulae that were consciously avoided in this
book. Davis (1973, 2002) is still one of the “classic” textbooks about computerised data analysis
in the geosciences. Tukey (1977) coined the term “Exploratory Data Analysis” (EDA) and
introduced a number of powerful graphics to visualise data (e.g., the boxplot). Velleman and
Hoaglin (1981) provide an introduction to EDA including the computer codes for standard
EDA methods (Fortran programs). Chambers et al. (1983) contains a comprehensive survey of
graphical methods for data analysis. Rollinson (1993) is a classical textbook on data analysis
in geochemistry; the focus is on interpretation rather than on statistics. Rock (1988), and Helsel
and Hirsch (1992) provide an excellent compact overview of many statistical techniques used
in the earth sciences with an early focus on robust statistical methods. Cleveland’s papers
(1993, 1994) are general references for visualising and graphing data. Millard and Neerchal
(2001) provide an extensive introduction to environmental statistics using S-Plus. Venables
and Ripley (2002) explain a multitude of statistical methods and their application using S.
Murrell (2006) provides an excellent and easy to read description of the graphics system
in R.

1.1 The Kola Ecogeochemistry Project

The Kola Ecogeochemistry Project (web site http://www.ngu.no/Kola) gathered
chemical data for more than fifty chemical elements from four different primary sample
materials (terrestrial moss, and the O-, B-, and C-horizon of podzolic soils) in parts of north-
ern Finland, Norway and Russia. Two additional materials were collected for special purposes
(Topsoil: 0–5 cm, and lake water – the latter in the Russian survey area only). The size of the sur-
vey area in the European Arctic (Figure 1.1) was 188 000 km². The four primary materials were
collected because in combination they can reflect atmospheric input (moss), interactions of the
biosphere with element cycles (moss and O-horizon), the atmosphere–biosphere–lithosphere
interplay (O-horizon), the influence of soil-forming processes (B-horizon), and the regional
geogenic background distribution (the lithosphere) (C-horizon) for the elements investigated.
Topsoil was primarily collected for the determination of radionuclides, but later a number of
additional parameters were determined. Lake water reflects the hydrosphere, samples were

Figure 1.1 Location of the Kola Ecogeochemistry Project survey area

collected in Russia only because the 1000 lakes project (Henriksen et al., 1996, 1997) col-
lected lake water samples over all of Scandinavia at the same time (1995). All results for the
four primary sample materials and topsoil are documented in the form of a geochemical atlas
(Reimann et al., 1998a). Lake water geochemistry is documented in a number of publications
in international journals (see, e.g., Reimann et al., 1999a, 2000a).
The main aim of the project was the documentation of the impact of the Russian nickel
industry on the vulnerable Arctic environment. The result was a database of the concentration
of more than 50 chemical elements in the above sample materials, reflecting different com-
partments of the ecosystem, in the year 1995. Each material can be studied for itself; the main
power of the project design, however, lies in the possibility to directly compare results from
all the different sample materials at the same sites.
This book provides access to all the regional Kola data, including topsoil and lake waters.
Throughout the book examples were prepared using these data and the reader is thus able to
reproduce all these diagrams (and many others) by using DAS+R and the Kola data. There are
certainly many features hidden in the data sets that have not yet been covered by publications.
Feel free to use the data for your own publications, but if a fellow scientist wants to use these
data for publications, due reference should be given to the original source of the data (Reimann
et al., 1998a).

1.1.1 Short description of the Kola Project survey area


The survey area is described in detail in the geochemical atlas (Reimann et al., 1998a). This
book also provides a multitude of maps, which can be helpful when interpreting the data.
The project covered the entire area north of the Arctic Circle between longitudes 24° and
35.5° east and thence north to the Barents Sea (Figure 1.1). Relative to most of Europe, the
Finnish and Norwegian parts of the area are still almost pristine. Human activities are mostly
limited to fishery, reindeer-herding and forestry (in the southern part of the project area). Excep-
tions are a large iron ore mine and mill at Kirkenes, N-Norway; a small, brown coal-fired power
station near Rovaniemi at the southern project border in Finland and some small mines. Popula-
tion density increases gradually from north to south. In contrast, the Russian part of the project
area is heavily industrialised with the nickel refinery at Monchegorsk, the nickel smelter at Nikel
and the Cu/Ni-ore roasting plant at Zapoljarnij, which are three of the world’s largest point-
source emitters of SO2 and Cu, Ni and Co and a number of other metals. These three sources
together accounted for emissions of 300 000 t SO2 , 1900 t Ni and 1100 t Cu in 1994 (Reimann
et al., 1997c). Apatite ore is mined and processed near Apatity, iron ore at Olenegorsk and
Kovdor, Cu-Ni-ore near Zapoljarnij. An aluminium smelter is located near Kandalaksha. The
major towns of Murmansk and Apatity have large oil- and coal-fired thermal heating and power
plants.
Topographically, large parts of the area can be characterised as highlands. In Norway, the
general landscape in the coastal areas is quite rugged, and the mountains reach elevations of
700 m above sea level (a.s.l.). In Russia, in the south-western part of the Kola Peninsula, there
are mountains reaching 200–500 m a.s.l. Near Monchegorsk and Apatity and near the coast of
the White Sea there are some higher mountains (over 1000 m a.s.l.).
The geology of the area is complex and includes a multitude of different bedrock
types (Figure 1.2). Some of the rock types occurring in the area are rare on a global
scale and have unusual geochemical compositions. The alkaline intrusions that host the fa-
mous apatite deposits near Apatity are an example. The main rock types in the area that
are occasionally mentioned in later chapters due to their special geochemical signatures,
are:

• Sedimentary rocks of Caledonian (c. 1600–400 million years (Ma) old) and Neoproterozoic
(1000–542 Ma) age that occur along the Norwegian coast and on the Rhybachi Peninsula in
Russia (lithologies 9 and 10 in the data files).
• The rocks of the granulite belt that runs from Norway through northern Finland into Russia.
These rocks are of Archean age (2300–1900 Ma) and “foreign” to the area (see Reimann
and Melezhik, 2001). They are subdivided into “felsic” and “mafic” granulites (lithologies
31 and 32 in the data files).
• Diverse greenstone belts, which occur throughout the area. These are Palaeoproterozoic
(2400–1950 Ma) rocks of volcanic origin (lithologies 51 and 52 in the data files). These
rocks host many of the ore occurrences in the area, e.g., the famous Cu-Ni-deposits near
Zapoljarnij in Russia.
• Alkaline and ultramafic alkaline intrusions of Palaeoproterozoic to Palaeozoic age (1900–
470 Ma) (see above). These host the important phosphate deposits near Apatity (lithologies
81, 82 and 83 in the data files).
• Granitic intrusions of Palaeoproterozoic age (1960–1650 Ma), occurring in the south-western
corner of the survey area in Finland and as small bodies throughout the survey area (lithology
7 in the data files).
• Large gneiss masses of Archean (3200–2500 Ma) and uncertain age (lithologies 1, 4 and 20
in the data files) that do not show any geochemical peculiarities.

Figure 1.2 Geological map of the Kola Project survey area (modified from Reimann et al., 1998a). A
colour reproduction of this figure can be seen in the colour section, positioned towards the centre of
the book
The study area is part of the glaciated terrain of Northern Europe. The main Quaternary
deposits are till and peat. There are also large areas without any surficial cover, dominated by
outcrops and boulder fields (Niemelä et al., 1993).
The north–south extent of the survey area is about 500 km. Within this distance, three
vegetation zones gradually replace each other (for a vegetation map in colour see Kashulina
et al., 1997; Reimann et al., 1998a). The southern and central parts of the area fall into the
northern boreal coniferous zone. Towards the north, this zone gradually gives way to subarctic
birch forest, followed by the subarctic tundra zone close to the coast of the Barents Sea. These
changes in vegetation zones can also occur with altitude. Major characteristics of the forest
ecosystems in this area are the sparseness of the tree layer, a slow growth rate, and the large
proportion of ground vegetation in the total biomass production.
The dominant soil-forming process in forested and treeless regions of Northern Europe is
podzolisation of mineral soils (Borggaard, 1997). Podzols are thus the most important soil type
present throughout the survey area. Soils in the area are young. Their age ranges between 5000
and 8000 years, having formed since the retreat of the continental ice sheet in Northern Europe.
A typical podzol profile consists of five main layers, the O (organic), E (eluvial), B (illuvial),
BC (transitional) and C (weathered parent)-horizon. Colour photographs of typical podzol
profiles from the area can be found in Reimann et al. (1998a) or on the Kola Project web site:
http://www.ngu.no/Kola/podzol.html. The O-horizon of podzol is characterised
by a low pH-value (MEDIAN in the survey area 3.85 – Reimann et al., 1998a). pH increases
systematically with depth. In the C-horizon it reaches a MEDIAN value of 5.8. The O-horizon
varies in thickness from less than 0.5 cm to more than 20 cm, the MEDIAN for the survey area
being 2.5 cm. The thickness of the soil profiles can vary considerably – the depth to the top of
the C-horizon varies between 10 and 123 cm, the MEDIAN being 35 cm (maps for all these
parameters are presented in Reimann et al., 1998a).
Major climatic differences occur within the survey area. In the northwest, summers are cool
and winters mild, with most precipitation between September and January (coastal climate). In
the central part, warm summers and relatively cold winters are typical; the main precipitation
takes the form of rain showers in July and August (continental climate). The easternmost part is
similar to the central part, but with colder winters. Precipitation is in general low. The average
for the whole area is <500 mm/year. The yearly average temperature is −1 °C. Maps showing
average rainfall and temperature for summer and winter can be found in Reimann et al. (1998a).

1.1.2 Sampling and characteristics of the different sample materials


Sampling of the 188 000 km² area took place from July to September 1995. The av-
erage sample density was 1 site per 300 km2 . Samples were collected from 617 sites.
Depending on availability the total number of samples, n, ranges for any one sample
material from 594 (moss) to 617 (O-horizon). Sample site selection for the project was
such that only locations where podzol had developed on glacial drift were visited, giving
genetically comparable samples over the whole area. At each site the following samples
were collected (for more detailed information see Äyräs and Reimann, 1995; Reimann
et al., 1998a):

Moss (n = 594): Terrestrial moss, preferably the species Hylocomium splendens, and if it
was absent Pleurozium schreberi, was collected. Only shoots representing the previous
three years’ growth were taken.
Terrestrial moss (Hylocomium splendens and Pleurozium schreberi) receives most of its
nutrients from the atmosphere. In Scandinavia it has been used to monitor the atmospheric
deposition of chemical elements for more than 30 years (see, e.g., Rühling and Tyler,
1968, 1973; Tyler, 1970). It was thus included in the Kola Project to reflect the input
of elements from the atmosphere (wet and dry deposition, including both geogenic and
anthropogenic dust) over the previous 2–3 years. At the same time the moss reflects
element concentrations in an important component of the arctic ecosystem in terms of
total biomass production and ecological function. It is an important supplier of litter for
the formation of the organic horizon of the soils (Kashulina et al., 1997). Results of an
interspecies comparison between Hylocomium splendens and Pleurozium schreberi are
presented in Halleraker et al. (1998). A review of the moss technique and a discussion of
possible problems related to it as well as a comparison with the chemical composition
of lichen and crowberry from the Kola Project area are given in Reimann et al. (1999b,
2001c). Results of regional mapping are also discussed in Äyräs et al. (1997a); de Caritat
et al. (2001); Reimann et al. (1997b).

O-horizon (n = 617): This was collected with a custom-built tool to facilitate and systematise
sampling, and to allow easy measurement of the volume of each sample (Äyräs and Reimann,
1995). For the O-horizon samples only the top 3 cm of the organic layer were taken. If the
total thickness of the O-horizon was less than 3 cm, only the organic layer was sampled, and
the thickness was recorded on the field notes. From seven to ten sub-samples were collected
at each site to give a composite sample with a minimum volume of one litre.
The O-horizon consists mostly of plant residues in differing stages of decay and humus,
almost inevitably mixed with some minerogenic particles. Due to its location and genesis,
the organic horizon reflects the complex interplay between the lithosphere, the biosphere
and the atmosphere. The O-horizon is a major sink for plant nutrients in northern ecosys-
tems. It can accumulate and enrich many chemical elements, e.g., via organic complexing
and the formation of metallo-organic compounds. For many elements, both from natural
and anthropogenic sources, it thus acts as a very effective “geochemical barrier” as defined
first by Goldschmidt (1937). A separate interpretation of the regional distribution patterns
found in the O-horizon from the survey area is given in Äyräs and Kashulina (2000). Other
publications discussing special properties of the O-horizon include Reimann et al. (1998c,
2000b, 2001a).

Podzol profiles (B-horizon, n = 609; C-horizon, n = 605): Before sampling a podzol pro-
file, homogeneity of the soil cover was checked over an area of 10 × 10 m. The exact
location of the profile was chosen so that both ground vegetation and micro-topography
were representative. The sampling pits were dug by spade to the C-horizon. Samples of
the O-, E-, B-, BC- and C-horizons were collected, starting from the bottom to avoid con-
tamination and mixing of the horizons. With the exception of the C-horizon, each layer
was sampled over its complete thickness. If there were distinguishable layers within the
B-horizon, these were collected in the same ratio as present in the profile. Each sample
weighed about 1–1.5 kg, depending on parameters like grain size, mineralogy, and water
content. At present only the samples of the B- and C-horizon have been studied, all other
samples have been archived in air dried condition for future reference.
The B-horizon is clearly affected by soil-forming processes. Compared to the C-horizon
it is relatively enriched in clay minerals, organic matter, and free and organically-bound
amorphous Fe- and Al-oxides and -hydroxides, which have been leached from the upper
soil horizons. It is less active than the O-horizon, but can still act as a second “geochemical
barrier” for many elements (independent of origin) within the soil profile, e.g., via co-
precipitation with the Fe-oxides/-hydroxides. Some soil-forming processes within podzol
profiles from the Kola Ecogeochemistry Project area are discussed in Räisänen et al. (1997)
and Kashulina et al. (1998a).
The deepest soil horizon of podzols, the C-horizon, is only slightly influenced by soil-
forming processes and sometimes by anthropogenic contamination, and thus mostly
reflects the natural, geogenic, element pool and regional variations therein. A geological
interpretation of the C-horizon results is given in Reimann and Melezhik (2001b).
Topsoil (n = 607): Samples were taken for the analysis of radionuclides. Topsoil samples
(0–5 cm) were collected at all sample sites using the same procedure as for the O-horizon.
These samples were also analysed by Instrumental Neutron Activation Analysis (INAA)
for about 30 elements. These results are not used within this book but provided on the web
page http://www.statistik.tuwien.ac.at/StatDA/R-scripts/data/.

For reasons of completeness, analytical results of lake water samples collected in the Russian
survey area are also included on the web page of the book. These are not used within this
book; interested readers can find a description of the sampling procedures and some first
interpretations of the results in Reimann et al. (1999a).
In addition to the sample media documented in the atlas and provided with this book,
snow, rain water, stream water, lake water, ground water, bedrock, organic stream sediments,
topsoil 0–5 cm, and overbank sediments were studied during different stages of the project
(Äyräs and Reimann, 1995; Äyräs et al., 1997a,b; Boyd et al., 1997; de Caritat et al., 1996a,b,
1997a,b, 1998a,b; Chekushin et al., 1998; Gregurek et al., 1998a,b, 1999a,b; Halleraker et al.,
1998; Kashulina et al., 1997, 1998a,b; Niskavaara et al., 1996, 1997; Reimann et al., 1996,
1997a,b,c, 1998c, 1999a,b, 2000a,b; Volden et al., 1997), often on a spatially detailed scale.
These data can be used to assist in the interpretation of the observed regional features. They
are not provided here but are available from the Norwegian Geological Survey (NGU) upon
request. A complete and regularly updated list of publications with data from the Project can be
found at the following project website: http://www.ngu.no/Kola/publist.html.

1.1.3 Sample preparation and chemical analysis


Detailed descriptions of sample preparation and analysis are provided in Reimann et al. (1998a).
In general, selection of elements, extractions and detection limits was based on a conscious
decision to make all results as directly comparable as possible.
Both the Geological Survey of Finland (GTK) and NGU chemical laboratories and the
methods described above for analysing the Kola samples are accredited according to EN
45001 and ISO Guide 25. All methods have been thoroughly validated and trueness, accuracy
and precision are monitored continuously. Some of the quality control results are presented in
Reimann et al. (1998a) – and in the last chapter of this book. The data files include information
about the analytical method used and the detection limits reached.
2 Preparing the Data for Use in R and DAS+R

Finally – the long-expected analytical results have arrived from the laboratory. With print-out
in hand and the spreadsheet-file with the results on the computer hard-disk, the scientist is eager
to look at the data. Are there any interesting patterns and information hidden in the data, and
if yes, what are the locations of the respective individuals/samples?
First things first. It is prudent, some would argue essential, to make a back-up copy of the
received data, giving the file a name that clearly identifies it as the original received data, ideally
making it “read-only”, and placing a copy in a project archive. Depending on the importance of
the project, copies of the project archive, which will grow as other critical files and output are
generated, may be stored on an institutional server, or even on some media, e.g., a CD, off-site.
The original received data should never be modified. If this procedure is not followed, it is
almost guaranteed that serious problems will be encountered when returning to a project after
some months or years of working with other data. It is important that the data, both field and
laboratory, and all related information are captured as soon after they are created as possible.
Otherwise they may get lost over time as scientists move on to other projects, switch computers
or even jobs. In an ideal world the scientist’s institution will have a geochemical database where
analytical results, together with other vital project information, are stored for easy retrieval.
In the less than ideal, but often real, world the scientist will work in an institution without
an established geochemical database and without an Information Technology (IT) support
group. Most likely the scientist has learned to use a spreadsheet like Microsoft ExcelTM or
Quattro ProTM for daily work and would like to keep the data in this format (note that the
input (and output) tables of a database are commonly such spreadsheet files). Although we do
not want to advocate the use of spreadsheets exclusively, we will discuss how to work with
spreadsheets and the pitfalls that exist when working with them. For speed of graphical data
analysis, DAS+R has been constructed in a way that it can handle many situations that would
usually require an additional database program. It is for example possible to keep auxiliary
information like method of analysis, detection limits of the analytical method and comments
concerning the data set as a whole or any individual variable. This auxiliary information, if provided in the
data file, can be read into DAS+R together with the data. It is also possible to link and/or
append data files, to create and store data subsets, and to manipulate variables using different
mathematical operators. Linking and/or appending data in R has the advantage that almost no

limits exist as to the final length or width of the data file. Today’s spreadsheet software is limited
as to the allowed width of a file, e.g., many do not allow for more than 256 variables in one file.

2.1 Required data format for import into R and DAS+R

The question could be asked, “what is the problem with using the original data file as received
from the laboratory for data analysis?” Firstly, the laboratory’s result file will likely not contain
the geographic coordinates of the samples, field information, and other ancillary data that may
be required in the following data analyses, nor the “keys” identifying field duplicates, analytical
duplicates, the project standards (control reference materials), or even the original sample site
number. The data file must thus be linked with a “key-file” containing all this information.
Secondly, data will often be received from several different laboratories, and the different files
must be linked somehow. Thirdly, the received files most likely contain a lot of information
that is necessary to know and may be useful to view in a table, but is of no use in statistical
analysis. Tables prepared for “nice looks” are actually often especially unsuited as input for
data analysis software and may require much editing. Thus it is sensible to not spend time on
“table design” at this stage.
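As a minimal illustration of this linking step (the file and column names are hypothetical and not those of the Kola Project files), the merge could be written in R as follows:

    lab  <- read.csv("lab_results.csv")   # analytical results, one row per sample
    keys <- read.csv("key_file.csv")      # sample ID, coordinates, field information
    # merge() joins the two tables on the common identifier column "ID";
    # all.x = TRUE keeps key-file samples even if no laboratory result exists (yet)
    dat <- merge(keys, lab, by = "ID", all.x = TRUE)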
However, during data analysis it is often advantageous to have all variables in the same unit
in the data file (for example for direct 1:1 comparisons or if a centred logratio transformation
(see Section 10.5.2) is necessary). For the Kola data mg/kg was used as it was the most frequent
laboratory reporting unit. For example, results of XRF analysis reported as oxides in wt%, e.g.,
Al2O3, were re-calculated as element concentrations in mg/kg, resulting in the new variable
Al_XRF (Figure 2.1). A data file for import into data analysis software packages should not
contain any special signs such as “<”, “>”, “!”, “?”, “%” or characters from a language
other than English, e.g., “å”, “ø”, “ü”, or “µ” (µg/kg). Sometimes these can cause serious
problems during data import or cause the software to crash later on. All variables need to have
different and unique names. Missing values (see Section 2.3) need to be marked by an empty
field, or some coded value, e.g., −9999, which can be used to tell the program that this is a
missing value. Data below (or above) the analytical method detection limit must be treated in
a consistent manner (see Section 2.2) – in the case of the Kola data sets all values below the
detection limit (there were no data above an upper detection limit) were set to a value of half
of the detection limit. Thus there usually is a certain amount of editing necessary to make the
data file software-compatible.
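For the oxide example above, the recalculation from wt% to mg/kg is a simple rescaling that can be scripted rather than done by hand; the following sketch uses an invented data frame and rounded atomic masses:

    dat <- data.frame(Al2O3_XRF = c(12.3, 15.1, 9.8))   # Al2O3 in wt% (invented values)
    # 1 wt% corresponds to 10 000 mg/kg; the gravimetric factor Al/Al2O3 is
    # 2*26.98 / (2*26.98 + 3*16.00), i.e. approximately 0.529
    dat$Al_XRF <- dat$Al2O3_XRF * 10000 * (2 * 26.98) / (2 * 26.98 + 3 * 16.00)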
Most statistical software, including R, require as input a rather simple data format, where
the first row identifies the variables and all further rows contain the results to be used
during data analysis. Some further requirements will be discussed below. DAS+R will read
such files, but accepts data files with additional auxiliary information like project name, a
description of sampling and analytical methods used, the lower and upper detection limits,
and the units in which the data of each variable are reported; these can be directly stored
(and thus retrieved during data analysis) with the data. This special DAS+R data file format
is constructed such that it can be turned into a “normal” R file with no more than two
commands.
The most simple file format that can be read by almost all statistical data analysis packages
consists of a header row, identifying the variables, and the results following row by row
underneath (Figure 2.1). Such a file should be stored in a simple “csv” (variables and results
separated by a comma: csv = comma separated values) or “txt” format.
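Importing such a file into R is then a one-line operation; a minimal sketch with a hypothetical file name is:

    # empty cells become NA automatically; a coded missing value such as -9999
    # can be declared explicitly via na.strings
    dat <- read.csv("chorizon.csv", na.strings = c("", "-9999"))
    str(dat)   # check that numeric variables were imported as numbers, not as text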

Figure 2.1 Screen snapshot of a simple Microsoft ExcelTM -file of Kola Project C-horizon results. These
are ready for import by most statistical data analysis packages, including R, once stored in a simple
format, e.g., the “csv” format

The special DAS+R file format, fine-tuned for use in applied geochemistry, allows the
storing of much more information about the data and about each single variable, so it can be
retrieved and used during data analysis. This format requires an empty first column, providing a
number of predetermined key words (up to eleven) telling the software about
the auxiliary information that it is expected to handle (see Figure 2.2). The sequence in which
the keywords are provided is flexible, only HEADER and VARIABLE need to be specified,
none of the other keywords need to be used, the software checks for the keywords and stores
the information if it detects one or several of the keywords. Of course other variable-relevant
information than “EXTRACTION” and “METHOD” could be stored using these fields. The
keywords are:

HEADER: holding information on the data file, e.g., “Kola Project, regional mapping 1995,
C-horizon”.
COMMENT DATASET: this row can contain free text with additional comments that will
be kept with the data and are valid for the whole file or large parts thereof, e.g., “<2 mm
fraction, air dried, laboratories: Geological Survey of Finland (all ICP and AAS results);
Geological Survey of Norway (all XRF results); ACTLABS (all INAA results)”.

These two keywords are not linked to any specific variable but to the data file as a
whole.

Figure 2.2 Kola Project Moss data file prepared with auxiliary information (project name, sample type
and preparation, analytical method, detection limit, unit) that may be needed during data analysis.
This format will be accepted by DAS+R. If the data file is to be used outside of DAS+R, for example in
R, the first column and the uppermost ten rows should be deleted or commented out

SAMPLE IDENTIFIER: this keyword is used to identify the table column (variable) that
contains the sample or site number (or code) via entering “ID” in the column that contains
this information in the VARIABLE record. This can also be done later after the data are
imported into DAS+R if necessary. If no sample identifier is provided, the samples are
numbered from 1 to n, the number of observations (samples), and identified using this
number in the graphics and tables where samples are identified.
COORDINATES: used to identify the two columns holding the information about the
geographical coordinates via entering “XCOO” (east) and “YCOO” (north) in the columns
containing this information. If the two variables containing the coordinates are not iden-
tified, mapping and spatial functions within the software will not be available during data
analysis.
COMMENT VARIABLES: can hold a free comment that is linked to each single variable,
e.g., a remark on data quality or number (or percentage) of samples below detection.
EXTRACTION: holds a second comment linked to each variable separately like “aqua regia”
or “total” (or any other variable-related comment).
METHOD: holds a third comment, the method of determination for each variable like
“ICP-AES” or “XRF” (or any other variable-related comment).
UDL: the value of the upper limit of detection for each variable (if not applicable for the data
at hand, it is simplest to not provide this keyword and row), e.g., “10000”.
LDL: the value of the lower limit of detection for each variable, e.g., “0.01”.
UNIT: the unit of the measurement of each variable, e.g., “mg/kg”, “wt-percent”, “micro g/kg”
(it is generally wise to try to avoid “special” signs like “%” or “µ” – see above), and finally
VARIABLE: the line usually starting with the sample identifier and providing the coordinate
and variable names. Attention: a “unique” name must be used for each variable. If the same
element has been analysed by different methods a simple text or numeric extension can be
used. Note that variable names should not contain blanks.

The data array can contain empty cells (missing values, see above, R replaces empty cells by
the code NA – not available) but should normally not contain any special signs like “<”, “>” or
“!” if the variable is to be used for statistical analyses. Text variables or variables consisting of
a mixture of text and numbers are allowed and will be automatically recognised as such. They
can for example contain important information that is linked to each sample and can be used
to create data subsets or groups (see Chapter 8). Figure 2.2 shows the above example file in
the special DAS+R format with all auxiliary information (except the upper limit of detection)
provided.
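If such a DAS+R-format file is to be read into plain R, the keyword rows and the first column simply have to be skipped, as noted in the caption of Figure 2.2. A minimal sketch, with a hypothetical file name and assuming ten auxiliary rows above the VARIABLE line, is:

    raw <- read.csv("moss_dasr.csv", skip = 10)   # the VARIABLE line becomes the header
    dat <- raw[, -1]                              # drop the first column holding the keywords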
Note that different types of variables exist and can be used in DAS+R. When importing a
data file, DAS+R will try to allocate each variable automatically to a certain data type. These
types are later displayed in the software and can be changed. It is important to assign the
“correct” data type to each variable. It determines what can be done with this variable during
data analysis. If the automatic assignment is wrong, this should be a clear indication to check
the data for this variable for inconsistencies before continuing with data analysis.

Logical: TRUE/FALSE or T/F, e.g., used to identify samples that belong to a certain subset.
Integer: a number without decimals.
Double: any real numerical value, i.e. a number with decimals.
Factor: a variable having very few different levels, which can be a character string (text) or
a number or a combination thereof.
Character: any text; the sample identifier (sample number) is a typical “character” variable
because it is the unique name of each sample. It can consist of a number, a text string, or
combination of text and numbers.
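In R itself the assigned types can be inspected and, where necessary, corrected after import; the variable names in the following sketch are hypothetical:

    str(dat)                            # lists every variable with its assigned type
    dat$LITHO <- as.factor(dat$LITHO)   # few distinct levels: treat as factor
    dat$ID    <- as.character(dat$ID)   # sample identifier: treat as character
    # a concentration variable imported as text usually signals stray characters
    # such as "<" in the data; check these before forcing a conversion with as.numeric()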

2.2 The detection limit problem

A common problem in the analysis of applied geochemical and environmental data is that a
data set as received from the laboratory will often contain a number of variables where a certain
percentage of the samples return values below the detection limit (DL) (see, e.g., Levinson,
1974; AMC, 2001). There are both lower and upper DLs for any analytical method – in
practice it is the lower DL that will most often be of concern in applied geochemistry and
environmental projects (when analysing ore samples or samples from an extremely contami-
nated site, the upper DL of the method may also need consideration). In a spreadsheet or file
as received from the laboratory these samples will most often be marked as “<DL” (or “<rl”,
“<reporting limit”), where DL is the actual value of the detection limit (e.g., 2 mg/kg). While
values marked “<” are desirable in a printed data table, they cannot be used for any numerical
processing or statistical studies. Means and standard deviations cannot be computed for sam-
ples marked “<” – and graphics or maps cannot be easily prepared. Because most statistical
tests are based on estimates of central tendency (e.g., MEAN) and spread (e.g., variance or
standard deviation), such tests cannot be carried out.
Thus selecting analytical procedures that provide adequate DLs is one of the most important
considerations in planning any applied geochemistry project. The (lower) DL is commonly
understood to be the smallest concentration that can be measured reliably with a particular
technique. Different analytical techniques will have different DLs for the same element, and
different laboratories may well quote different DLs for the same element using the same
analytical technique. In addition the DL will change with the model of the instrument, the
technician performing the analysis, and the solid-to-liquid ratio in any procedure requiring
sample dissolution.
The choice of the analytical method for a given project is often a compromise between cost,
a careful evaluation of expected concentrations, and the need to have a complete data set for
the elements required (e.g., for geochemical mapping). In many instances the practitioner will
end with a data set containing a substantial number of “<” signs. As mentioned above, these
create problems in data analysis. There are several approaches to dealing with variables where
some analytical results are “<DL”. Options are:

• Delete the whole variable or all samples with values “<DL” from data analysis;
• Mark all observations “<DL” as missing;
• Model a distribution in the interval [0, DL], and assign an arbitrarily chosen value from this
distribution to each sample <DL;
• Try to predict a value for this variable in each sample via multiple regression (imputation)
techniques using all other analytical results; or
• Set all values marked “<DL” to an arbitrarily chosen low number, e.g., half the DL.

None of these solutions is ideal. To delete samples from data analysis is not acceptable, as it
will shift all statistical estimates towards the “high” end (or “low” end – in the rare case
of observations above the upper limit of detection), although there is information that the
concentration in a considerable number of samples is low (high). The same happens if the
values are marked as “missing”.
Modelling a distribution based on the existing data and an assumption about the shape of
the distribution (e.g., lognormal) will give the statistically most satisfying result. On the other
hand it is then no longer easy to differentiate between true (measured) values and “modelled”
values that have no geochemical legitimacy beyond the model, especially in the multivariate
context. This is again an undesirable situation.
To predict the values <DL, for example via regression techniques using all other analytical
results or kriging (for spatial data – Chapter 5), is only sensible when the value is really
needed, e.g., for mapping. The problem of differentiating between true (measured) values and
“modelled” values that have no geochemical legitimacy beyond the model, especially in the
multivariate context, remains.
Thus the most widely applied solution to the “<DL” problem is to set the value to an
arbitrarily chosen value of half the DL, some practitioners use the value of the DL (but note
that there may be “true” measurements at this value, thus 1/2 of the DL is the better choice).
This is easy to do in a spreadsheet and works well as long as the number of values <DL is
relatively low (say less than 10–15 percent of all samples). A more realistic value than half
the DL can often be estimated from an ECDF- or CP-plot (see Chapter 3). This method is
acceptable for constructing statistical graphics or maps – the value of half the DL is used as
a “place holder” and marks “some low value” in the graphics. Often it is easy to recognise
<DL situations in a graphic, e.g., a scatterplot (e.g., Figure 18.8) or an ECDF- or CP-plot
by a straight line of values at the lower end of the data (e.g., Figure 18.7). An element that
contains values below the detection limit is called “censored”. In the case of data below the limit
of detection, the data are said to be “left censored”; conversely, in the case of data above the upper
limit of detection, they are said to be “right censored”.
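The replacement of left censored values by half the detection limit takes only a few lines in R; the sketch below uses an invented vector of reported results with a detection limit of 2 mg/kg:

    x <- c("5.2", "<2", "3.1", "<2", "8.4")    # results as reported by the laboratory
    below <- grepl("^<", x)                    # which entries are left censored?
    x.num <- as.numeric(sub("^<", "", x))      # remove "<" and convert to numbers
    x.num[below] <- x.num[below] / 2           # censored entries set to half the DL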
To just replace the values <DL by an arbitrary low value is not, however, acceptable if we
need to calculate a “reliable” estimate (e.g., for complying with some environmental regulation)
of central tendency and spread (see Chapter 4), as needed for most statistical tests. Helsel and
Hirsch (1992) and Helsel (2005) provide a detailed description of these problems and possible
solutions. A first and easy solution is to use MEDIAN (Section 4.1.4) and MAD (median
absolute deviation, see Section 4.2.4) as estimators of central value and spread instead of
MEAN (Section 4.1.1) and SD (standard deviation, Section 4.2.3) (but remember that the
values representing “<DL” need to be kept in the data set as the lowest (highest) values for a
correct ranking). MEDIAN and MAD can be used as long as no more than 50 per cent of the
data are below the limit of detection.
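Both estimators are directly available in R; continuing the small invented example from above:

    median(x.num)   # robust estimate of the central value
    mad(x.num)      # robust estimate of spread (scaled median absolute deviation)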
To further complicate the issue, analytical methods exist where the laboratory will report
different DLs for different samples depending on the matrix of the actual sample (typical
for Instrumental Neutron Activation Analysis, INAA). It is also not unusual to find different
detection limits over time because a laboratory has improved its methodology or instrumen-
tation, or when employing different laboratories reporting with different DLs. One possible
solution in this situation is to use the value of half the lowest DL reported as “place holder”
for all samples <DL.
To solve these problems, the reporting of values below detection limit as one value has been
officially discouraged in favour of reporting all measurements with their individual uncertainty
value (AMC, 2001). However, at the time of writing, most laboratories are still unwilling to
deliver instrument readings for those results that they consider as “below detection” to their
clients and the replacement of values that are marked “<DL” by a value of half the DL prior
to data analysis is still a practical, if not perfect, solution for many applications.
In general a conscious decision is required as to whether a variable with censored data is to
be included in a multivariate data analysis. A variable can, for example, contain 10 per cent of
censored data, an amount that could probably be handled by the selected multivariate method.
However, a second variable can also contain 10 per cent of censored observations, but the
samples with censored values do not need to be the same as those with censored values in
the first variable. In the worst case all censored values occur in different samples and this
will accumulate to 20 per cent of censored observations for the multivariate data set. Before
entering multivariate analysis, it is thus important to check how many observations (samples)
are plagued with missing values for any variable. The proportion of such samples should be as
low as possible. If standard multivariate methods are to be used, the total proportion of samples
with censored data should probably not exceed 10 per cent. Even when planning to use robust
methods (which could in theory handle up to 50 per cent of “outliers”), care is necessary
because further data quality issues or missing values in the remainder of the data set may exist.
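Such a check can be scripted before any multivariate method is applied; the sketch below assumes a hypothetical data frame raw that still holds the laboratory text entries and a hypothetical selection of variables:

    vars <- c("As", "Bi", "Cd")                 # variables of interest
    cens <- sapply(raw[, vars],
                   function(v) grepl("^<", as.character(v)) | is.na(v))
    colMeans(cens) * 100            # per cent censored (or missing) per variable
    mean(rowSums(cens) > 0) * 100   # per cent of samples affected in at least one variable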
The problem of censored data can create a serious dilemma when dealing with several
sample materials that are to be compared. For example, when excluding variables with more
than 5 per cent of censored data from all four sample materials collected during the Kola Project
(moss, O-, B-, and C-horizon), only 24 (out of more than 40) variables remain in common for
direct comparison.
To avoid as many “DL-problems” as possible, one of the most important first steps in the
design of the Kola Project was to construct a list of all the chemical elements for determination,
setting a “priority” for each element (how important is the element for project success), and
identifying the detection limit that was needed to obtain a complete data set for this element
for all of the sample materials. This list can then be compared against the analytical packages
offered by different laboratories together with their respective detection limits. The list can
then be used to discuss detection limits, costs (if appropriate), and terms with the most suitable
laboratories. This list is of course an “ideal” that cannot be achieved for all elements. For some
elements no analytical technique may exist that provides a low enough DL for obtaining a
“complete” data set (no sample <DL) at the time of the project. For others, procedures may
exist but be so expensive that it is not possible to use them within a given budget. Thus in the
end the final choice of the elements to be determined, the detection limits, and the choice of
the laboratory that undertakes the work will be a compromise. It is always good practice to
archive a complete sample set to be recoverable later when analytical techniques are improved
(or cheaper) or interest in some element(s) arises that was of no interest at the time of the
original project.

2.3 Missing values

In any large data set some values may be missing due to a variety of reasons. It may be that
there was not enough sample material to carry out a certain determination; it may be that a
single number got lost somewhere in the data processing. Such situations lead to values that
are missing at random. A different situation arises if there are missing samples due to some
particular event. In that case the missing values could be concentrated in particular parts of the
data distribution or survey area.
If any variable or sample has many missing values, this will again create problems for
statistical data analyses. In contrast to censored data having some information value (e.g.,
that the concentration in the sample is very low, lower than the “limit of detection”), missing
data are “no information” and are not limited to low (or very high – upper limit of detection)
concentrations. Several possibilities to treat missing values exist:

• Delete the whole variable or all samples with missing values from data analysis;
• Set the missing result to an arbitrarily chosen value; or
• Try to predict the missing values, for example via multiple regression (imputation) techniques
using all other analytical data (Chapter 16).

None of these suggestions is an optimal solution. The most usual approach to the missing value
problem is to just ignore them. This is a possibility as long as the number of missing values
is low. When a data set contains missing values, it is generally a good idea to only use robust
statistical methods.
To predict a missing value, for example via regression techniques (imputation – Chapter 16),
using all other analytical results or kriging (for spatial data – Chapter 5), is only sensible when
the value is really needed, e.g., for mapping.
If the missing values do not occur at random, this will result in biased statistical estimates,
independently of the selected approach for accommodating the missing values.
For multivariate data analysis the same problem of “accumulating” missing values occurs
as described above for “<DL” data. Therefore, obtaining as complete a data set as possible is
always an important consideration during project design.
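A simple screening for missing values, per variable and per sample, can be done in R in two lines (the data frame dat is hypothetical):

    colSums(is.na(dat))          # number of missing values in each variable
    table(rowSums(is.na(dat)))   # how many samples have 0, 1, 2, ... missing values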
2.4 Some “typical” problems encountered when editing a laboratory data report file to a DAS+R file

A data file as received from a laboratory will usually contain a lot of text that is necessary
only for general information purposes. Figure 2.3 shows a screen snapshot of one of the Kola
Project result files as received from the laboratory. This kind of file cannot be easily imported
by most data analysis software. The file will probably contain information on the analytical
technique(s) used for the determination of each element, the detection limit, the unit of the
measurement for each variable, and more. Some of these text lines contain the information
DAS+R can hold in the first 11 comment rows, others may need to be deleted or edited. Files
may contain “accessory” information or comments at the beginning as well as at the end of
the data array (check the bottom line of any file – see Figure 2.3), or this information may be
given in other worksheets (check for multiple work sheets!) within the file. If the laboratory
has invested some time in table design and merged cells somewhere, e.g., at the top of the file,
this will create severe problems when trying to import such a file into data analysis software.
In contrast an “ideal” file for most data analysis packages will only contain one row identi-
fying the variables (with the first variable being the “Identifier” for the sample (i.e. the sample
number), and then each following row will contain the identifier and the analytical results

Figure 2.3 One of the “original” laboratory data files as received for the Kola Project C-horizon results
for all variables listed in the first row in the same sequence (see above). If possible such a
file format should be agreed upon with the laboratory at an early stage of the project. The
“optimal” data report file from a laboratory will look like the files in Figures 2.1 or 2.2 rather
than like a “nice table”.

2.4.1 Sample identification


Laboratories often add their own “laboratory number” to all samples received for analysis
according to the laboratory’s standards (see Figure 2.3). This laboratory number is used as a
“unique” number for storage in the laboratory’s database. Make sure that the laboratory reports
the results according to the original sample numbers and not with these “new” laboratory
numbers only. In the case where samples were randomised and given new numbers prior to
submission (see Chapter 18), it is necessary to ensure that the laboratory analyses the samples
in the exact sequence of the new numbers and not in the sequence of the laboratory’s own
numbers. This requires that the “submitter” must keep and safely back-up the table that relates
the original sample numbers to the new randomised sequence numbers that are on the physical
samples submitted to the laboratory.

2.4.2 Reporting units


Units used to report the values are often quite different from variable to variable and information
on units is provided somewhere in the data file. In the given example (Figure 2.3) several
different units as identified in row five are used for the few variables (µg/kg, mg/kg, g/kg) visible
in the screen snapshot. Thus the units of analysis are important information. Unfortunately,
most statistical data analysis packages do not appreciate the presence of this row. For many
packages this row disrupts the connection between variable names and values. Keeping the
row in this position will create severe problems in entering the data into most data analysis
software, deleting it will create severe problems during data analysis when it becomes important
to know the units! It is tempting to think that this need not be a problem because the unit can
usually be guessed from the analytical values, and if not it is always possible to go back to the
“original” data file (good if “the original” data file was stored somewhere safely as suggested
above – a problem if this was not done and the units were deleted for data analysis). It is
often not so easy to “guess” the correct unit of a measurement just from the analytical values.
Thus information on the unit needs to be preserved – but where and how? DAS+R can store
the information about the unit, but it is often better to transfer all results to one unit only.
For geochemical projects “mg/kg” (mg/L for water) is the most universal unit. This approach
cannot be followed in all situations because, for example, pH measurements come without
a unit, conductivity measurements are reported in mS/m, and radioisotopes are measured in
Bq/kg, and all these may appear in the same data file. It is, nevertheless, a good start to “unify”
the units as much as possible. In case another unit is needed later on for “nice looks”, for
example in a data table, it is easy to transform any values to that unit. It is also necessary to
check that the unit the data are reported in is not changed during the lifetime of a project or
even within just one data file (e.g., different staff measuring different batches of samples and
using different reporting units!). It is often possible, and advisable, to agree on the unit for
reporting measurements when formulating an analytical contract with a laboratory. That way
a lot of unnecessary and always dangerous file editing can be avoided.
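The unit conversions themselves are trivial once the data are in R; a minimal sketch for a hypothetical trace element reported in microgram per kg is:

    dat$Au <- dat$Au / 1000   # 1000 microgram/kg = 1 mg/kg
    # a variable reported in g/kg would instead be multiplied by 1000 to obtain mg/kg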
2.4.3 Variable names


Variable names need to be unique. For the laboratory it may be sufficient to use the same
variable name if the method code or extraction is identified as different in another row of the
data file – for data analysis this is not sufficient! If one and the same element is analysed
by a variety of techniques or in a variety of different extractions, this information needs to
be incorporated into the variable name (e.g., Ca_aa, Ca_ar and Ca_tot for Ca determined in an
ammonium acetate extraction, Ca determined in an aqua regia extraction and Ca determined by
XRF or INAA, giving total concentrations). The same applies when files with different sample
materials are linked (e.g., Ca_MOSS, Ca_OHOR, Ca_BHOR, Ca_CHOR). Note that most data
analysis programs cannot handle variable names containing a space – it is thus advisable to use
another sign as separator and an underscore is a common solution. If material and method need
to be included in a variable name (e.g., Ca CHOR ar, Ca CHOR tot), very long variable names
may result. Note that lengthy variable names may not be convenient during data analyses (e.g.,
when the variable names are displayed in graphics).
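When two files are linked in R, unique and space-free names can also be generated automatically; the data frames in the following sketch are hypothetical, and the identifier and coordinate columns keep their original names:

    keep <- c("ID", "XCOO", "YCOO")      # columns that keep their names
    elem <- setdiff(names(moss), keep)   # element variables only
    names(moss)[names(moss) %in% elem] <- paste(elem, "MOSS", sep = "_")   # Ca -> Ca_MOSS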

2.4.4 Results below the detection limit


The laboratory will have marked results below the detection limit (Section 2.2) somehow –
either via a “<” sign followed by the value of the detection limit or by the actual reading of
the instrument marked in some specific way as “below detection” (e.g., via an exclamation
mark before or after the result or by a “−”). Such markers are acceptable in a printed table,
however, a computer or a data analysis package will not know what to do with such special
symbols. In the example file (Figure 2.3) the laboratory has managed to use different ways
to identify measurements that fall below the lower limit of detection. For some variables a
“!” marks such values, for other samples a “<” was used. As most data analysis software
cannot handle such special signs they need to be removed. However, these signs cannot just
be deleted because the file will most likely contain true measurements with exactly this value,
and then these true values could no longer be differentiated from values that are lower. It
may also be tempting to just delete the contents of the whole cell marked as “lower than
the detection limit”. However, an empty cell denotes a “missing value”, a value “below de-
tection” denotes a value too small to measure with the chosen analytical method, which is
an important difference when making statistical estimations. It is also no solution to replace
these values with a “0” – for some variables “0” may be a true data value; importantly, “0”
will often provide a far too low estimate of the real concentration, and in case there are any
“0” in a data file, logarithmic transformations become impossible. One standard solution to
the situation (the solution chosen for the Kola data) is to replace all values marked as “lower
than detection limit” by a value of one half of the detection limit. Therefore, in the example
data file from the Kola Project all values marked with a “!” or a “<” need to be replaced
by a value of one half of the detection limit. Before starting to replace the “<” sign, it is
necessary to check that the laboratory is operating with one detection limit per variable only.
It may happen that the laboratory has operated with different detection limits, depending on
the matrix of the samples (a common situation in neutron activation analyses (NAA or INAA
(“I” for “instrumental”)). When replacing all “<” values with one half of their value in such
a situation some of these replaced values may suddenly be in the range of “real” values. Thus
this situation requires special care (one possible solution is to replace all the “<” values with
one half of the lowest DL value reported – see Section 2.2). Afterwards it is easy to use the
“find” and “replace” functions in spreadsheet software to get rid of all the “<” signs (and of
the “!”).
When actually doing this for the example file, it turned out to be more difficult than antici-
pated – the laboratory had used blanks (sometimes none or one, two, three, four, and even five
blanks) between the actual number and the sign. Each of these possibilities thus needs
to be checked until, in the end, there are no more “<” or “!” in the data file. Thus great care
is necessary when replacing all values below detection with a value of one half of the detection
limit. To avoid confusion (e.g., when a file contains values <1 and <10) this should be done
variable by variable when using “find” and “replace” options of the spreadsheet software rather
than for the whole data file at once. It is good practice to agree with the laboratory beforehand
exactly how values below (or above) detection limit are to be reported. As can be seen this
editing process can be error prone; this just emphasises the importance of making a back-up
copy of the original file received from a laboratory, so it can always be referred to in case it is
thought that an editing error may have occurred.
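
This replacement can also be scripted after import, which avoids repeated manual “find” and “replace” passes. The following lines are only a minimal sketch: the data frame, the column name (Cd) and the detection limit of 0.05 mg/kg are hypothetical and must be adapted to the actual laboratory file.

# Sketch: replace results flagged with "<" or "!" by half the detection limit
raw <- data.frame(Cd = c("0.32", "< 0.05", "<0.05", "1.20", "!  0.05"),
                  stringsAsFactors = FALSE)
dl <- 0.05                                    # detection limit for this variable
below <- grepl("[<!]", raw$Cd)                # cells flagged as below detection
vals <- as.numeric(gsub("[<! ]", "", raw$Cd)) # strip markers and stray blanks
vals[below] <- dl/2                           # substitute half the detection limit
raw$Cd <- vals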

2.4.5 Handling of missing values


Empty fields may exist in a data file, indicating “missing values” (Section 2.3), e.g., a sample
that got lost during transport or where there was not enough material available for all the
different analytical techniques that were ordered. Never replace empty fields by a number (for
example it could be tempting to use a “0”). Empty fields indicate “no information”; if these
fields are replaced with zeros, which may be valid measurement values, their original meaning
will be lost. The “empty” information thus needs to be preserved or replaced by a term that the
software recognises as “missing value”. In Figure 2.3 in row 14 (and row 798) the special sign
“−” was used by the laboratory to identify whole samples or single variables for a sample that
were not analysed. Here the “−” thus marks “missing values” – a dangerous convention, because
the same sign could also denote a negative value. Such a “−” should immediately be replaced by
an empty cell. When doing this replacement,
it is wise to use the “find entire cells only” option to ensure that no real “−” sign, identifying
possibly existing negative values, is replaced. Also, it is not a good idea to just delete a whole
row with “missing values”. At some point it may either be important to identify the samples
with missing values, to know the absolute number of missing values, or another data file will
be linked with the first data file, and this new file may contain data for the sample that contains
missing values in the first file. The standard in R for identifying missing values is to use NA
(not available) as the marker for a missing value; NA is automatically inserted when a data file
containing empty fields is imported into R (or DAS+R). Thus empty fields should be left empty
when editing the data file.
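
A minimal sketch of such an import, assuming a hypothetical file name and assuming that “−” was only ever used by the laboratory as a missing-value marker (if genuine negative values can occur, “−” must not be listed in na.strings):

# Empty cells and the strings listed in na.strings become NA on import
dat <- read.csv("C_horizon.csv", na.strings = c("", "NA", "-"))
sum(is.na(dat$Ag))            # number of missing values for one variable
dat[!complete.cases(dat), ]   # rows containing at least one missing value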

2.4.6 File structure


In case the file contains a column identifying the variables and the results follow in columns
rather than in rows, it is possible to use the “copy” and “paste special – transposed” features
in a spreadsheet to produce a transposed file that follows the above standard. Again it is essential
to ensure that the original data file is safely stored somewhere before attempting to transpose
the file so that it cannot be destroyed by this procedure.
Data tables as received from different laboratories as spreadsheets may have a very dissimilar
structure, and thus there is no simple way around a substantial amount of file editing, no matter
whether these files are to be incorporated into a database or directly imported into a data
analysis package.

For R (outside DAS+R) there exists the possibility to mark several leading (top) rows with
a “#”. The software will ignore lines marked this way (just like comment lines in a computer
program) when importing data, and the information is at least not deleted. Data files prepared
with all auxiliary information for DAS+R can thus be directly read into other R-routines using
this facility without having to re-edit the file.
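
A minimal sketch of such an import with a hypothetical file name; read.table() skips everything that follows the comment character (read.csv() switches comment handling off by default, which is why read.table() is used here):

# Leading documentation rows starting with "#" are ignored on import
dat <- read.table("C_horizon_documented.csv", header = TRUE, sep = ",",
                  comment.char = "#")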

2.4.7 Quality control samples


However, before the file is really ready for import into R or DAS+R, more work may still
be required. Hopefully the file contains Quality Control (QC) samples (i.e. duplicates and
standards inserted among the “real” samples). These QC-samples must be removed prior
to data analysis (and used for quality control (see Chapter 18)). The “key-file” identifying
QC-samples needs to be used to identify and remove the QC samples from all other samples.

2.4.8 Geographical coordinates, further editing and some unpleasant limitations of spreadsheet programs

For geochemical mapping, and thus for identifying the location of unusual results, the samples also
need coordinates, which must also be linked to the data file. Furthermore,
analytical results may be received from more than one laboratory, and all these different results
will be needed in just one large file for data analysis. A “cut” and “paste” job in a spreadsheet
has to be carried out with great care. If very many variables exist, or if results for different
materials are to be linked into just one file, it is possible to reach the limits of the most common
spreadsheet software packages. These are usually limited to 256 columns. In such a case it is
also possible to link the different files in R (or DAS+R). Alternatively, those users familiar
with database software, e.g., Microsoft Access™, may choose to import the different field,
laboratory, and other tables into the database manager and undertake the editing and final table
construction in that environment.
Many packages used for data analysis will not be able to read the special spreadsheet file
format (e.g., “.xls” files). Furthermore, software manufacturers sometimes change their formats
with each upgrade of their spreadsheet software with the consequence that older versions of
the program are no longer able to read files that were saved in a newer version. This can turn
out to be a real nuisance when data files are to be exchanged between colleagues. Thus the
“final” file should be saved separately in both the spreadsheet format (for easy handling and
editing) and in a simple text or ascii format. For this it is best to use the “.csv” format, which
is a simple file format where the variables and values are separated by commas. To use the
“.csv” format, it is necessary to make sure that the comma was not used as decimal sign in
the file. This may actually even depend on the “country” setting of the spreadsheet used for
editing. Note that the “.csv” format does not support multiple worksheets; information stored
in additional worksheets will be lost when saving a former Microsoft Excel™ file in “.csv”
format. The major advantage of this format is that almost all “foreign” software packages are
able to import the “.csv” format. This is the file format used to import all Kola data into R (and
DAS+R).
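
A minimal sketch of saving and re-importing such a file from within R; the object and file names are placeholders only:

# 'final_table' stands for the edited data table held in R
write.csv(final_table, file = "C_horizon.csv", row.names = FALSE)
chorizon <- read.csv("C_horizon.csv")   # re-import; "." as decimal sign, "," as separator
# for files with "," as decimal sign and ";" as separator use read.csv2() instead
str(chorizon)                           # check that variables have the intended types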

2.5 Appending and linking data files

The Kola Project was a typical multi-element multi-medium regional geochemical mapping
project. In consequence there is not only one regional data set, but there are four such regional
data sets, one for each sample material (there are actually even more data sets, one for
Topsoil (0–5 cm) and one for lake water (Russia only)). Thus similar files with analytical
results were received for the C-, B- and O-horizon of Podzol profiles and for terrestrial
moss, all collected at the same sample sites. All these files needed to be prepared accord-
ing to the above example. The prepared files are provided in “.csv” format on the web page
http://www.statistik.tuwien.ac.at/StatDA/R-scripts/data/.
However, it may be required to do more with the data than just look at them layer by layer.
A direct comparison between the different sample materials will be of interest, it may thus be
necessary to be able to produce diagrams and statistics using all four layers at the same time.
For that it is necessary to merge the different data sets in some sensible way. This can actually
be done in two different ways, and depending on how it is done, different comparisons and
diagrams can be produced.
For example, it is possible to just add additional rows to the “first” file (Moss or C-horizon)
for each additional material. Hopefully as many variables as possible are identical (i.e. the
same analytical program was used for all materials), otherwise this is a substantial data-editing
job, which needs a lot of care. Now the sample material needs to be identified in one of the
variables, e.g., as “MOSS”, “OHOR”, “BHOR” and “CHOR”. Using this variable it is then
possible to prepare “data subsets” according to sample material and compare results for these
subsets in tables and different diagrams. It is possible to append data directly in DAS+R.
However, it may also be necessary to be able to make a direct comparison sample for sample.
This is not easily achieved with such a file structure.
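
A minimal sketch of the first approach, appending the layers row-wise with a material identifier; the data frame names (moss, ohor, bhor, chor) and the element Cu in the last line are placeholders, and only the variables common to all four files are kept:

moss$Material <- "MOSS"; ohor$Material <- "OHOR"
bhor$Material <- "BHOR"; chor$Material <- "CHOR"
common <- Reduce(intersect, list(names(moss), names(ohor), names(bhor), names(chor)))
all_layers <- rbind(moss[, common], ohor[, common], bhor[, common], chor[, common])
# data subsets according to sample material can now be compared, e.g.
boxplot(Cu ~ Material, data = all_layers, log = "y")
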
Hopefully this possibility was considered when designing the fieldwork and sample-
numbering scheme (data analysis starts with project design). Many organisations may require
that a “unique sample number” per sample should be used because organisations such as a
Geological Survey often build a project-independent database and want to minimise the chance
of mixing up sample numbers. However, working with unique sample numbers can create a
nightmare during data analysis because it is necessary to somehow tell the data analysis soft-
ware which samples are related. In cases where unique sample numbers were used, this will
have to be done sample by sample – an unnecessary and time-consuming task for a large data set.
For the Kola Project this issue was considered during project design as it was recognised
that at some stage it would be informative to be able to compare results for the different sample
materials. The sample number thus included a unique component that identified each catchment
where all samples that belong together were taken. An additional unique code identified the
sample material. Care was taken that these codes were obviously different from one another in
order to minimise the chance of mixing up different materials (e.g., C-, B-, and O-horizon soil
samples) in the field or later during sample preparation in the laboratory. For example, parts of
the number can disappear from a sample bag during transport and handling of the samples and
it should still be possible to identify the material and the identity of the sample beyond doubt.
Thus “CHOR”, “BHOR” and “OHOR” (as in the data files) were not used in the field; instead “P5C”,
“P4B” and “HUM” (for “humus”) were used. In consequence, all samples that were taken at one and the
same sample site can then be easily linked via the sample number (ID in the data files). It is
only necessary to add on the data from the additional materials as additional columns.
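
A minimal sketch of such a link for two of the layers; the data frame names and the element Ag are placeholders, and the suffixes encode the sample material:

# link C- and B-horizon results site by site via the common sample ID
wide <- merge(chor, bhor, by = "ID", suffixes = c("_C", "_B"), all = TRUE)
# results for the same catchment can now be compared directly, e.g.
plot(wide$Ag_C, wide$Ag_B, log = "xy",
     xlab = "Ag in C-horizon [mg/kg]", ylab = "Ag in B-horizon [mg/kg]")
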
Data analysis software is often not able to handle a file containing several variables with
the same name (R will recognise such a situation and “number” such variables, e.g., Ca1, Ca2,
Ca3). For the Kola data set, in many instances, four variables exist with the same name, e.g.,
the variable silver (Ag) in the C-, B- and O-horizons, and moss. Thus the sample material needs
to be indicated in the variable name. Again the problem arises that variable names tend to get
too long when a lot of information needs to be incorporated (e.g., different extractions and/or
analytical techniques used, different grain size fractions analysed, different sample materials
analysed). It may thus be beneficial to limit the number of variables included in this file to the
“critical variables” where results between the materials are really comparable and most of the
samples returned values above the detection limit.
As mentioned above (Section 2.4.8), spreadsheet software has clear limitations as to the
number of variables that one file can contain. At present this limit is 256 – with four different
sample materials and about 60 variables determined for each, this limit is easily reached when additional field
information needs to be incorporated into the file. To append data to a spreadsheet requires
that at the very beginning a complete set of “sample numbers” (and coordinates) existed for
all materials and that no rows in the data files were deleted just because only “missing values”
were present for one of the samples in one of the different layers sampled. It is also necessary
to treat the field and analytical duplicates in the different layers the same way (i.e. always
removing the second result), otherwise a considerable amount of extra editing work can result.
Instead of doing all this in a spreadsheet where this kind of work requires a lot of attention,
files can be directly linked in DAS+R using the link command, which will automatically
link all samples with the same ID. Or alternatively, this can be done externally in a database
package.
In the end there may exist as many as six separate data files: one for each sample material
(e.g., CHOR, BHOR, OHOR, and MOSS), one with all results appended and the sample material encoded
in a separate variable for the creation of subsets according to sample material, and a last file
with four results per analysed element (e.g., Ag_C, Ag_B, Ag_O, and Ag_M), one for each
material. Using six different files can create problems if results of one sample are changed or
deleted somewhere in one data file at some point during data analysis – it is thus necessary to
ensure that all the files are kept updated to the newest version at all times, as
is done when using a database.

2.6 Requirements for a geochemical database

A geochemical database can be relatively easily built using existing software solutions like
Oracle™ or Microsoft Access™. A number of questions need to be addressed when building
a geochemical database. A database needs an administrator to guarantee its longevity. It is
very difficult to build an “ideal” institutional database for an organisation because this requires
an almost inhuman amount of foresight in terms of what kind of work will be carried out in
future projects. A database constructed without an in-depth system analysis can easily turn
into quite a secure data grave, with the guarantee that the data are there and safely stored but
never used because they are too difficult to retrieve in the form needed for further data analysis.
Even building a good database for just one project requires a thorough system analysis – what
are the data going to be used for, and in which form does one need to recover the data for
data analysis are some of the very first questions that need to be answered before starting to
construct a project database.
In general the data should be entered into the database as soon as possible after collection
to minimise the chance that any are lost or important details are forgotten. At first sight,
constructing a database may look like an additional bureaucratic effort; in the long run it will
pay back the time spent on database design. These days most data are provided by laboratories
in the form of spreadsheets, e.g., Microsoft Excel™ or Quattro Pro™ files. Unfortunately, such
files are not a secure medium to archive data. However, they do provide a widely available and
familiar tool for original data capture and reporting, and some table re-arrangement functions.

2.7 Summary

For entering data into a statistical data analysis package a file structure is needed that deviates
from nicely designed tables. Usually substantial data editing is necessary before being able to
start any data analysis. Especially with large data tables there is a high risk that errors may be
introduced during data editing. Before starting with any data editing, a copy of the original file
as received from the laboratory needs to be safely stored to be able to check that any questions
that may arise during data analysis are not related to mistakes introduced during data editing.
This file should never be modified in any way. A data file usually contains important auxiliary
information (e.g., measurement units and techniques) that need to be taken care of. At the same
time variable names should be kept as short as possible. Variable names should not contain any
blanks or any language specific special characters. When working with chemical analyses, it is
important to consider the handling of values below (or above) the detection limit, and missing
values. It is also important to consider how to preserve information on sample preparation,
analytical technique used, detection limit reached and the data unit. Much work can be spared
if a contract is formulated with the laboratory on how the report file for a project should be
constructed.
It is an advantage to work with simple standardised file formats, e.g., “csv” (comma
separated values) or “txt” because these are independent of the computer’s operating system.
3 Graphics to Display the Data Distribution

As the Kola Project progressed, chemical analyses for c. 50 chemical elements in four differ-
ent sample materials collected at about 600 sample sites were received. These needed to be
summarised, compared and mapped; now we enter the realms of statistical data analysis. For
data collected in space, two types of distribution need to be considered: the spatial distribution
of the data in the survey area and the statistical data distribution of the measured values. At
the beginning of data analysis the focus is on the statistical distribution, temporarily setting
the spatial component aside (see Chapter 5, where the spatial data distribution is studied).
The statistical data distribution can be explained using some small artificial data sets. The
data are simply plotted against a straight line as a scale for the data values. Depending on the
data, the distribution of the values can look symmetrical (Figure 3.1, upper left, the data are
distributed symmetrically around the value 20). The data could fall into two groups, i.e. be
bimodal, and show a gap around the value 20 (Figure 3.1, upper right). The data could show a
single extreme outlier (Figure 3.1, lower left). The data could show an asymmetrical distri-
bution with a high density of points at the left side and a decreasing density towards the right
side (Figure 3.1, lower right). All these different characteristics, or a mixture thereof, define
the statistical data distribution. For further statistical treatment of the data it is essential to get
a good idea about the data distribution. Many statistical methods are, for example, based on
the assumption of a certain model for the data distribution (see Section 4.1, Figure 4.1 for a
number of different model distributions).
With more than 600 measurements it will not be sufficient to simply plot the measured
values along an axis as in Figure 3.1, because many data values will plot at the same point.
Other graphics are needed to visualise the data distribution. These will be introduced on the
following pages.

3.1 The one-dimensional scatterplot

Plotting the data along a straight line works fine as long as the data set is small and the data
do not plot too close to, or on top of, one another. Figure 3.2, upper diagram, demonstrates
the problem using the analytical results of Sc in the Kola C-horizon. Many more data points
may be hidden behind one point plotted along the line. To view samples that have the same

Figure 3.1 Some possible data distributions. Upper left: symmetric; upper right: bimodal with a gap
around 20; lower left: left skewed; lower right: right skewed

value it is, however, possible to add a second dimension and add such values as additional
symbols against the y-axis (Figure 3.2 middle, stacked scatterplot). With many values at the
same position the y-axis starts to dominate the plot. With some further modifications the
one-dimensional scatterplot is obtained (see, e.g., Box et al., 1978); an informative and simple

Figure 3.2 Evolution of the one-dimensional scatterplot demonstrated using Sc as measured by instrumental neutron activation analysis (INAA) in the samples of the Kola C-horizon [mg/kg]; panels from top to bottom: points overplotted, points stacked, points jittered

graphic to study the data distribution. This plot is typically displayed as an elongated rectangle.
Each value is plotted at its correct position along the x-axis and at a position selected by chance
(according to a random uniform distribution) along the y-axis (Figure 3.2, lower diagram). This
simple graphic can provide important insight into structure in the data.
In Figure 3.2 (stacked and one-dimensional scatterplot) a significant feature is apparent that
would be important to consider if this variable were to be used in a more formal statistical
analysis. The data were reported in 0.1 mg/kg steps up to a value of 10 mg/kg and then rounded
to full 1 mg/kg steps – this causes an artificial “discretisation” of all data above 10 mg/kg.
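
A jittered one-dimensional scatterplot of this kind can, for example, be drawn with stripchart(); a minimal sketch, assuming the Sc values are available in a vector sc:

stripchart(sc, method = "jitter", jitter = 0.4, pch = 1,
           xlab = "Sc in C-horizon [mg/kg]")
# method = "stack" produces the stacked variant shown in the middle panel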

3.2 The histogram

One of the most frequently used diagrams to depict a data distribution is the histogram. It is
constructed in the form of side-by-side bars. Within a bar each data value is represented by
an equal amount of area. The histogram shows at a glance whether a
distribution is symmetric (i.e. the same shape on either side of a line drawn through the centre
of the histogram) or whether it is skewed (stretched out on one side – right or left skewed). It is
also readily apparent whether the data show just one maximum (unimodal) or several humps
(multimodal distribution). The parts far away from the main body of data on either side of the
histograms are usually called the tails. The length of the tails can be judged. The existence or
non-existence of straggling data (points that appear detached from the main body of data) at
one or both extremes of the distribution is also visible at one glance. The Kola C-horizon data
set provides some good examples.
Figure 3.3 shows four example histograms plotted for the variable Ba (aqua regia extraction)
from the Kola C-horizon data set. The x-axis is scaled according to the range of the data, the
starting point is usually a “nice looking” value slightly below the minimum value. Intervals
along the x-axis are adapted to the number of classes it is required to display, and the number
of classes is chosen such that the whole data range of the variable is covered (several rules of
thumb exist for the “optimum” interval length or number of classes; one of the easiest is √n
for the number of classes, where n is the number of individuals/samples in the data set). The
y-axis shows the number of observations in each class or, alternatively, the relative frequency
of values in percent.
In theory, when dealing with a very large data set, the length of the intervals could be
decreased so much by increasing the number of classes that the typical histogram steps disap-
pear. This results in a plot of a smooth function, the density function. This smooth function is
thought to represent the distribution from which the data are sampled by chance. If the data
were drawn from a normal distribution the density function would take on the classical bell
shape (see Figure 4.1).
One situation that often arises with environmental (geochemical) data is that the distributions
are strongly right-skewed. In addition, extreme data outliers occur quite frequently. In such
cases a histogram plotted with a “linear” scale may appear as a single bar at the left hand side
and a far outlier at the right hand side of the graphic. Such histograms contain practically no
information of value about the shape of the distribution (see upper left histogram, Figure 3.3).
Statisticians will usually solve such problems via scaling the data differently. In the case at hand
it appears advisable to reduce the influence of the high values and to focus on the main body of
data that lie in the first bar of the histogram. One solution that meets these requirements is to
scale the data logarithmically (lower right histogram, Figure 3.3). This re-scaling is called log-
[Figure 3.3: four histogram panels of Ba in C-horizon [mg/kg] against frequency; the truncated panels carry the annotation “Maximum = 1300 mg/kg”]

Figure 3.3 Histograms for Ba (aqua-regia extraction) from the C-horizon of the Kola data set. Upper
left: original data; upper right: truncated data; lower left: truncated data as upper right but with double
the number of classes; and lower right: log-transformed data. The lower right histogram has two scales
to demonstrate the logarithmic transformation, the upper scale indicating the logarithms and the lower
scale the original untransformed data

transformation of the data; a variety of data transformations are used in statistics to improve the
visibility (or more generally, the behaviour) of the data (see Chapter 10). In geochemistry the
log-transformation is the one most frequently used. Note that in the example
the log-transformation results in an almost symmetrical distribution of the strongly right-
skewed original data and that the range (i.e. maximum value − minimum value) of the data is
drastically reduced because the influence of the outliers has been reduced. Plotting a histogram
of the log-transformed values has, however, the disadvantage that the direct relation to the
original data is missing (upper scale on the x-axis). This can be overcome by using a logarithmic
scale for the x-axis (lower scale on the x-axis), this is the procedure adopted by DAS+R.
To permit scaling in the original data units, which are easier to read than a log-scale, it is also
possible to plot a histogram for a certain part of the data only, e.g., for a certain data range (upper
right histogram, Figure 3.3, range 0 to 250 mg/kg). In that case the outliers are not displayed;
and the “real” maximum value should be indicated in the plot so that an unsuspecting reader
does not gain an erroneous impression about the complete data distribution from the truncated
histogram. Plotting more classes will result in a better resolution of the distribution (lower left)
and will often considerably change the histogram’s shape. Plotting too many classes will result
in ugly gaps in the histogram.

The choice of starting point and class interval will substantially influence the appearance
of the resulting histogram. This is shown in the lower left, where the only difference from
the histogram in the upper right position is that the number of classes was increased from
13 to 26 (according to the simple √n rule of thumb, 25 would be the “optimum” number
of classes for the Kola Project C-horizon data). While the lower right histogram could be
taken as an indication that a lognormal distribution is present (i.e. the log-transformed data
follow a normal distribution), the more detailed histogram suggests that this may be in fact
a multimodal distribution, approaching symmetry if log-transformed. The histogram could
clearly be misused to demonstrate a “lognormal” distribution by reducing the number of classes
(a statistical test of the distribution – see Chapter 9 – will indicate that even the log-transformed
data for Ba do not follow a normal distribution). It may also be possible that the spikes in the
distribution only appear because too many classes were chosen for the number of samples –
the spikes might then be artefacts of the way the data were reported by the laboratory, e.g., to
the nearest 1, 2, 5, or 10 mg/kg. This demonstrates how important the choice of the optimum
interval length (number of classes) is when constructing histograms. Modern software packages
will automatically use more sophisticated mathematical models than the √n rule of thumb for
that purpose (see, e.g., Venables and Ripley, 2002).
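
A minimal sketch of such histograms in R, assuming the Ba values are available in a vector ba (note that hist() treats a single number given as breaks only as a suggestion):

k <- ceiling(sqrt(length(ba)))   # sqrt(n) rule of thumb for the number of classes
hist(ba, breaks = k, xlab = "Ba in C-horizon [mg/kg]", main = "")
hist(log10(ba), breaks = k, xlab = "log10(Ba) in C-horizon [mg/kg]", main = "")
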
It can be concluded that by studying or displaying just one histogram, important aspects of
the distribution may be missed. It may thus be necessary to plot a series of histograms for a
variable (as above), or the histogram should be augmented with some other graphics displaying
the data distribution and showing additional features.
One possibility is to combine the histogram with the one-dimensional scatterplot
(Figure 3.4).

Figure 3.4 One-dimensional scatterplot for Sc (total concentrations as determined by INAA) in the
Kola C-horizon data set, combined with the histogram

3.3 The density trace

The density trace (kernel density estimate – Scott, 1992) is on first inspection a “smoothed line,
tracing the histogram” (Figure 3.5). It is another approximation of the underlying density func-
tion of the data. Each point along the curve is calculated from data within a defined bandwidth
using a “weight function” (in most packages, including R, these parameters are chosen by
default but can be changed manually). Choice of bandwidth and weight function will crucially
determine how the final density trace appears (compare Figure 3.5 density trace upper right with
density trace lower left). Although at first glance the density trace is a more “objective” graphic
to display the data distribution than the histogram, it can also be manipulated substantially. Den-
sity traces are better suited for comparing data distributions than histograms because they can
easily be plotted on top of one another (e.g., in different line styles or colours – see Chapter 8).
The plots (Figure 3.5) are for the distribution of Ba from the Kola C-horizon data set previ-
ously displayed as histograms (Figure 3.3). Compared to the histogram not much information is
gained. More or less the same problems are experienced as above when plotting the histograms.
Note that for plotting the density trace of the log-transformed data in the lower right display,
a log-scale was used to preserve the direct relation to the original data range (Figure 3.5).
A density trace can also be combined with a histogram to give a more realistic impression
of the data distribution.
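
A minimal sketch of a density trace drawn on top of a histogram, again assuming a vector ba with the Ba values; the manually chosen bandwidth of 3 is only an illustration:

hist(ba, breaks = 25, freq = FALSE,   # freq = FALSE scales the bars to densities
     xlab = "Ba in C-horizon [mg/kg]", main = "")
lines(density(ba), lwd = 2)           # density trace with the default bandwidth
lines(density(ba, bw = 3), lty = 2)   # density trace with a manually chosen bandwidth
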
[Figure 3.5: four density-trace panels of Ba in C-horizon [mg/kg]; panel annotations: “Bandwidth = 8.4 (default)”, “Bandwidth = 0.1 (default)”, “Bandwidth = 3”, “Maximum = 1300 mg/kg”]

Figure 3.5 Density traces for Ba (aqua regia extraction) from the C-horizon of the Kola data set.
Compare to Figure 3.3

3.4 Plots of the distribution function

Plots of the distribution function, e.g., the cumulative probability plot, were originally intro-
duced to geochemists by Tennant and White (1959), Sinclair (1974, 1976) and others. They
are one of the most informative graphical displays of geochemical distributions. There exist
several different plots of the distribution function that all have their merits.

3.4.1 Plot of the cumulative distribution function (CDF-plot)


Histogram and density trace are based on the density function. However, the percentage of
samples plotting above or below a certain data value x could be of interest. The percentage
equals an area under the curve of the density function. Percentages can also be expressed as
probabilities, i.e. the probability that a value smaller than or equal to the chosen data value x1
(Figure 3.6) appears is p1 . This probability can be taken as a point in a new plot, where the
data values along the x-axis are plotted against probabilities along the y-axis (Figure 3.6). By
varying the chosen value x, additional probabilities are obtained that can then be drawn as new

Figure 3.6 Construction of the cumulative distribution function (CDF-plot). The graphic to the left
shows the density function and the graphic to the right shows the resulting cumulative distribution
function

points in the right hand plot for the corresponding x values (e.g., p2 for the data value x2 ). If
this is carried out for the whole x-range, a new plot is generated, the cumulative distribution
function or CDF-plot (right hand diagrams).
If the data follow a normal distribution (bell shape in the left hand plot), the distribution
function will have a typical symmetrical, sigmoidal S-shape (right hand plot).

3.4.2 Plot of the empirical cumulative distribution function (ECDF-plot)


Just like trying to approach the density function via plotting histograms or density traces,
it is possible to approach the distribution function by displaying the empirical cumulative
distribution function (the ECDF-plot). The ECDF-plot is a discrete step function that jumps
at each data value by 1/n, where n is the number of data points. As n becomes increasingly
large, approaching infinity, this function will approximate the underlying distribution function.
The ECDF-plot shows the variable values along the x-axis (either scaling according to
non-transformed data or following a transformation, e.g., logarithmic). The y-axis shows the
probabilities of the empirical cumulative distribution function between 0 and 1 (which could
also be expressed as percentages 0–100 percent).
As an example, the gold (Au) results for the Kola Project C-horizon soils are displayed
(Figure 3.7). The histogram and density trace demonstrate that the distribution is not nor-
mal but extremely right skewed; as a result, the typical S-shape mentioned above for the
CDF-plot (Figure 3.7, upper right) is not present. Two very high values dominate the plot.
The ECDF-plot of the original data is still informative because it graphically displays the
distance of these two high values from the main body of the data. However, to provide a
more useful visualisation of the main body of data, a log-transformation should be applied.
Following log-transformation, the histogram and density trace still show a slight right skew
(Figure 3.7, middle left). The resulting ECDF-plot begins to display an S-shape (Figure 3.7,
middle right), however, the right skew is still clearly reflected. Instead of using a log-transform,
the plotting range of the data could be limited to focus on the main body of data (Figure 3.7,
lower left and right). This permits the study of the main body of data in far greater detail
than in the upper plot. Displaying the data in this manner requires that the viewer be re-
minded that the diagrams are displaying only a limited part of the range of the complete data
set.
One of the main advantages of the ECDF-plot is that every single data point is visible. A
geochemist would now start to search for any unusually high (or low) values and breaks in the
distribution. Very high values might, for example, indicate a mineral deposit or anthropogenic
contamination source. A break in the distribution could be caused by the presence of different
natural factors like geology, weathering and climate, or different contamination sources influ-
encing the data. In a large regional survey, the data may reflect both multiple natural processes
and anthropogenic sources. The ECDF-plot can be used with advantage to identify classes for
geochemical mapping that have a direct relation to the underlying statistical data distribution
via assigning class boundaries to these breaks (see Chapters 5 and 7).
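
A minimal sketch of ECDF-plots in R, assuming the Au values (in µg/kg) are available in a vector au:

plot(ecdf(au), xlab = "Au [µg/kg]", ylab = "Probability", main = "")
plot(ecdf(log10(au)), xlab = "log10(Au) [µg/kg]", ylab = "Probability", main = "")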

3.4.3 The quantile-quantile plot (QQ-plot)


It is often quite difficult to judge whether the S-shape as displayed in the ECDF-plot indicates a
normal or a lognormal (or some other) distribution. When it is necessary to judge the underlying
distribution of the empirical data, a different plot is required, and it is advantageous to change
[Figure 3.7: six panels of Au [µg/kg]; the truncated lower panels carry the annotation “Maximum = 147.5 µg/kg”]

Figure 3.7 Combination of histogram, density trace, and one-dimensional scatterplot (left hand side)
and ECDF-plot (right hand side) used to study the distribution of Au in the Kola Project C-horizon soil
data set. Upper diagrams: original scale, middle diagrams: log-scale, lower diagrams: truncated plotting
range

the y-axis scaling while keeping the x-axis the same. Deviations from an expected distribution
are easiest to detect as departures of the points from a straight line. Thus the best approach is to
change the y-axis in such a way that the plotted points fall on a straight line if they follow
the assumed distribution (normal, lognormal, or any other). To achieve this, the cumulative
distribution function must be transformed. A non-linear transformation, the inverse of the
expected distribution function of the y-axis, is used for this purpose. The values of the inverse

Figure 3.8 QQ-plots for Au in the Kola Project C-horizon soil data set as shown in Figure 3.6. Left
hand side: x-axis non-transformed data, y-axis quantiles of the lognormal distribution; right hand side:
x-axis log-transformed data, y-axis quantiles of the normal distribution

of the expected distribution function are called quantiles. The sorted data values along the
x-axis can be considered as the quantiles of the empirical data distribution.
Quantiles are the linear measure underlying the non-linear distribution of the probabilities.
Quantiles are expressed as positive and negative numbers; they can be compared to standard
deviation units (see Chapter 10). This plot is called the quantile-quantile (QQ-) plot because
quantiles of the data distribution are plotted against quantiles of the hypothetical normal or
lognormal distribution. Figure 3.8 shows the QQ-plot for the Kola C-horizon Au data. In the
left hand diagram the original data are plotted along the x-axis. In the right hand diagram the
log-transformed values are plotted along the x-axis to reduce the impact of the two extreme
values. By changing the scaling of the y-axis, it is possible to check the distribution for log-
normality in both diagrams. When plotting the original data, the y-axis is scaled according
to the quantiles of the lognormal distribution. When plotting the log-transformed data, the
y-axis is scaled according to the quantiles of the normal distribution. This is possible because
a variable follows a lognormal distribution when its logarithm follows a normal distribution. For
the standard normal distribution, 0 corresponds to the MEDIAN and the RANGE [−1, 1] covers
approximately the inner two-thirds of the data distribution. Unfortunately, the construction and interpretation
of the QQ-plot for different data distributions requires statistical knowledge about these data
distributions and their quantiles that is far beyond the scope of this book. The interested reader
can consult Chambers et al. (1983).
In general, when checking for other distributions (e.g., Poisson or gamma distributions), the
x-axis is always scaled according to the original data, and only the y-axis is changed according
to the quantiles of the expected distribution.
When plotting empirical data distributions in the QQ-plot, it may still be quite difficult to
determine whether the data points really follow a straight line. A simple and robust way to plot
a straight line into the diagram is to connect the first and third quartiles of both axes. In addition
to the straight line, 95 percent confidence intervals around that line can also be constructed.
The confidence interval encloses 95 percent of the data points that could have been drawn
from the hypothetical distribution. These limits can be used to support a graphical decision

Figure 3.9 QQ-plots as shown in Figure 3.8 with a straight line indicating the hypothetical distribution
and the 95 percent confidence intervals as dashed lines

as to whether the empirical data originate from the hypothetical distribution. The width of
the confidence interval varies because the “allowed” deviation of data points from the straight
line depends on the number of samples. In the example plot (Figure 3.9) the upper tail of the
distribution clearly deviates from log-normality.
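
A minimal sketch of such a QQ-plot in R, assuming the Au values are available in a vector au; datax = TRUE places the data along the x-axis as in Figures 3.8 and 3.9:

qqnorm(log10(au), datax = TRUE, main = "",
       xlab = "log10(Au) [µg/kg]",
       ylab = "Quantiles of standard normal distribution")
qqline(log10(au), datax = TRUE)   # line through the first and third quartiles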

3.4.4 The cumulative probability plot (CP-plot)


The examples above demonstrate that a scale forcing the points in the plot to follow a straight
line is useful. Because in the QQ-plot the scaling of the y-axis is different from distribution
to distribution, it would be much easier if the y-axis were expressed in probabilities. In
pre-computer times such a plot was constructed on probability paper especially de-
signed for a normal (or lognormal) distribution. This procedure was originally introduced to
geochemists by Tennant and White (1959), Sinclair (1974, 1976) and others. The graph in the
plot is exactly the same as in the QQ-plot. When using a computer, it is no longer necessary to
limit the QQ-plot to a normal (lognormal) distribution. Any other distribution could be intro-
duced for scaling the y-axis. If the scale on the y-axis is expressed in probabilities rather than in
quantiles, the plot is generally named the cumulative probability plot (CP-plot) (Figure 3.10).
Note that when checking for normality, the probabilities as expressed along the y-axis can
never reach zero or one because these values would correspond to quantiles of minus infinity
or plus infinity, respectively.
The CP-plot with log-scale for the data (Figure 3.10, right hand side) is especially use-
ful because it allows direct visual estimation of the MEDIAN (50th percentile) or any
other value from the x-axis or the percentage of samples falling below or above a cer-
tain threshold (e.g., a maximum admissible concentration (MAC)) from the y-axis. It also
allows the assigning of a percentage to any break in the curve; in the example several
breaks in the Au data are visible; the first one occurs at about 85 per cent (c. 1.5 µg/kg
Au), the next at 95 per cent (c. 3 µg/kg Au) of the data. Just as in the QQ-plot, the
straight line to judge whether the empirical data follow the hypothetical distribution can be
shown.
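
In R a CP-plot can be sketched as a normal QQ-plot whose quantile axis is relabelled with probabilities; a minimal sketch, again assuming a vector au:

p <- c(0.001, 0.01, 0.05, 0.2, 0.5, 0.8, 0.95, 0.99, 0.999)
qqnorm(log10(au), datax = TRUE, yaxt = "n", main = "",
       xlab = "log10(Au) [µg/kg]", ylab = "Probability [%]")
axis(2, at = qnorm(p), labels = 100 * p)   # probability labels at the normal quantiles
qqline(log10(au), datax = TRUE)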

Figure 3.10 CP-plots for Au in the Kola Project C-horizon soil data set as shown in Figures 3.6, 3.7
and 3.8. Left hand side: x-axis non-transformed data, y-axis probabilities in per cent; right hand side:
x-axis log-transformed data, y-axis probabilities in per cent

3.4.5 The probability-probability plot (PP-plot)


Yet another version of these diagrams is the probability-probability (PP-) plot. Instead of
plotting quantiles of the hypothetical distribution against the quantiles of the data distribution
at fixed probabilities (QQ-plot), the probability of the hypothetical distribution is plotted against
the probability of the empirical data distribution at fixed quantiles (PP-plot). The advantage
of the PP-plot is that while the QQ-plot and the CP-plot can be dominated by extreme values,
these cannot dominate the PP-plot because of their low probability. The PP-plot will thus focus
the attention on the main body of data. In the example plot (Figure 3.11) an additional flexure

Figure 3.11 PP-plots for Au in the Kola Project C-horizon soil data set as shown in Figures 3.7, 3.8, 3.9,
and 3.10. Left hand side x-axis: probabilities referring to the non-transformed data, y-axis: probabilities
for lognormal distribution; right hand side x-axis: probabilities referring to the log-transformed data,
y-axis: probabilities for the normal distribution
at about 20 percent of the Au distribution becomes apparent. It is visible as a weak flexure
in the ECDF-plot (Figure 3.7) at 0.5 µg/kg Au. The main disadvantage of the PP-plot is that
the relation to the original data is completely lost, whereas it is retained in the ECDF- and
CP-plot. It is of course possible to use the PP-plot in combination with the CP-plot to identify
the data value for 20 percent. A combination of CP-plot and ECDF- or PP-plot may thus be
quite powerful in obtaining a more complete picture of the data distribution. Single extreme
values are of course hardly visible in the PP-plot though they are in the ECDF-plot. Just as in
the QQ- and CP-plot, a straight line can be introduced in the PP-plot to check for agreement
with a hypothetical distribution.
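
A minimal sketch of a PP-plot against a normal distribution fitted to the log-transformed data; au is again an assumed data vector:

x <- sort(log10(au))
p_emp <- ppoints(length(x))                    # empirical cumulative probabilities
p_hyp <- pnorm(x, mean = mean(x), sd = sd(x))  # probabilities under the fitted normal
plot(p_emp, p_hyp, xlab = "Probability of log10(Au)",
     ylab = "Probability of normal distribution")
abline(0, 1)                                   # line of perfect agreement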

3.4.6 Discussion of the distribution function plots


Depending on the empirical data distribution of a given variable, the different versions of
these plots all have their merits, especially when the task is to detect fine structures (breaks
or flexures) in the data. If only one plot is to be presented, it is advisable to look first at the
different possibilities and then select the most informative for the variable under study because
this will depend on the actual data distribution of each variable. In applied geochemistry the
CP-plot with logarithmic x-axis is probably the most frequently used (Figure 3.10, right).
In combination with the less frequently used ECDF-plot and the almost never used PP-plot
(Figure 3.11, right), it holds the potential to provide a very realistic picture of the complete
data distribution.
As mentioned above, one of the main advantages of these diagrams is that each single data
value remains observable. The range covered by the data is clearly visible, and extreme outliers
are detectable as single values. It is possible to directly count the number of extreme outliers
and observe their distance from the core (main mass) of the data.
When looking at a selection of these plots for As (Figure 3.12), several data quality
issues can be directly detected; for example, the presence of discontinuous data values at the
lower end of the distribution. Such discontinuous data are often an indication of the method
detection limit or too-severe rounding, discretisation, of the measured values reported by the
laboratory. The PP-plot corresponding to the log-transformed data shows most clearly how
serious an issue this data discretisation due to rounding of the values by the laboratory can
become.
Values below the detection limit, set to some fixed value, are visible as a vertical line at
the lower end of the plots, and the percentage of values below the detection limit can be
visually estimated. The CP-plot with logarithmic scale (middle right figure) displays this best.
The detection limit for As was 0.1 mg/kg; about two per cent of all values plot below the
detection limit (Figure 3.12). From 0.1 to 1 mg/kg the As values were reported in 0.1 mg/kg
steps – obviously a too-harsh discretisation for the data at hand, causing artificial data structures
(Figure 3.12). The presence of multiple populations results in slope changes and breaks in the
plots (Figure 3.12).

3.5 Boxplots

The boxplot is one of the most informative graphics for displaying a data distribution. It is
built around the MEDIAN (see Chapter 4), which divides any data set into two equal halves.

Figure 3.12 Six different ways of plotting distribution functions. Upper row: empirical cumulative
distribution function plots (ECDF-plots); middle row: cumulative probability plots (CP-plots); lower
row: probability-probability plots (PP-plots). Left half of diagram: data without transformation; right
half of diagram: plots for log-transformed data

3.5.1 The Tukey boxplot


Tukey (1977) introduced the boxplot to exploratory data analysis. The construction of the
Tukey boxplot is best demonstrated using a simple sample data set, consisting of only nine
values:
2.3 2.7 1.7 1.9 2.1 2.8 1.8 2.4 5.9.

The data are sorted to find the MEDIAN:


1.7 1.8 1.9 2.1 2.3 2.4 2.7 2.8 5.9.
After finding the MEDIAN (2.3), the two halves (each of the halves includes the MEDIAN)
of the data set are used to find the “hinges”, the MEDIAN of each remaining half:
1.7 1.8 1.9 2.1 2.3 2.4 2.7 2.8 5.9.
These upper and lower hinges define the central box, which thus contains approximately
50 percent of the data. In the example the “lower hinge” (LH) is 1.9, the “upper hinge” (UH)
is 2.7. The “inner fence”, a boundary beyond which individuals are considered extreme values
or potential outliers, is defined as the box extended by 1.5 times the length of the box towards
the maximum and the minimum. This is defined algebraically, using the upper whisker as an
example, as

Upper inner fence (UIF) = UH(x) + 1.5 · HW(x).


Upper whisker = max(x[x ≤ UIF]).

where HW (hinge width) is the difference between the hinges (HW = upper hinge − lower hinge),
approximately equal to the interquartile range (depending on the sample size), i.e. Q3 − Q1
(75th − 25th percentile); and the square brackets, [...], indicate the subset of values that meet
the specified criterion.
The calculation is simple for the example data:

Hinge width, HW = UH − LH = 2.7 − 1.9 = 0.8.


Lower inner fence, LIF = LH − (1.5 · HW) = 1.9 − (1.5 · 0.8) = 0.7.
Upper inner fence, UIF = UH + (1.5 · HW) = 2.7 + (1.5 · 0.8) = 3.9.

By convention, the upper and lower “whiskers” are then drawn from each end of the box to
the furthest observation inside the inner fence. Thus the lower whisker is drawn from the box
to a value of 1.7, the lower whisker and minimum value are identical, and the upper whisker is
drawn to the value of 2.8. Values beyond the whiskers are marked by a symbol; in the example
the upper extreme value (5.9) is clearly identified as an extreme value or data outlier (see
Chapter 7) and is at the same time the maximum value.
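
These quantities can be checked in R with fivenum() and boxplot.stats(); a minimal sketch for the nine-value example:

x <- c(2.3, 2.7, 1.7, 1.9, 2.1, 2.8, 1.8, 2.4, 5.9)
h <- fivenum(x)                 # minimum, lower hinge, MEDIAN, upper hinge, maximum
hw <- h[4] - h[2]               # hinge width: 2.7 - 1.9 = 0.8
c(LIF = h[2] - 1.5 * hw, UIF = h[4] + 1.5 * hw)   # inner fences: 0.7 and 3.9
boxplot.stats(x)$out            # values beyond the whiskers: 5.9
boxplot(x, horizontal = TRUE)   # Tukey boxplot of the example data
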
Figure 3.13 shows a “classical” Tukey boxplot for Ba in the Kola C-horizon samples. No
lower extreme values or outliers are identified. The lower inner fence is lower than the minimum
value, and thus the lower whisker terminates at the location of the minimum value. The upper
inner fence and termination of the upper whisker fall together in this example, and all values
to the right of the upper whisker are identified as upper extreme values or outliers.
In summary, the Tukey boxplot – one of the most powerful EDA graphics – shows in
graphical form:

• the “middle” (MEDIAN) of a given data set, identified via the line in the box;
• spread (see Chapter 4) by the length of the box (the hinge width);
• skewness (see Chapter 4) by the symmetry of the box and whisker extents about the median line in the box;
• kurtosis (see Chapter 4) by the length of the whiskers in relation to the width of the box;
• the existence of extreme values (or outliers – see Chapter 7), identified by their own symbol.
[Figure 3.13 annotations: extreme values or outliers, maximum, upper inner fence, upper whisker, upper hinge, MEDIAN, lower hinge, lower whisker, lower inner fence, minimum; x-axis Ba [mg/kg], 0–1200]

Figure 3.13 Tukey boxplot for Ba in the Kola C-horizon soil samples

Furthermore, because the construction of the boxplot is based on quartiles, it will not be
seriously disturbed by up to 25 percent of “wild” data at either end of the distribution. It is not
even seriously influenced by widely different data distributions (Hoaglin et al., 2000).

3.5.2 The log-boxplot


It is important to recognise that the calculation of the whiskers in the above formula assumes
data symmetry, lack of which is easily recognised by the median line not being close to the
middle of the box. The recognition of extreme values and outliers is based on normal theory,
and the standard deviation (SD) of the distribution is estimated via the hinge width (HW)
or interquartile range (IQR). So for the estimate of SD to be appropriate, there has to be
symmetry in the middle 50 percent of the data (see Chapter 4). Thus for strongly right-skewed
data distributions, as frequently occur in applied geochemistry, the Tukey boxplot based on
untransformed data will tend to seriously underestimate the number of lower extreme values
and overestimate the number of upper extreme values.
Figure 3.13 demonstrates that the boxplot detects a high number of upper extreme values
for the Ba data and no lower extreme values. The reason is the right-skewed
data distribution, indicated by the MEDIAN falling only one-third of the hinge width (HW)
above the lower hinge (LH). This feature was also apparent in the histogram and density
trace (Figures 3.3 and 3.5). These figures demonstrate that the Ba distribution approaches
symmetry when the data are log-transformed. The Tukey boxplot of the log-transformed data
will thus be suitable for providing a realistic estimate of the extreme values at both ends of the
data distribution. Figure 3.14 shows the Tukey boxplot for the log-transformed Ba data. As
expected, the number of upper extreme values is drastically reduced, and one lower extreme
value is now identified (Figure 3.14, upper diagram). It is of course possible to plot a log-scale
for the original data to regain the desirable direct relationship (Figure 3.14, middle). Because
the boxplot is reliable for symmetric distributions, it is appropriate to calculate the values for
the whiskers for the log-transformed distribution and then back-transform the values for the
fences to the original data scale (Figure 3.14 lower diagram). Note that the MEDIAN and
hinges will not be changed by log- and back-transformation because they are based on order
statistics. This version of the boxplot is called the log-boxplot and should be used when the

Figure 3.14 Boxplots for Ba. Upper diagram: the boxplot of the log-transformed data. Middle diagram:
the same but scaled according to the original data. Lower diagram: log-boxplot of the original data
with whiskers calculated according to the symmetrical log-transformed data and back-transformed to
the original data scale – compare with the Tukey boxplot in Figure 3.13

original data are strongly skewed and it is still desirable to preserve the data scale. Comparison
of Figure 3.13 with Figure 3.14 (lower diagram) demonstrates that the limits of the box are not
changed; however, the extent and position of the whiskers and the number of extreme values
have changed dramatically.
In conclusion, the Tukey boxplot should not be applied to strongly skewed data distributions
without an appropriate data transformation because it will result in a wrong impression about
the upper and lower extreme values. Thus for applied geochemical data, the data distribution
should always be checked for symmetry before drawing either a Tukey boxplot or the log-
boxplot. In the majority of cases the log-boxplot (Figure 3.14, lower plot) will be better suited
to identify the number and position of extreme values.
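A log-boxplot is not a standard base R display, but the principle can be sketched by computing the Tukey statistics on the log-transformed data and back-transforming them before drawing. The sketch below is an illustration only (it is not necessarily the code used to produce Figure 3.14) and again assumes that Ba is a numeric vector of positive concentrations:

logboxplot <- function(x, ...) {
  lx <- log10(x)
  bs <- boxplot.stats(lx)                        # Tukey statistics on the log scale
  out <- x[lx < bs$stats[1] | lx > bs$stats[5]]  # extreme values on the original scale
  z <- list(stats = matrix(10^bs$stats, ncol = 1),   # back-transformed whiskers and hinges
            n = length(x), conf = matrix(10^bs$conf, ncol = 1),
            out = out, group = rep(1, length(out)), names = "")
  bxp(z, horizontal = TRUE, ...)                 # draw with the standard boxplot engine
}
# logboxplot(Ba, xlab = "Ba [mg/kg]")            # compare with Figures 3.13 and 3.14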

If the log-transformed data still deviate from symmetry, versions of the boxplot exist that
can deal with strongly skewed data sets and will still provide useful fences for extreme values
at both ends of the data distribution. They are based on a robust measure of skewness (Section
4.4) for the calculation of the fences for the lower and upper whiskers (Vandervieren and
Hubert, 2004).

3.5.3 The percentile-based boxplot and the box-and-whisker plot


The data symmetry problems with the Tukey boxplot are probably the reason why some workers
prefer to use a modified version, the percentile-based boxplot, where all definitions are based
on percentiles (see Chapter 4): MEDIAN and 25th and 75th percentile for the box and 2nd (or
5th to 20th) and 98th (or 80th to 95th) percentile for the whiskers.
However, when using a percentile-based boxplot, one of the major advantages of the Tukey
boxplot is lost, i.e. the “automatic” identification of extreme values. The percentile boxplot
will always identify a certain percentage of extreme values while it is possible that no extreme
values will be identified with the Tukey boxplot. When studying boxplots, it is essential to be
aware of the exact conditions used for their construction.
Because the Tukey boxplot, log-boxplot, and percentile-based boxplot all look the same to
the unsuspecting reader (compare Figures 3.13, 3.14 and 3.15), it should be good practice to
explain in the figure caption which version of the plot was used.
In the box-and-whisker plot (see, e.g., Garrett, 1988) the whiskers are drawn to stated
percentiles, e.g., the 5th and 95th, and the minima and maxima plotted as crosses. No attempt
is made to identify extreme values. The box-and-whisker plot is simply a graphical summary
of the data based on the order statistics (percentiles) with no assumptions concerning the


Figure 3.15 Percentile-based boxplot using the 5th and 95th (upper boxplot) and 2nd and 98th
percentile (lower boxplot) for drawing the whiskers. Variable Ba, Kola Project C-horizon; compare with
Figures 3.13 and 3.14

[Figure: box-and-whisker plot of Ba [mg/kg], n = 606]

Figure 3.16 Box-and-whisker plot of the variable Ba, Kola Project C-horizon

underlying statistical model. Figure 3.16 is the box-and-whisker plot for Ba in Kola Project
C-horizon soils. The resulting plot is the same as the boxplot shown in Figure 3.15, upper
boxplot, without identifying all outliers or extreme values.
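A percentile-based box-and-whisker plot of this kind can be sketched in base R by supplying the chosen percentiles to the standard drawing engine; the function name, default percentiles, and plotting symbol below are illustrative only:

boxwhisker <- function(x, probs = c(0.05, 0.95), ...) {
  q <- quantile(x, c(probs[1], 0.25, 0.5, 0.75, probs[2]))  # whiskers at chosen percentiles
  z <- list(stats = matrix(q, ncol = 1), n = length(x),
            conf = matrix(q[c(2, 4)], ncol = 1),            # needed by bxp(); only used for notches
            out = range(x), group = c(1, 1), names = "")    # minimum and maximum as extra points
  bxp(z, horizontal = TRUE, outpch = 4, ...)                # pch = 4 draws them as crosses
}
# boxwhisker(Ba, xlab = "Ba [mg/kg]")                       # compare with Figure 3.16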

3.5.4 The notched boxplot


Often the information included in the Tukey boxplot is extended by adding an estimate of the
95 percent confidence bounds on the MEDIAN. This leads to a graphical test of comparability

Figure 3.17 Histogram, density trace, one-dimensional scatterplot, and boxplot in just one display,
combined with the ECDF-plot. Variable Ba, Kola Project C-horizon. Upper diagrams: original data (with
log-boxplot); lower diagrams: log-transformed data

– much like the more formal t-test (see Chapter 9) – of MEDIANS via notches in the boxplot.
This use of boxplots is discussed in Section 9.4.1.

3.6 Combination of histogram, density trace, one-dimensional scatterplot,


boxplot, and ECDF-plot

Several of the plots named so far – one-dimensional scatterplot, histogram, density trace and
boxplot – can advantageously be combined into just one display (Reimann, 1989). Figure 3.17
(left) shows this combined graphic for Ba. The ECDF-plot is another graphic that will reveal
interesting properties of the data distribution (Figure 3.17, right). In contrast to the QQ-, CP-,
or PP-plot, the ECDF-plot is not based on the assumption of any underlying data distribution.
It is thus ideally suited as an exploratory data analysis tool in the first stages of data analysis.
In combination these graphics provide an excellent first impression of the data, as each of
them highlights a different aspect of the data distribution (Figure 3.17). Figure 3.17 shows that
Ba is strongly right skewed for the original data. It displays an almost symmetrical distribution
for the log-transformed data. For the log-transformed data both the one-dimensional scatterplot
and ECDF-plot show some minor disturbances at the lower end of the Ba distribution. For
further work with most statistical methods requiring symmetrical data, the log-transformed


Figure 3.18 Combination of histogram, Tukey boxplot, ECDF-, and CP-plot for the variable Ba, Kola
Project C-horizon

values of this variable should obviously be used. One could also conclude that the log-boxplot
should be used for further work with this variable for a realistic identification of extreme values
whenever the data are kept in the original scale.
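Such a combined display can be approximated with base R graphics alone. The sketch below uses a simplified panel layout (it is not the code behind Figure 3.17) and assumes Ba is a numeric vector of the Kola C-horizon Ba values:

op <- par(mfrow = c(1, 3))
hist(Ba, freq = FALSE, main = "Histogram and density trace", xlab = "Ba [mg/kg]")
lines(density(Ba))                                    # density trace
rug(Ba)                                               # one-dimensional scatterplot
boxplot(Ba, horizontal = TRUE, main = "Boxplot", xlab = "Ba [mg/kg]")
plot(ecdf(Ba), main = "ECDF-plot", xlab = "Ba [mg/kg]", ylab = "Probability")
par(op)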

3.7 Combination of histogram, boxplot or box-and-whisker plot,


ECDF-plot, and CP-plot

A different combination of the plots in this chapter has also proven informative in a single
display (Figure 3.18). Using the data for Ba in Kola C-horizon soils, the histogram in the
upper left is the same as in Figure 3.17. The upper right display is either a Tukey boxplot
or a box-and-whisker plot, the user’s choice. The lower left display is an ECDF-plot and the
lower right a CP-plot. The ECDF-plot and CP-plot permit easy inspection of the middle and
extreme parts of the data distribution, respectively. The choice of Tukey boxplot or box-and-
whisker plot depends on whether the user wishes to identify potential outliers or simply have
an order statistics based replacement for the histogram. The plotting of the Tukey boxplot
above the CP-plot permits easy comparison of the two plots. Figure 3.18 indicates that all plots
are dominated by a few extreme values. Thus the whole display should be plotted using
logarithmic scaling (Figure 3.19).
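A simple approximation of this four-panel display can again be sketched with base R graphics; here the CP-plot is approximated by a normal QQ-plot drawn with the data on the x-axis (qqnorm() labels the vertical axis with normal quantiles rather than probabilities):

op <- par(mfrow = c(2, 2))
hist(log10(Ba), main = "Histogram", xlab = "log10(Ba) [mg/kg]")
boxplot(log10(Ba), horizontal = TRUE, main = "Tukey boxplot", xlab = "log10(Ba) [mg/kg]")
plot(ecdf(log10(Ba)), main = "ECDF-plot", xlab = "log10(Ba) [mg/kg]", ylab = "Probability")
qqnorm(log10(Ba), datax = TRUE, main = "CP-plot (approximation)")
par(op)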


Figure 3.19 The same plot as Figure 3.18 using logarithmic scales

3.8 Summary

When working with a new data file it is highly advisable to study the data distribution in detail
before proceeding to all subsequent statistical analyses. Many statistical methods are based
on assumptions about the data distribution. A documentation of the data distribution will help
to decide which statistical methods are appropriate for the data at hand. A histogram alone is
not sufficient to get a good impression of the data distribution. The impression gained from
histogram and density trace alone depends strongly on the choice of parameters. A combination
of different graphics will often provide greater insight into the data distribution. All variables
should thus be documented in a combination of summary plots, e.g., histogram combined with
density trace, boxplot and one-dimensional scatterplot and ECDF- or CP-plots.
It is advisable to have copies of a number of distribution graphics for all variables under
study at hand for easy reference when more advanced data analysis methods are applied.
4 Statistical Distribution Measures

When studying the data distribution graphically, it is apparent that the distributions of the
variables can look quite different. Instead of looking at countless data distributions, it may
be desired to characterise the data distribution by a number of parameters in a table. What is
required?
First the central value of the distribution needs to be identified, together with a measure
of the spread (variation) of the data. Furthermore, the quartiles and different percentiles of
the distribution may be of interest (i.e. above what value fall the uppermost two, five, or ten
per cent of the data).
When working with “ideal” data, two further measures are often provided in statistical
tabulations: skewness and kurtosis. Skewness is a measure of the symmetry of the data dis-
tribution; kurtosis provides an expression of the curvature – the appearance of the density
trace can be flat or steep. Such “summary values” are frequently used to compare data from
different investigations. For real data there is often the problem that the presence of outliers
and/or multimodal distributions will bias these measures. Both can be easily recognised from
the graphics described in the previous chapter.

4.1 Central value

What is the most appropriate estimator for the central value of a data distribution? What is
actually the central value of a distribution? It could be the “centre of gravity”, it could be the
most likely value, it could be the most frequent value, it could be the value that divides the
samples into two equal halves. Accordingly there exist several different statistical measures of
the central value (location) of a data distribution.

4.1.1 The arithmetic mean


The most frequently used measure is probably the arithmetic mean. Throughout this book we
will use the term MEAN for the arithmetic mean. All values are summed up and divided by
the number of values. If x1 , x2 , . . . , xn denotes the values of the data with n indicating the


number of samples, the simple formula for calculating the arithmetic mean is:
$$\mathrm{MEAN} = \frac{1}{n}\sum_{i=1}^{n} x_i\,.$$

4.1.2 The geometric mean


The geometric mean G is often used with advantage for right-skewed (e.g., lognormal) distri-
butions. G is calculated by taking the n-th (n = number of samples) root of the product of all
the data values. It requires that all values are positive; negative values or zeros in the data set
are not permitted. When dealing with applied geochemical data, this is the usual case: negative
concentrations of a chemical element cannot exist, and the concentration is always conceptu-
ally greater than zero, even if it is so low that it cannot be measured. An exception may be
measurements of organic pollutants that do not occur in nature, where the concentration could
actually be zero. The formula is:

$$G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \quad\text{or}\quad G = \sqrt[n]{\prod_{i=1}^{n} x_i}\,.$$

Using this form of calculation should be avoided as rounding errors may occur due to the
arithmetic precision limitations of computers. A preferable method is to use logarithms, then:
$$\mathrm{mean} = \frac{1}{n}\sum_{i=1}^{n}\log(x_i) \quad\text{and}\quad G = 10^{\mathrm{mean}} \;\;\text{or}\;\; G = e^{\mathrm{mean}},$$

depending on whether logarithms to the base 10 or natural logarithms were used.

4.1.3 The mode


The MODE is the value with the highest probability of occurrence. There is no simple formula
to estimate the mode; however, the mode is often estimated from a histogram or density trace –
the MODE being the value where the histogram or density function shows a maximum.

4.1.4 The median


The MEDIAN divides the data distribution into two equal halves. The data are sorted from the
lowest to the highest value, and the central value of the ordered data is the MEDIAN. In the
case that n is an even number, there exist two central values and the average of these two values
is taken as the MEDIAN. This may best be demonstrated by a simple example:
2.3 2.7 1.9 2.1 1.8 2.4 2.0 5.9.
The data are then sorted:
1.8 1.9 2.0 2.1 2.3 2.4 2.7 5.9.
The two central values are 2.1 and 2.3, and the MEDIAN is (2.1 + 2.3)/2 = 2.2.

For comparison, the MEAN of these eight values is 2.64, the geometric mean, G, is 2.44,
and to estimate a MODE does not make much sense when dealing with so few data. Looking at
the differences between the MEAN, G, and MEDIAN demonstrates that it may be difficult to
define a meaningful central value. These large differences between the estimates of the central
value are caused by the one high value (5.9) that clearly deviates from the majority of data.
When computing the MEAN, each value enters the calculation with the same weight. The high
value of 5.9 thus has a strong influence on the MEAN. The logarithm of the geometric mean
G is the arithmetic mean of the log-transformed data. Log-transforming (base 10) the above
data:
0.26 0.28 0.30 0.32 0.36 0.38 0.43 0.77;
the MEAN of these log-transformed values is 0.39. When this value is transformed back
($10^{0.39}$), the value of the geometric mean G, 2.44, is obtained. Thus the same problem observed
for the MEAN remains: G is attracted by the (still) extreme value of 0.77, though to a lesser
extent due to the nature of the logarithmic transformation. Using the above example, the difference
between the MEDIAN and maximum is 3.7 units, but in logarithmic units the difference is
only 0.43. Clearly, when calculating the arithmetic mean, the maximum value will have less
leverage when using logarithmically transformed data.
Because the MEDIAN is solely based on the sequence of the data values, it is not affected
by extreme values – the extreme value could even be much higher without any influence on
the MEDIAN. The MEDIAN would not even change when not only the largest value but also
the next two lower values were much higher (or the lowest values much lower). For large n the
MEDIAN is resistant to up to 50 per cent of the data values being extreme. Thus the MEDIAN
may be the best measure of the central value if dealing with data containing extreme values.
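In base R the three estimates discussed for the small example can be reproduced directly (the geometric mean is obtained via the log-transformed data, as recommended above):

x <- c(2.3, 2.7, 1.9, 2.1, 1.8, 2.4, 2.0, 5.9)
mean(x)              # arithmetic MEAN, about 2.64
10^mean(log10(x))    # geometric mean G, about 2.44
median(x)            # MEDIAN, 2.2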

4.1.5 Trimmed mean and other robust measures of the central value
To this point four methods of obtaining a central value have been discussed. However, others
exist, such as calculating a trimmed mean. Here a selected proportion of extreme data values
are excluded before calculating the mean. Thus the five per cent trimmed mean would be the
mean of the data between the 5th and 95th percentiles, i.e., the top and bottom five per cent of
the data have been discarded from the calculation. Because the exact proportion that should be
trimmed is unknown, the procedure introduces an amount of subjectivity into computing the
mean. Graphical inspection of the data distribution, e.g., in the CP-plot, can be very helpful in
deciding on the trimming percentage. There are other robust estimators (i.e. estimators that are
less influenced by data outliers and deviations from the model of a normal data distribution) of
the central value, based on the M-estimator (Huber, 1964). Data points that are far away from
the centre of the distribution are down-weighted by these robust estimators. The M-estimator
is less robust than the MEDIAN. However, because the MEDIAN uses less information it is
less precise in estimating the central value of the underlying distribution.
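Both estimators are readily available in R: mean() has a trim argument that removes the stated fraction of the data from each end, and huber() in the MASS package (part of the standard R distribution) is one implementation of a Huber M-estimator of location. With only eight values a five per cent trim removes nothing, so a larger trimming fraction is used here purely for illustration:

x <- c(1.8, 1.9, 2.0, 2.1, 2.3, 2.4, 2.7, 5.9)
mean(x, trim = 0.125)   # trims one value (12.5 per cent) from each end
library(MASS)
huber(x)$mu             # Huber M-estimate of the central value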

4.1.6 Influence of the shape of the data distribution


It has been demonstrated how the different measures of the central value behave when the data
contain one (or several) extreme value(s). How does the shape of the distribution influence the
central value? Figure 4.1 shows, starting with a normal distribution, how the different measures
of the central value depend on the shape of the data distribution.

Figure 4.1 Graphical presentation of a selection of different statistical data distributions (normal,
lognormal, chi-square, Student t, exponential and multimodal) and the location of the four different
measures of the central value discussed in the text

In the case of a normal distribution, all four measures will, theoretically, take the same
value (Figure 4.1), though there may be slight differences depending on the actual data. For a
normal distribution the MEAN is the best (most precise) central value.
For a lognormal distribution important differences occur – the arithmetic mean is strongly
influenced by high values (Figure 4.1). The MEDIAN and G are (theoretically) equal. The
reason is that the logarithm of the lognormal distribution is a normal distribution. Thus the
geometric mean G of the lognormal distribution corresponds to the MEAN of the normal
distribution. For a normal distribution the MEAN is the best measure of the central value. Thus
G can be considered as the best measure of the central value of a lognormal distribution. The

position of the MEDIAN will not be influenced by a logarithmic transformation. Thus we get
the same value for G and MEDIAN. The MODE is lower than all other measures and identifies
the peak of the lognormal distribution. Still, the MODE may not be the best measure of the
central value because far fewer data occur below the mode than above. The MEDIAN still
separates the data into two equal halves and is, together with the geometric mean G, a good
measure of the central value for a lognormal distribution. The most precise measure is again
the MEAN, calculated for the log-transformed data and then back-transformed to the original
data scale.
The Student t distribution is a symmetrical distribution with a different shape than the normal
distribution (Student (W.S. Gosset), 1908). The likelihood that it contains values that are far
away from the centre of the distribution is much higher; it has “heavy tails”. The “heaviness” of
the tails depends on a property named the degrees of freedom (df) (see, e.g., Abramowitz and
Stegun, 1965). A small value of df results in very heavy tails. With ever increasing df the shape
of the normal distribution is approached. Because the distribution is symmetrical, in theory all
four measures of the central value are again equal (Figure 4.1). However, in practice the Student
t distribution describes data with many extreme values, and thus the measures can strongly
deviate from each other. Of the measures discussed here, the MEDIAN is the only reliable
measure of the central value because it is resistant to a high proportion of extreme values.
The chi-square distribution is a right-skewed distribution (see, e.g., Abramowitz and Stegun,
1965). As for the t distribution, the parameter df (degrees of freedom) determines the shape.
With increasing df the distribution becomes more and more symmetric and will in the end
approximate the normal distribution. All four measures of the central value will usually differ
(MODE < G < MEDIAN < MEAN – Figure 4.1). The MEAN is influenced by the high
values (right-skewed distribution). The geometric mean G will be influenced by the low values
because the log-transformed values of a chi-square distribution are left skewed. The MODE
identifies the peak, but in a right-skewed distribution fewer values will occur below the MODE
than above. The MEDIAN is thus the best measure of the central value.
The exponential distribution is again right skewed, and the same sequence of the measures
of the central value as for the chi-square distribution is observed. The reasons are the same.
Here the MODE is zero and uninformative. Again the MEDIAN is the best measure of the
central value.
A multimodal distribution has more than one peak. All four measures of the central value
will generally provide different locations. In the case shown in the example plot (Figure 4.1),
it is very difficult to decide which measure is the best indication of the central value because
three data populations are clearly present where the distribution functions are superimposed on
one-another. It would be best to separate the three distributions and estimate the central value
of each distribution individually. This is impractical because of the extent of the area of overlap
and one central value for the whole distribution has to be estimated, even knowing that the
central value will not be meaningful for any of the three underlying distributions. The MODE
will be meaningful as a central value for one population. MEAN and G can both be strongly
influenced by outliers or skewness. The MEDIAN still divides the resulting total distribution
into two equal halves and may thus again be the best measure of the central value. Note that it
is impossible to transform this distribution to approach symmetry.
In conclusion, it can be stated that the MEDIAN is the most suitable measure of the
central value when dealing with distributions with different shapes (i.e. when working with
real data). It is also preferable because it is robust against a sizeable proportion of extreme
values.

4.2 Measures of spread

The central value can be used to compare data. However, even with the same central value,
the data can exhibit completely different distributions. An additional measure of variation is
required for a better description of the data.
How can the variation of data be characterised? Statisticians define the variance as the
average squared deviation from the central value. For practical purposes variance may not be
the most usable measure because it is expressed in squared units. A measure expressed in the
same unit as the data may thus be desired. Such a measure is called a “measure of spread”
rather than the measure of variance.
One possibility to estimate a measure of spread is to look at the interval into which the data
fall, i.e. maximum value minus minimum value. To make this measure more robust against
(i.e. less influenced by) extreme values one could choose to look at the interval defined by
the “inner” 50 per cent of the data. Another possibility is to consider the variation of the data
points around the central value.

4.2.1 The range


The RANGE is defined as maximum (MAX) minus the minimum (MIN) and thus gives directly
the interval length from the smallest to the largest values:

RANGE = MAX − MIN.

However, real data often contain extreme values (data outliers). The RANGE is extremely
sensitive to outlying data and thus not usually a good descriptor of the data variation when
working with applied geochemical or environmental data.

4.2.2 The interquartile range (IQR)


A more robust equivalent of the RANGE is the interquartile range, IQR. Instead of calculating
MAX − MIN, the difference between the 3rd and the 1st quartile of the data is calculated. Thus
the upper and lower 25 per cent of the data are not used in the calculation; the IQR measures the
range of the inner 50 per cent of the data. To reach conformity with the spread measure of the
underlying data distribution, the IQR needs to be multiplied by a constant. For an underlying
normal distribution the value is 0.7413. The resulting formula for the spread-measure IQR of
a normal distribution is:

IQR = 0.7413 · (Q3 − Q1),

where Q1 and Q3 are 1st and 3rd quartiles of the data.


The IQR is robust against 25 per cent of data outliers or extreme values, and thus considerable
deviations in the tails of the normal distribution can be tolerated. Because the IQR is robust, it
can be used as a measure of spread even for real data that are skewed and have a high proportion
of outliers.

4.2.3 The standard deviation


The standard deviation, SD, considers (just like the MEAN) each single data point. The average
of the squared deviations of the individual values from the arithmetic mean is computed. This
describes the average spread (n is replaced by n − 1 to correct for the estimated mean) of the
data around the central value. To report the measure in the original data units, the square root
of the results is defined as the SD:

$$\mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mathrm{MEAN}\right)^2}\,.$$

When calculating the standard deviation for real data, the problem again is that each data value
has the same weight. If the data are skewed and/or extreme values are present, not only will the
MEAN be biased (see above), but the SD will be even more biased (because squared deviations
are used to estimate it). Because of this fact, the SD should never be calculated for data without
a prior check of the distribution. If the data do not show a symmetrical distribution, it is most
prudent to not estimate and provide a SD.
To get a meaningful estimate of the standard deviation, the distribution must approach
normality before any calculations (see Chapter 10 on transformations). In the case of data
with a normal distribution, the SD is the best measure of spread. In applied geochemistry
and the environmental sciences, the data are sometimes strongly right skewed, and then
the standard deviation is not a useful measure of spread. The MEAN can be calcu-
lated for the log-transformed data and then re-transformed to the original data scale. The
spread (expressed as SD), however, cannot be calculated for the log-transformed data
and then back-transformed to the original data scale, because the scale is changing un-
der transformation (see Figure 3.3, histogram, and compare range for the original data
(RANGE = MAX(Ba) − MIN(Ba) = 1300 − 4.7 = 1295.3) and for the log-transformed data
(RANGE = MAX(log10(Ba)) − MIN(log10(Ba)) = 3.11 − 0.67 = 2.44)).

4.2.4 The median absolute deviation (MAD)


The robust equivalent of the SD is the median absolute deviation, MAD. It is also a measure
of average deviation from the central value. In contrast to the SD, for calculating the MAD
the MEDIAN is taken as the central value. The absolute deviations are computed and their
average estimated with their MEDIAN. Unlike the standard deviation, and like the IQR-based
estimate of spread, it is not necessary to take the square root because the MAD is already in
the same unit as the data. The median of the absolute deviations provides a value that again
has to be multiplied by a constant so that it conforms to the underlying distribution, to provide
the MAD. For the normal distribution the constant is 1.4826.

$$\mathrm{MAD} = 1.4826 \cdot \mathrm{median}_i\left(\,|\mathrm{MEDIAN} - x_i|\,\right).$$

The MAD is robust against up to 50 per cent of extreme values in the data set because it is
based solely on MEDIANS. Due to the robustness of the MAD considerable deviations from
the normal distribution are tolerable.

When looking at the above example data set (Section 4.1.4):

1.8 1.9 2.0 2.1 2.3 2.4 2.7 5.9,


MEDIAN = 2.2;

to estimate the MAD, the difference between the values and the MEDIAN is calculated, namely:
−0.4 −0.3 −0.2 −0.1 +0.1 +0.2 +0.5 +3.7.
Then the differences are arranged in order of magnitude (neglecting the sign) and the MEDIAN
of these ordered values is determined:

0.1 0.1 0.2 0.2 0.3 0.4 0.5 3.7,


MEDIAN = 0.25.

The MEDIAN of the deviations is 0.25. Note that this value is again not influenced by the
absolute value of the outlier. The MAD is calculated by multiplying the MEDIAN of the
deviations with the “normal distribution” factor (1.4826, see above).
MAD = 1.4826 · 0.25 = 0.37.
Just for comparison the RANGE of our simple data set is 4.1, the SD is 1.35, and the IQR is:
IQR = 0.7413 · (2.55 − 1.95) = 0.44.
Note that if we use R to calculate the IQR, we will get a slightly different result (IQR = 0.37)
because R uses a better approximation of the quartiles. As mentioned above, just like IQR, the
MAD can be applied as a useful measure of spread for real data even when working with the
untransformed (original) data.
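All of these measures of spread are available in base R; note that mad() already includes the constant 1.4826, and that IQR() uses R's default quartile algorithm, which explains the small difference mentioned above:

x <- c(1.8, 1.9, 2.0, 2.1, 2.3, 2.4, 2.7, 5.9)
diff(range(x))    # RANGE = 4.1
sd(x)             # standard deviation, about 1.35
mad(x)            # MAD (constant 1.4826 included), about 0.37
0.7413 * IQR(x)   # IQR-based spread estimate, about 0.37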

4.2.5 Variance
Range, IQR, SD, and MAD all have a direct relation to the unit of the data. All these measures
of spread can be used to estimate the variance by simply taking the square of the obtained
values. The “sample variance” is defined as
$$\mathrm{VAR} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mathrm{MEAN}\right)^2,$$

and the SD is calculated as the square root of VAR.

4.2.6 The coefficient of variation (CV)


The coefficient of variation (relative standard deviation) is independent of the magnitude of
the data and thus of the data measurement units. It is usually expressed in per cent (the CV value
multiplied by 100). It is well suited to compare the variation of data measured in different
units, or the variation in data sets where the MEDIANS are quite different. The coefficient of
variation (CV), also known as the relative standard deviation (RSD), is defined as:

$$\mathrm{CV}\,\% = 100 \cdot \frac{\mathrm{SD}}{\mathrm{MEAN}}\,.$$

4.2.7 The robust coefficient of variation (CVR)


The robust coefficient of variation, CVR, is defined as:

$$\mathrm{CVR}\,\% = 100 \cdot \frac{\mathrm{MAD}}{\mathrm{MEDIAN}}\,.$$

Just as the CV, it is usually expressed in per cent.
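Both coefficients follow directly from the measures already computed; for the small example data set used above:

x <- c(1.8, 1.9, 2.0, 2.1, 2.3, 2.4, 2.7, 5.9)
100 * sd(x) / mean(x)        # CV in per cent
100 * mad(x) / median(x)     # CVR in per cent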

4.3 Quartiles, quantiles and percentiles

Quartiles, quantiles and percentiles have already been used in the previous chapter when dis-
cussing QQ-, CP- and PP-plots and constructing a boxplot. Quartiles divide the data distribution
into four equal parts. One-quarter of the data occur below the first quartile, three-quarters of the
data above the first quartile. The MEDIAN has been defined in Section 4.1.4 – half of the data
occur above the MEDIAN, accordingly the other half of the data occur below the MEDIAN.
Three-quarters of the data occur below the third quartile and one-quarter above.
For studying the tails of a given data distribution, one could be interested in other fractions
than quartiles. Generally, an α-quantile is defined as the value where a fraction α of the data is
below and the fraction 1 − α is above this value. Values of α = 0.02, 0.05, 0.1, 0.9, 0.95, and
0.98 are often of special interest to identify extreme values of a given data distribution.
Percentiles are defined analogously to quantiles. For percentiles the sought fractions of α
are expressed in percent, i.e. 2, 5, 10, 90, 95, and 98 per cent of the data distribution.
The position of quartiles, quantiles, and percentiles in a given data set is not changed
by strictly monotone transformations like the log-transformation. This fact was used when
constructing the log-boxplot (Section 3.5).

4.4 Skewness

Skewness is a measure of the symmetry of the data distribution; it is computed from the third
moment about the mean (the variance is the second moment about the mean) (see, e.g.,
Abramowitz and Stegun, 1965). Skewness is zero if a distribution is symmetric; it takes
on negative values for left-skewed data and positive values for right-skewed data. The value
in itself conveys little. Skewness should be standardised with a constant, depending on the
sample size. The standardised skewness should be in the range ±2, otherwise the skewness
must be considered to be extreme.

4.5 Kurtosis

Kurtosis is a measure that will indicate whether the data distribution is flat or peaked; it is
computed from the fourth moment about the mean (see, e.g., Abramowitz and Stegun, 1965).
For a normal distribution the normalised (excess) kurtosis is zero, the “raw” kurtosis is three. The
normalised kurtosis takes positive values for peaked, heavy-tailed distributions (many values far
out in the tails) and negative values for flat, light-tailed distributions. Just as skewness, it
should be standardised with a constant depending on the number of samples. If the standardised
kurtosis is outside the limit ±2, it is considered to be extreme.
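Base R provides no built-in skewness or kurtosis functions. One common moment-based definition can be coded as follows (several slightly different variants exist, e.g. in add-on packages, and the standardisation with a sample-size dependent constant mentioned above is not included here):

skewness <- function(x) {
  m <- mean(x); n <- length(x)
  (sum((x - m)^3) / n) / (sum((x - m)^2) / n)^(3/2)
}
kurtosis <- function(x) {   # "excess" kurtosis, zero for a normal distribution
  m <- mean(x); n <- length(x)
  (sum((x - m)^4) / n) / (sum((x - m)^2) / n)^2 - 3
}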

4.6 Summary table of statistical distribution measures

As a first general description of a data set, a summary table of some (or all) of the above
distribution measures is often needed. Table 4.1 shows such a table for selected variables of
the Kola Project Moss data set. It is up to the individual scientist to decide which parameters,
and especially how many percentiles, are included in such a table. According to the above
discussions, the variable name, minimum and maximum values, a central value (the MEDIAN
will usually be the most suitable choice when working with “real” data), and a measure of spread
(if the MEDIAN is provided as the central values MAD or CVR will be the best measures of
spread) should be provided as a minimum. Number of samples, 2nd and 98th percentile, and
the IQR are also of considerable interest. For environmental or geochemical data, the unit of
the variable and the lower detection limit should also be included.
Table 4.1 shows that for the Moss data set a detection limit problem exists for quite a number
of variables (minimum value < value of the detection limit, e.g., Ag, B, Bi, Cr, Na, Tl, U). For
these variables the number (or percentage) of samples below the detection limit is important to
know – if it is not provided in the summary table, it is important to consult the CP-plot of these
variables to determine the actual percentage. MEDIAN and G (also called “MEAN-log”, as it
is equal to the back-transformed arithmetic mean of the log-transformed data) are often quite
comparable and in the majority of cases considerably lower than the MEAN. This is a clear
indication of right-skewed distributions. For the same reason the values for SD are usually much
higher than those for MAD and IQR. The table clearly indicates that MEAN and SD do not
provide realistic measures of location and spread for most of the data. MEAN and SD are used
to calculate CV. It is thus no surprise to note the major difference between CV and its robust
counterpart CVR. For the data at hand, the CVR is the most reliable estimate of the coefficient
of variation. The elements Ni, Na, and Co show the highest CVR. The distributions of Ni and Co
are strongly influenced by the emissions from the Russian nickel industry, the distribution of Na
is strongly influenced by the steady input of marine aerosols along the coast of the Barents Sea.
Variables with a high CVR may thus suggest the existence of an unusual process in a survey area.
In Table 4.1 all variables are reported in the same unit (mg/kg). Many scientists use different
units to provide a more comparable data range for the variables. This is quite convenient for
tabulating data because all numbers can be shown in the same format and kept short. However,
for comparisons between variables, it is always necessary to consider two different pieces of
information, the data value and the unit, at the same time. The direct comparability of variables
is thus limited. For computing and graphical data analysis, the “beauty of the number” is of
no importance. Many data analysis systems do not permit storing and using both the unit and
the value of the variables. Unit and values can thus become separated, and it can be quite
difficult to “guess” a unit for a variable. For practical purposes and direct comparability of the
values it is thus a good idea to reduce all values to just one unit wherever possible. In applied
geochemistry the most convenient data unit is mg/kg, and to avoid confusion in case the unit
is lost at some stage of data processing, it may be a good idea to keep all values in all data files
in just this one unit (see discussion in Chapter 2).
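A summary table of this kind can be assembled in R by applying the chosen measures to each column of the data; the sketch below assumes a data frame called moss whose columns are all numeric and complete (the object name and the selection of measures are illustrative only):

sumtab <- t(sapply(moss, function(x)
  c(MIN = min(x), Q1 = quantile(x, 0.25, names = FALSE), MEDIAN = median(x),
    MEAN = mean(x), Q3 = quantile(x, 0.75, names = FALSE), MAX = max(x),
    SD = sd(x), MAD = mad(x), IQR = 0.7413 * IQR(x),
    CV = 100 * sd(x) / mean(x), CVR = 100 * mad(x) / median(x))))
round(sumtab, 2)    # one row per variable, one column per measure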

4.7 Summary

A number of statistical parameters are available to characterise a distribution. For the “classical”
normal distribution arithmetic mean and standard deviation are best suited. Environmental
Table 4.1 Summary table providing a selection of important statistical parameters to describe selected variables of the Kola Project Moss data set.
Element DL MIN Q0.05 Q1 MEDIAN MEAN-log MEAN Q3 Q0.95 MAX SD MAD IQR CV% CVR%
Ag 0.01 < 0.01 0.014 0.023 0.033 0.036 0.050 0.052 0.144 0.824 0.061 0.019 0.022 121 58
Al 0.2 34 92 141 193 215 300 285 768 4850 458 90 106 152 46
As 0.02 0.037 0.08 0.126 0.173 0.195 0.260 0.268 0.71 3.42 0.301 0.085 0.105 116 49
Bi 0.004 < 0.004 0.005 0.012 0.018 0.018 0.027 0.029 0.076 0.544 0.041 0.012 0.013 151 66
Ca 20 1680 2040 2360 2620 2677 2740 2940 3764 9320 681 415 430 25 16
Cd 0.01 0.023 0.042 0.069 0.089 0.095 0.115 0.121 0.259 1.23 0.111 0.036 0.039 96 40
Co 0.03 0.11 0.17 0.24 0.40 0.51 0.92 0.89 3.4 13.2 1.478 0.304 0.48 161 77
Cr 0.2 < 0.2 0.28 0.43 0.60 0.68 0.93 0.99 2.7 14.4 1.132 0.334 0.415 121 56
Cu 0.01 2.6 3.6 4.7 7.2 9.70 17 16 60 355 28 4.65 8.35 167 65
Fe 10 47 86 145 212 252 386 384 1396 5140 545 128 177 141 60
Mg 10 518 773 918 1090 1100 1132 1270 1691 2380 282 260 261 25 24
Mn 1 29 126 290 433 388 444 577 797 1170 204 214 212 46 49
Mo 0.01 0.016 0.039 0.058 0.080 0.087 0.107 0.120 0.27 1.08 0.096 0.037 0.046 90 46
Na 10 <10 24 42 71 77 107 136 305 918 99 53 69 92 75
Ni 0.3 0.96 1.46 2.27 5.39 7.12 20 18 81 396 41 5.43 11.84 209 101
Pb 0.04 0.84 1.59 2.29 2.98 3.02 3.34 3.84 5.8 29 2.06 1.13 1.15 62 38
S 15 543 708 792 863 877 888 952 1140 2090 154 119 119 17 14
Sb 0.01 0.011 0.021 0.030 0.041 0.043 0.052 0.056 0.125 0.623 0.045 0.018 0.019 87 43
V 0.02 0.28 0.67 1.08 1.60 1.75 2.58 2.54 7.5 84 4.58 0.91 1.08 178 57
Zn 1 11.7 19 26 32 32 34 39 54 82 11 9.4 9.5 32 29
DL: value of the lower detection limit; MIN = minimum value; Q = quantile, Q1= 1st quartile, Q3 = 3rd quartile; MEAN-log = mean of the log-transformed data back
transformed (equals G, the geometric mean); MAX = maximum value reported; SD = standard deviation; MAD = median absolute deviation standardised to conform with the
normal distribution; IQR: interquartile range standardised to conform with the normal distribution; CV = coefficient of variation (per cent); CVR: robust coefficient of variation
(per cent). n = 594, all values in mg/kg, no missing values.


data will usually deviate from the normal distribution and it is better to routinely use robust
counterparts, e.g., MEDIAN and median absolute deviation (MAD).
A good summary table of the variables contained in a data set will provide auxiliary in-
formation in addition to the “usual” statistical parameters, e.g., number of values, number of
missing samples (if any), the measurement unit(s), the detection limits. A comparison of “clas-
sical” and “robust” parameters will prove informative. If these clearly deviate there is good
reason to take great care during any subsequent statistical analysis of the data. Such deviations
indicate the presence of data outliers and/or skewed data distributions.
5 Mapping Spatial Data

As mentioned in the introduction, most data sets in applied earth sciences differ from data col-
lected by other scientists (e.g., physicists) because they have a spatial component. They present
data for individuals/samples that were taken somewhere on Earth. In addition to analytical re-
sults, the samples are characterised by coordinates. During data analysis of environmental
and earth science research results, this spatial component must be included. To effectively
study the spatial component, the data’s statistical structure must be shown in appropriate maps
(Reimann, 2005).
Many studies investigating the regional distribution of chemical elements in different sample
materials have been prepared during the last 40 years. Geochemical atlases or collections of
regional distribution maps of selected chemical elements are available for many countries
(see, e.g., Webb et al., 1964, 1978, Webb and Howarth, 1979, Weaver et al., 1983, Shacklette
and Boerngen, 1984; Fauth et al., 1985; Bølviken et al., 1986; Jianan, 1989; Thalmann et al.,
1989; Lahermo et al., 1990; Kolijonen, 1992; McGrath and Loveland, 1992; British Geological
Survey, 1993; National Environment Protection Agency of the People’s Republic of China,
1994; Rühling, 1994, Lis and Pasieczna, 1995; Lahermo et al., 1996; Mankovská, 1996; Rapant
et al., 1996; Rühling et al., 1996; Reimann et al., 1998a; Rühling and Steinnes, 1998; Siewers
and Herpin, 1998, Čurlı́k and Šefčı́k, 1999; Kadunas et al., 1999; Li and Wu, 1999; Rank et
al., 1999; Ottesen et al., 2000; Siewers et al., 2000; Gustavsson et al., 2001, Skjelkvåle et al.,
2001a,b; Reimann et al., 2003; Salminen et al., 2004, 2005).
The resulting maps from some of these investigations may look informative whereas others
are featureless. Is this a question of the area or element mapped, or the area’s size? Is it just a
question of whether there are anomalies, outliers, present in the mapped area? Has it something
to do with the way the map was designed, and is it thus a question of art? Or is it possible to
make technical mistakes when preparing a geochemical map that will result in a severe loss of
information? There is of course, also, the underlying additional question of whether this might
be a consequence of poor survey design and data quality.
Considering the costs invested in sample collection and chemical analyses to make a geo-
chemical map, and particularly for producing a multi-element geochemical atlas, it is surprising
that so little attention appears to have been paid to geochemical map preparation. Reliable and
informative geochemical maps are needed in mineral exploration, in environmental studies,
and for communicating relevant geochemical findings to environmental regulators and policy
makers. Studying many existing examples, it appears that practitioners in applied geochemistry


and environmental sciences have not evolved a common effective methodology. Sometimes
one has the impression that the geochemical data just have been dropped onto an existing base
map. Many different mapping techniques have been suggested and tried over the past 40 years,
but no standard has emerged. In contrast, geographers pay great attention to map design and
production (see, e.g., Monmonier, 1996).
In this chapter different mapping techniques suggested over the years will be reviewed and
the resulting maps illustrated. Advantages and disadvantages of these maps for understanding
the geochemical process(es) creating the distribution patterns and for highlighting geochemical
anomalies will be discussed. In the following it is demonstrated that several techniques are
well suited to present geochemical data, while others fail.

5.1 Map coordinate systems (map projection)

Map projections are needed to transform the three-dimensional surface of the planet Earth
onto a flat, two-dimensional plane, the map. Many different map projections, and thus many
different coordinate systems, are in use in geographical mapping. Due to the fact that a three-
dimensional globe is projected onto a two-dimensional plane, all resulting maps have some
kinds of scale distortion. It is important to know in what coordinate system the locations for
the samples were recorded – and preferably that system should match the coordinates of any
background maps to be employed. Most Geographical Information System (GIS) programs
are able to transform coordinates between the different systems, but it is essential that those
systems are known so that the correct transformations may be applied. In many parts of the
world the UTM-system (Universal Transverse Mercator projection) is widely used, and co-
ordinates are reported in metres easting (X-coordinates) and metres northing (Y-coordinates)
as in the example data sets in this book. The world is divided into 60 UTM-zones, each 6°
of longitude wide. Thus it is possible for survey areas to span two or more zones, and it is
essential that prior to map preparation, the location coordinates are all transformed to a single
zone. When a survey area spans several UTM-zones it may be preferable to use a different
projection to minimise the distortions that occur in displaying a three-dimensional surface on
a plane, in such instances Lambert Conic Projections and Albers Equal Area may be used.
When using projections for small scale maps, the projection property of equal area is impor-
tant so that biased impressions of spatial extent of patterns are not introduced. In the field
the GPS-systems now widely used for determining location will automatically switch from
one zone to another when crossing the boundary between two zones. Thus it is necessary
to record the UTM-zone of the location in addition to the coordinates themselves, this leads
to a 15 digit coordinate that with a leading minus sign for locations in the southern hemi-
sphere uniquely locates any point on the Earth’s surface to one metre. Thus a geochemical
database often requires three fields to record the location of a sample site, UTM-zone, east-
ing, and northing. Only in surveys within a single UTM-zone could the zone information
be omitted, but the zone must be recorded in the metadata for the project, and if possible,
be recorded in the variable definitions table of the database, if one is being built. It must be
remembered that it is important to record the spheroid and datum used for field coordinate
acquisition and the display map projection. For example, in North America a change has been
made from North American Datum 1927 (NAD27) to North American Datum 1983 (NAD83).
As a result a GPS receiver may be set up to generate NAD83 coordinates using the WGS84

spheroid, but old base maps may be for NAD27 on the Clarke 1866 spheroid; this can lead
to errors of up to 100 metres. Some organisations working internationally and in countries
with a wide east–west span may choose to record locations directly in geographic coordi-
nates, i.e. latitude and longitude. With the advent of GIS, the conversion to another coordinate
system and the use of an appropriate projection for map presentation has become a simple
task.
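As an illustration of such a conversion, the sketch below uses the sf package (a more recent tool than the GIS software referred to in this book); the column names, coordinates, and the choice of UTM zone 35N (EPSG code 32635) are purely illustrative:

library(sf)
sites <- data.frame(lon = c(24.7, 30.1), lat = c(68.9, 69.7))   # two invented locations
pts <- st_as_sf(sites, coords = c("lon", "lat"), crs = 4326)    # WGS84 geographic coordinates
pts_utm <- st_transform(pts, 32635)                             # WGS84 / UTM zone 35N
st_coordinates(pts_utm)                                         # easting and northing in metres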

5.2 Map scale

One of the first questions arising when producing a map or an atlas is related to choice of scale.
In geography a long tradition of using well-defined mapping scales exists (e.g., ratio scales,
called “Representative Fractions”, like 1:10 000, 1:50 000, 1:100 000, 1:250 000, 1:500 000,
1:1 000 000, 1:10 000 000 where one unit of distance on the map relates to the stated unit of
distance on the ground), depending on the size of the mapped area and the purpose of the map.
A scale of 1:10 000 is larger than a scale of 1:250 000 because 1/10 000 is a larger fraction than
1/250 000. Public authorities commonly use large scales, in urban areas 1:5000 or larger, because
they need to show much detail in the maps (e.g., the exact location of power and water lines).
The correct scale of a map depends on the amount of detail that needs to be shown and the
amount of generalisation, either smoothing out or omissions, that is acceptable. Note that often
several scales may be needed to provide a realistic impression of both local detail and the broad
overall picture. Scale should be optimised for purpose.
Geographical atlases are often large-format books because whole countries or even conti-
nents are fitted on a single page, yet the maps still contain a wealth of information. Geochemists
appear to have somewhat misunderstood that concept and concluded that “an atlas has to have
a large format”. As a consequence many geochemical atlases are large-format books or map
collections that are awkward to use. The key factor in designing geographical maps is scale;
the map should contain just enough information so that the user can easily grasp the inherent
facts at the selected scale. Too much information and the map is hard to read and assimilate, too
little information, and it is uninteresting and not fit for purpose. Sample density is an important
factor that should determine the scale of a geochemical map. Is it, for example, important to
see the exact location of every single sample point, or rather should the representation reflect
the processes influencing the regional distribution of the mapped variable?
Geochemical surveys usually result in many maps (at least as many maps as analysed
elements and additional parameters). Although handling many maps has become much easier
with the advent of GIS, many maps still have to be compared with one another and with maps
containing additional information that may explain some of the observed patterns. Experience
has shown that A4 (210 × 297 mm) or letter (8.5 × 11 inches) size is a good practical format
for a geochemical map. This size also allows the easy use of the maps for other purposes such
as readable slide presentations. The A4 (or letter) format may not be possible for high-density
surveys where it is important to show the results for each single sample location, such as in
some detailed mineral exploration or contamination studies, or in case of unfortunate survey
design (e.g., clusters of locations arising from a high spatial sampling density versus large tracts
of land with no samples collected). Geochemists need to be aware that usually the information
value of a geochemical map increases when the scale is decreased to give the smallest map
possible that still shows all the required information.

5.3 Choice of the base map for geochemical mapping

A geochemical map could be plotted without any geographical (or other) background infor-
mation. Sample sites with coordinates and the analytical results are sufficient to start mapping
and identifying spatial structures in the data. However, the samples were collected somewhere,
and users more often than not will want to see the geochemical structure in an informative
geographical context. The identification of the geographical location is often essential to aid
interpretation of the identified spatial data structures. Thus a suitable base map for plotting the
geochemical results is required.
A geochemical map will thus usually be superimposed on some kind of topographical
background. The choice of base map can be the reason why geochemical atlases are often so
large. However, if the base map is primarily a geographical (or geological) map, well designed
for that purpose, it contains already the maximum possible information density for that scale.
The addition of geochemical information on such a map will frequently result in a cluttered
appearance, and such maps are hard to read. As one consequence, it may be necessary to map at
a considerably larger scale than the presentation of the geochemical data alone would require.
Figure 5.1 shows two examples of likely base maps for mapping the Kola Project data. The
geological map from the Kola Project area (from Figure 1.2) is one possible choice (Figure 5.1,
upper left). The other map (Figure 5.1, upper right) offers minimal topographical information to
provide orientation about the location of the samples. In terms of interpreting the geochemical
results in a geological context, it is tempting to opt for the geological map as base map.
However, there is already so much information in the geology map that adding geochemi-
cal data on top results in a thoroughly cluttered map where neither geology nor geochemistry
remain clearly visible (Figure 5.1, lower left). Furthermore, when choosing geology as the
background, the underlying assumption is that geology is the most important geochemical
process determining the element distribution in the area. This may not be the case, as several
other important geochemical processes in the area exist that can influence the regional distri-
bution of the mapped element. In the example map it is not the Caledonian sediments along
the coast of the Barents Sea that determine the Mg concentrations in the O-horizon, but rather
the steady input of marine aerosols along the coast of the Barents Sea. When selecting an
unsuitable base map the chance of recognising such processes unrelated to the base map will
be severely limited. The right-hand maps in Figure 5.1 (upper and lower right) are thus the
much better and more objective choice for geochemical mapping. The geological background
can, of course, still be a very informative addition in an advanced stage of data analysis and
mapping.
When producing geochemical maps or atlases, the most important information is the spatial
distribution of the chemical elements studied. This statement implies that it is the geochemistry
that should govern the mapping and not geography, geology, or other ancillary information.
Geochemical maps are primarily produced to understand the processes causing the spatial
distribution patterns of chemical elements and to determine the sources that account for those
patterns, such as an unusual rock type, a mineral deposit, or contamination from human ac-
tivities. When producing a set of multi-purpose geochemical maps, information not related
to geochemistry should be avoided because it distracts attention from the primary focus of
the maps. However, a minimum of geographical information is necessary to allow for easy
orientation. More background information, such as the geology, may be added later if required
for interpretation. Geology underlaid on a geochemical map right at the beginning may dis-
tract the eye from other important patterns in the data. It may also result in a biased map

Figure 5.1 Geological map (upper left, see Figure 1.2 for a legend) and simple topographical map
(upper right) of the Kola Project area used as base maps for geochemical mapping (lower maps). Lower
maps: distribution of Mg in O-horizon soils added to base maps. A colour reproduction of this figure
can be seen in the colour section, positioned towards the centre of the book

interpretation because it suggests that the features on the geological map should determine
the distribution of the elements. For instance, if the map shows only the geological units and
not the fault structures, geochemical patterns related to faults may not be discerned. If the
main process determining the distribution of an element is non-geological, it may not even
be noticed. Such maps should not be the first product of data analysis and mapping, but are
more appropriate for an advanced stage of mapping, making full use of the power of modern
GIS. A good first geochemical map should contain enough geographical information for basic
orientation.

A map must also have a way to determine scale and orientation (north arrow), a legend and
a text stating the element mapped, and probably some information on the sample medium,
sample preparation, and method of analysis. In some applications it may also be important to
add the date of map generation. Geochemical maps are often copied and used out of context;
thus some basic geochemical information like element name, unit, method of analysis, and
sample material may be quite useful on the same sheet so that it will not be lost during
copying. Again, the amount of information that can be presented for assimilation depends on
the map’s purpose. Note that a map appropriate for a printed atlas is often too cluttered for a slide
presentation.

5.4 Mapping geochemical data with proportional dots

For presenting geochemical data in a map, usually the element values are grouped, and the
resulting classes are then displayed on a map, either in the form of black and white or coloured
symbols (see Section 5.5).
Björklund and Gustavsson (1987) discussed the advantages of avoiding classes altogether
and using symbols that increase in size continuously in relation to the absolute measured
analytical value for placing geochemical data on a map. This led to the “proportional dot
map”, which is today one of the most popular ways of preparing black and white geochemical
maps. The resulting maps are often graphically pleasing and are easy to read.
However, an interesting-looking proportional dot map depends crucially on the scaling of
the dots. If the dots simply grow linearly with the analytical values, the outcome depends
strongly on the data distribution: some maps will show clear spatial contrasts, while on others
most of the dots will have almost the same size. Depending
on the distance of the maximum value from the main body of data, there may appear only
one really large dot on the map – or a whole lot of dots of almost the same size (Figure 5.2,
left).
To overcome the problems with different data distributions, an exponential dot size func-
tion determining the relative proportions of the dots was introduced. Gustavsson et al. (1997)
describe in detail an exponential dot size function, which is monotonically increasing and
empirically fitted to two percentiles (e.g., 10 and 99 per cent) of the empirical cumulative
distribution of observed values on the map area. Thus the symbol sizes for the smallest (10
per cent) and largest (1 per cent) data points are constant. In between the two selected per-
centiles the dots grow exponentially. Thus the exponential growth of the dot sizes is slow near
the detection limit (where there is often large variation due to poor analytical quality, see Chap-
ter 18) and becomes steeper towards larger values. The size function is purposely designed for
geochemical variables following a lognormal or positive skew distribution. It does, however,
have a disadvantage: it over-emphasises the highest and the lowest values, indirectly indicating
that in-between there is nothing of much importance. However, about 90 per cent of all data fits
into this “in-between” category. In general, the size function will down-weight the low values
and put more weight on the high values.
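In R, one possible form of such a dot-size function can be sketched with base functions only; the shape of the curve, the reference percentiles and the minimum and maximum symbol sizes below are illustrative assumptions and not the published function of Gustavsson et al. (1997).

# Sketch of an exponential dot-size function: the symbol size is held constant below the
# lower and above the upper reference percentile and grows exponentially in between.
dot.size <- function(x, p.low = 0.10, p.high = 0.99, size.min = 0.3, size.max = 2.5) {
  q <- quantile(x, probs = c(p.low, p.high), na.rm = TRUE)
  xc <- pmin(pmax(x, q[1]), q[2])                  # clamp to the two reference percentiles
  f <- (exp((xc - q[1]) / (q[2] - q[1])) - 1) / (exp(1) - 1)   # 0 at q[1], 1 at q[2]
  size.min + f * (size.max - size.min)
}
# Usage with base graphics, assuming coordinate vectors XCOO, YCOO and a variable As:
# plot(XCOO, YCOO, pch = 1, cex = dot.size(As), asp = 1)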
Using this technique, proportional dot maps showing clear regional patterns can be prepared
for almost any data set (Figure 5.2, right). However, because some users appear unaware of
this technique, and others may not have software implementing the dot size function, many
proportional dot maps in practice remain uninformative. A general problem with proportional
dot maps is that dots sized on a continuous scale are often difficult to distinguish from

Figure 5.2 Two maps showing the distribution of As in C-horizon soils of the Kola Project area. A map
based on a linear relationship between analytical value and proportional dot size (left) and a map of
the same data using an exponential weighting function (right)

one another in a map unless they occur in close proximity. A further disadvantage may be
that due to their optical weight, the attention of the reader is completely drawn to the high
values.

5.5 Mapping geochemical data using classes

It is often assumed that geochemical data follow a single (log)normal distribution, which
is “disturbed” by a limited number of extreme values, the outliers. Traditionally the high
extremes have been of greatest interest. The problems with this concept are discussed in a
number of recent papers (see, e.g., Reimann and Filzmoser, 2000a; Reimann and Garrett,
2005b; Reimann et al., 2005c) and in several chapters of this book. In general, geochemical
data do not follow a normal or lognormal data distribution but are poly-populational. The
resulting distribution may mimic a lognormal distribution due to the superimposition of a
number of separate distributions related to different geochemical processes. These processes
are reflected in separate data populations, i.e. sets of data with unique statistical parameters,
such as means and standard deviations.
Usually, the most dominant distributions in regional geochemical surveys are data related
to the geochemically distinct bedrock lithologies present in the survey area. Superimposed on
these rock type-related distributions are effects of secondary processes, such as anthropogenic
contamination, sea spray, or enrichment or depletion of elements due to a wide variety of causes
(pH, grain size, Fe- and Mn-oxyhydroxides, presence and amount of organic material, to name
just a few). All these processes are location dependent. Sometimes non-bedrock factors can
become so dominant that the “original” geochemical signature of different bedrock lithologies
is lost.
Thus, geochemical data do not consist of independent samples as assumed in classical
statistics, but the samples related to certain processes are linked additionally by a spatial

dependence. Therefore the task in geochemical mapping and interpretation cannot be to just
detect some few high values, which may be indicative of a large mineral occurrence or ex-
treme contamination in a survey area. The task is rather to display these different processes
determining the data structure in map form and to detect local deviations from the dominating
process in any one sub-area (Reimann, 2005). Due to the fact that multiple processes are
involved, this may appear close to impossible at first glance. However, by splitting the data
into groups on the basis of order statistics (see Sections 5.5.2 and 5.5.3) it is possible to display
the spatial aspects of the data structure in a map such that the symbols or isopleths reflect
at least a limited number of the main processes underlying the regional distribution of the
elements.

5.5.1 Choice of symbols for geochemical mapping


Almost as many different symbol sets as there are geochemists have been used for geochemical
mapping. The lack of standardisation for symbols used for geochemical mapping has resulted
in maps that are not directly comparable, and the preferred use of proportional dot maps. These
do, however, focus on very high values and do not facilitate detecting the data structure in a
map.
EDA approaches have opened the way to accepting a standard set of symbol class
boundaries. However, within this context different practitioners have used different symbol
sets for mapping (Figure 5.3). These aim to provide an even optical weight for each symbol
in a map in order to be able to focus on data structure and not on “high” or “low” values. The
original EDA symbol set is based on the boxplot, reflecting the data structure, and permits the
mapping of five classes. Limiting the number of classes results in a rather “quiet” and relatively
easy to grasp distribution on a map. Experience teaches that seven classes is the maximum
number that should be shown in a black and white map.

[Figure 5.3: symbol legends for the five classes (highest, higher, inner, lower and lowest values) in each of the three symbol sets]
Figure 5.3 Three possible symbol sets of use for geochemical mapping: original EDA symbol set; EDA
symbol set with accentuated upper extreme values; an alternative symbol set as used by the Geological
Survey of Canada (GSC). If desired these symbol sets can be easily extended from five to seven classes
by using an additional size-class for the outer symbols

Although today colour is most widely used for mapping, black and white geochemical maps
still have the advantage of being inexpensively copied and printed. In addition, black and white
maps introduce less perceptual bias than colour maps. This may have substantial advantages
when trying to detect geochemical processes with such a map. As a further advantage, colour-
blind individuals are able to work with black and white maps. However, for a black and white
map the choice of symbols is crucial. A black and white map can also be drawn as a contour
(isopleth) map, avoiding the use of discrete individual symbols altogether, or as a smoothed
surface map (see Section 5.6).
Note that the choice of symbols and their size in relation to the size of the map are critical
to map appearance. It is not a trivial task to find the optimal symbol type and size for a
map. Too many different symbols will usually result in a cluttered and thus hard to read map.
Because the symbol size must have a relation to the scale of the map, it is a clear advantage
if plotting software permits scaling a whole group of symbols at once so that the relative
size proportions between the symbols stay constant once a good set of symbols has been
determined.

5.5.2 Percentile classes


Percentiles are based on order statistics and their use does not assume any underlying data
distribution. This is a major advantage when dealing with geochemical data. However, the
geochemist is left with the decision as to how to distribute the classes over the range of
percentiles from 0 to 100. If the task is to identify geochemical processes from regional patterns
in a map, this data structure must be revealed in the map. One possibility is to spread the symbols
or colours almost evenly across the range of values, e.g., via using the 20th , 40th , 60th , 80th
percentiles (Figure 5.4, lower left). However, geochemists are often more interested in the tails
of the data distribution. To highlight the tails the 5th , 25th , 75th , and 95th percentiles can be
used (Figure 5.4, upper left and upper right). Additional classes can easily be accommodated.
A logical choice may be to include further class boundaries at the 50th and 98th (and possibly
2nd ) percentiles (Figure 5.4, lower right). An extended version of the EDA symbol set as
well as the Geological Survey of Canada symbol set as introduced above (see Figure 5.3) can
easily handle up to seven classes for mapping (Figure 5.4, lower right). Maps constructed with
percentile classes and one of these symbol sets will usually facilitate the direct recognition
of a number of major processes determining the distribution of the mapped variable in space.
A remaining problem is that when using percentiles, there is no satisfactory way to identify
extreme values or true outliers. The percentile-based map thus includes the assumption that
the uppermost, or lowermost, two, five, or ten percent of the data are outliers. This can be
justified as identifying an appropriate number of samples for further inspection (Reimann
et al., 2005c).
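In R, percentile class boundaries of this kind can be computed with the standard quantile() and cut() functions. The sketch below uses the seven classes just mentioned; the variable name As is an assumption, and with many tied values (e.g., at a detection limit) duplicated boundaries are removed with unique().

# Percentile-based class boundaries (0, 5, 25, 50, 75, 95, 98 and 100th percentiles)
breaks <- quantile(As, probs = c(0, 0.05, 0.25, 0.50, 0.75, 0.95, 0.98, 1), na.rm = TRUE)
cls <- cut(As, breaks = unique(breaks), include.lowest = TRUE)   # factor with the classes
table(cls)                                                       # number of samples per class
# as.integer(cls) can then be used to pick a symbol (pch) and a symbol size (cex)
# from predefined vectors when plotting the sample coordinates.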

5.5.3 Boxplot classes


The boxplot is based on order statistics and is almost free of any assumption about data
distribution (see Section 3.5). More than 15 years ago Kürzl (1988) realised that it is also well
suited to define classes for geochemical mapping. O’Connor and Reimann (1993), O’Connor
et al. (1988) and Reimann et al. (1998a) subsequently used this technique with great success.
It should be recognised, however, that the normal distribution sneaks into the definition of the

Figure 5.4 Geochemical maps of the variable As in the Kola C-horizon based on different percentile
classes and using EDA (upper left), GSC (upper right), extended EDA (lower left), and accentuated EDA
(lower right) symbol sets

boxplot when the whiskers, the borders for extreme values, are computed. It was demonstrated
above (see Section 3.5) that the boxplot will not recognise lower extreme values and will
seriously overestimate the number of upper extreme values when the data distribution is right
skewed (the most usual case in applied geochemistry). If identifying “too many” upper extreme
values is desirable, one can continue using the original Tukey boxplot with the “raw” data.
The boxplot then provides a simple and fast method to define class boundaries for mapping.
If the task is to really study the data structure in a map, it may be preferable to use percentiles,
a version of the boxplot based on percentiles, or a version of the boxplot that takes care of the
vulnerability of the original boxplot to skewed distributions (e.g., the log-boxplot).
Five main classes result from using the boxplot for class selection for mapping: lower
extreme values to lower whisker, lower whisker to lower hinge, the box, containing the inner

50 per cent of data (the box can be divided into two classes if needed), upper hinge to upper
whisker, and upper extreme values. A set of black and white symbols (see Figure 5.3) based
on this EDA approach can be used to display these five classes on a map. These symbols are
based on the concept that to show the data structure on a map objectively, each class should
have an even optical (graphical) weight. This is achieved by using large symbols for the lower
and upper extreme values. These symbols will not dominate the map, because there are usually
not a large number of extreme values. At the same time, the symbols of circle (“○”) and cross
(“+”) or square (“□”) are almost intuitively interpreted as low and high. The next classes at
both ends of the distribution, consisting of ≤ 25 per cent of the data each, employ smaller
versions of the extreme symbols. The inner 50 per cent, and thus the majority, of the data have
the smallest symbol, a dot. However, in some instances the dots are hard to see, and in such
instances a small cross can be used to advantage (GSC symbol set, Figure 5.3).
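A minimal sketch of how such boxplot-based class boundaries can be derived in R with the base function boxplot.stats() is given below; the log-transformation follows the recommendation above for right-skewed data, and the variable name As is again an assumption (this is not the book's own mapping code).

# Tukey boxplot statistics of the log-transformed data: lower whisker, lower hinge,
# median, upper hinge, upper whisker, back-transformed to the original scale.
bs <- 10^boxplot.stats(log10(As))$stats
breaks <- c(min(As, na.rm = TRUE), bs[c(1, 2, 4, 5)], max(As, na.rm = TRUE))
cls <- cut(As, breaks = unique(breaks), include.lowest = TRUE)
table(cls)   # up to five classes: lower extremes, lower, box (inner 50 per cent), upper, upper extremes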
Maps constructed using these symbol sets look unusually “calm” at first glance, maybe even
lacking in information content, to someone used to reading maps where the eye is automati-
cally drawn to the high values. Their real power lies in the fact that the spatial data structure
becomes visible (Reimann, 2005). The underlying geochemical processes usually determine
the spatial data structure. Such maps can thus be used to understand and interpret the main
processes governing the geochemical distribution in space. In addition, even locally unusual
data behaviour which is not marked as “extreme” can be easily recognised in such a map in
the form of the occurrence of different symbols in an otherwise uniform area. Furthermore,
via the boxplot an automated check for extreme values is performed (based on an assumption
of log-normality), which is the main advantage of the boxplot over percentiles. Note that for
this to be effective, a symmetrical boxplot that has been adjusted to handle right-skewed data
distributions should be used, for example by employing a log-transform, or a large number of
upper extreme values will be displayed. Instead of the EDA symbols, colour classes can also be
used. With the exception of the rare situations where no outliers exist, colour maps constructed
using boxplot classes will in general look similar to maps constructed with percentile classes
with a symmetrical distribution of symbols about the MEDIAN. Figure 5.5 shows the As dis-
tribution in the C-horizon of the Kola Project area based on boxplot classes. Figure 5.5 (right)
uses the original EDA symbols with an accentuated outlier symbol. To satisfy the geochemist's
quest for the highest value, this symbol grows continuously in direct relation to the analytical
result.
Considering the major advantages of boxplot classes in combination with EDA symbols
for black and white mapping, it is surprising that the technique has found so little application
in applied geochemistry. One reason is probably that other maps, such as proportional dots,
look much simpler at first glance. Another may be the lack of available software to prepare
such maps. Many geochemists still think in terms of “high” (= interesting, may indicate a
mineral occurrence, or contamination in environmental sciences) and “low” (= useless, no
mineralisation in these areas, or background in contamination studies), and they seek a map
that simply displays the range of the data in a relative manner. However, EDA maps achieve
the task of spatially locating high and low extreme values even better, by displaying symbols
indicative of the extent of “extremeness” in terms of the data distribution itself. Other reasons
may be that EDA symbols need appropriate scaling in relation to the other features or symbols
displayed on the map (Kürzl, 1988) and that hardly any software is available that permits this
approach. An additional reason may be the overestimation of the number of upper extreme
values by the original Tukey boxplot when care is not taken to use a transform that brings the
data into symmetry.



[Figure 5.5 legends: boxplot class selection for As [mg/kg], EDA symbol set (left) and accentuated EDA symbol set (right), log-scale with class values from about 0.05 to 31 mg/kg; scale bars 0–100 km]

Figure 5.5 Boxplot class-based maps showing the distribution of As in C-horizon soils of the Kola
Project area. Tukey boxplot (with log-scale) classes and original EDA symbol set (left), Tukey boxplot
classes (log-scale), and accentuated EDA symbol set (right)

5.5.4 Use of ECDF- and CP-plot to select classes for mapping


One of the best procedures to study geochemical distributions graphically is to plot the empir-
ical cumulative distribution function (ECDF-plot) and cumulative probability plot (CP-plot)
(Sections 3.4.2 and 3.4.4) and seek breaks or changes of slope in the plots that can be used
to identify useful classes for geochemical mapping. The procedure is discussed in detail in
Reimann et al. (2005c). Working with ECDF- and CP-plots requires experience and detailed
study of data distributions prior to mapping. It is thus a technique that an experienced applied
geochemist will use in a second step of data analysis, once the first set of maps using a standard
default technique (preferably percentile or boxplot classes with EDA symbols) as described
above has been prepared.
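As a rough base-R approximation of these two displays (the book's dedicated plotting functions are not reproduced here, and the variable name As is an assumption):

# ECDF-plot and a simple cumulative probability plot of the log-transformed data
par(mfrow = c(1, 2))
plot(ecdf(log10(As)), main = "ECDF-plot", xlab = "As [log10 mg/kg]")
qqnorm(log10(As), datax = TRUE, main = "CP-plot", xlab = "As [log10 mg/kg]")
# Breaks or changes of slope identified in these plots can then be entered directly
# as class boundaries on the original scale, e.g. breaks <- c(0.05, 0.3, 1.2, 9.6, 31).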

5.6 Surface maps constructed with smoothing techniques

To construct surface maps, different techniques can be used. Some form of moving average
(or median) and more general interpolation techniques or kriging (Section 5.7) are most often
applied. Surface maps are needed because most users (and regulators and policy makers)
want maps that look “pleasing”. For a trained geochemist, hoping to extract knowledge about
geochemical processes governing the distribution of any one element in space, surface maps
have, however, several disadvantages. Firstly, information on local variability is lost. Secondly,
depending on the algorithm (e.g., search radius, distance weighting), maps having different
appearances will result. Thirdly, the untrained user may gain from surface maps the impression
of a much greater “precision” than the few samples they are based on will support.
Most applied geochemical data are spatially discontinuous (exceptions being anthropogenic
contaminants that do not occur naturally and are dispersed from point sources), with levels
controlled by geological lithologies and other environmental factors. Very detailed mapping shows

step discontinuities associated with these features, not smooth gradients as are expected from
geophysical maps where the properties of a “unit” are exerted beyond its physical extent.
Therefore it is usually best to prepare surface maps at scales four to five times smaller than
a survey’s working scale so that the generalisation is acceptable in the eyes of users knowl-
edgeable in the underlying discontinuities. This is particularly important in surveys employing
drainage basin sediments, where the recorded sample site location is not the location that data
conceptually represent. It must always be remembered that surface maps based on sparse data
can be misleading as their preparation requires extrapolation based on some mathematical
model that may not reflect the geochemical reality between the widely spaced data points.
The simplest method to construct a surface map is based on moving averages (for geochem-
ical data the moving median is the preferable choice). For constructing a moving median map
a window of a fixed size is defined. This is quite similar to smoothing lines (Section 6.4), but
the window is now defined by the x- and y-coordinates and is thus two-dimensional. The size
of the chosen window will have a major influence on the degree of smoothing. The window
is moved across the map, and all data points falling into the window are used to estimate a
value at the centre of the window. This value could simply be the MEDIAN of the points
within the window or some kind of weighted median, taking the distance of each point from
the centre of the window into consideration. The data points on the irregular sampling grid
are thus used to estimate new values on a regular grid. The resolution of the resulting map
will depend on the chosen raster size. The choice of the raster size, i.e. the intervals in the x
and y directions at which the estimator is computed, may be limited by the complexity of the
algorithm, i.e. computing time.
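A minimal sketch of such a moving-window median in plain R follows; the window half-width and the raster step are assumed values in the coordinate units (here km), and no distance weighting is applied.

# Moving-median estimate on a regular grid (square window, unweighted)
moving.median <- function(x, y, z, half = 25, step = 10) {
  gx <- seq(min(x), max(x), by = step)
  gy <- seq(min(y), max(y), by = step)
  out <- matrix(NA_real_, nrow = length(gx), ncol = length(gy))
  for (i in seq_along(gx)) {
    for (j in seq_along(gy)) {
      inwin <- abs(x - gx[i]) <= half & abs(y - gy[j]) <= half
      if (any(inwin)) out[i, j] <- median(z[inwin], na.rm = TRUE)
    }
  }
  list(x = gx, y = gy, z = out)   # suitable for image() or contour()
}
# e.g. mm <- moving.median(XCOO, YCOO, log10(As)); image(mm, asp = 1)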
More complex smoothing methods use polynomial functions to interpolate irregularly
spaced data onto a regular grid (see, e.g., Akima, 1996). These methods will as a rule re-
sult in smoother surface maps than moving median techniques. The selection of classes for
smoothed surface maps is a crucial step in providing meaningful geochemical maps. Theoret-
ically, an endless grey (or colour) scale could be used for a surface map, with its own grey (or
colour) value for each data value, avoiding classes altogether and thus adopting the philosophy
of the “proportional dot maps” (Section 5.4). In practice, the same problems as discussed for
the proportional dots (Section 5.4) occur, and the appearance of such a map is highly dependent
on the data distribution. Thus some kind of scaling to avoid a too-strong influence of outliers
and extreme values is needed (just like the “dot size function” – Section 5.4). This can be
achieved by using, e.g., 100 percentile classes, each with its own grey value (Figure 5.6, left).
If percentiles are used for defining a preset number of classes, the chosen classes must have a
relationship to the data structure (Reimann, 2005).
This can be achieved in the majority of cases when the boxplot is used to define the class
boundaries (Kürzl, 1988), resulting in five (or seven) classes for the map (see above, Section
5.5.3). Of course, instead of using the boxplot for class selection, it is also possible to directly
use percentiles, e.g., the 5th , 25th , 50th , 75th , 90th and 95th , as class boundaries (Figure 5.6,
right). Surface maps will usually look better the more classes are applied (Figure 5.6, left);
using percentiles to define a more limited number of classes may sometimes facilitate the
detection of data structure and thus processes in the maps. It is of course also possible to
carefully select class boundaries on the basis of the inspection of CP- and ECDF-plots as done
for symbol maps. When a limited number of fixed percentiles are used, care must be taken
not to use only one class for the lowermost 50 per cent, or even 75 per cent, of the data, and
then to use all the remaining classes for the high end of the distribution. Such an approach will
make it impossible to discern the data structure, though such a map might be attractive to some

[Figure 5.6 legends: As percentile scales from 0.27 mg/kg (0th percentile) to 4.42 mg/kg (100th percentile); the right map uses classes at the 5th (0.37), 25th (0.59), 50th (0.74), 75th (1.07), 90th (1.61) and 95th (1.95) percentiles; scale bars 0–100 km]

Figure 5.6 Smoothed surface maps of the variable As in the Kola C-horizon; left: constructed using
a “continuous grey scale” (100 percentiles), right: the As element distribution map constructed using
seven selected percentile-related classes

mineral prospectors, or with data that have a very large proportion of “less than detection limit”
values.
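The sketch below indicates how an interpolated surface with a near-continuous percentile grey scale can be produced in R; the use of the akima package and the variable names are assumptions, and duplicated percentile values are removed before plotting.

library(akima)                                      # interpolation onto a regular grid
ip <- interp(XCOO, YCOO, log10(As), duplicate = "mean")
brk <- unique(quantile(ip$z, probs = seq(0, 1, by = 0.01), na.rm = TRUE))
image(ip, breaks = brk, col = grey(seq(0.9, 0, length.out = length(brk) - 1)), asp = 1)
# For a map with only a few percentile-related classes, replace seq(0, 1, by = 0.01) by,
# e.g., c(0, 0.05, 0.25, 0.50, 0.75, 0.90, 0.95, 1).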

5.7 Surface maps constructed with kriging

In general, kriging is a more informative approach for generating surface maps from point
source data than smoothing (Section 5.6) because it will deliver a statistically optimised esti-
mator for each point on the grid selected for the interpolation (see, e.g., Cressie, 1993). As an
additional advantage, kriging will provide an approximation of the prediction quality at each
point within the grid. It is possible to either estimate the values for the blocks defined by the
intersections of the grid lines (block kriging), or for any location in space (point kriging). The
basic idea behind spatial analysis and kriging is that because samples taken at neighbouring
sites have a closer relationship to each other than to samples collected at more distant sites, the
closer samples should have greater influence on the interpolated estimate. The question then
becomes, how to set the weights proportional to distance.

5.7.1 Construction of the (semi)variogram


The basis of kriging is the (semi)variogram (see, e.g., Cressie, 1993), which visualises the
similarity (variance) between data points (y-axis) at defined distances (x-axis). The distance
plotted along the x-axis in the (semi)variogram is the Euclidean distance (distance of the direct
connection) between two points in the survey area. Usually about 30 to 50 per cent of the
maximum distance in a survey area is used as a maximum for the x-axis of the (semi)variogram.
One reason for this choice is that the variability of the difference in the element concentrations
of very distant samples can be expected to be high (approaching the total variability of the

Figure 5.7 Visualisation of the basic information for the computation of the semivariogram for the
variable As in the Kola Project C-horizon

element in the survey area). For the Kola Project area the maximum distance is about 700 km.
For plotting the (semi)variograms, a maximum distance of 300 km was chosen (43 per cent)
when preparing the Kola Atlas (Reimann et al., 1998a) and is thus used here, although a shorter
distance, e.g., 200 km, would in many cases be more appropriate.
The basis for the calculation of the (semi)variogram is the semivariogram cloud shown in
Figure 5.7 (upper left). The semivariogram cloud displays the squared difference of the element
concentrations between all possible pairs of data points. The pairs are plotted according to
their separation distance along the x-axis. The y-axis records the semivariance, which is half
of the value of the squared differences between the values of the pairs. The semivariance is
used because it allows an easier visualisation of the total variance in the semivariogram. The
values along the x-axis can then be summarised in distance classes. Such a summary is shown
using boxplots in Figure 5.7 upper right. The MEDIANS displayed in the boxes increase
with distance. The plot shows many outliers, i.e. pairs of data points that have an unusually
large variability. Instead of summarising in boxplots, it is also possible to smooth the data
along the x-axis. Figure 5.7 lower left shows the smoothed line for the semivariogram cloud
in Figure 5.7 upper left. Again the increase of the semivariance with increasing distance is
visible. At a certain distance the smoothed line flattens out, i.e. the variance becomes constant.

From this distance on, the concentrations will no longer show a spatial dependency. For small
distances the semivariance is small and thus the spatial dependency between the points is
high.
Progressing from the very noisy semivariogram cloud to the boxplot presentations, the trend
is becoming smoother and smoother. However, for finally constructing a semivariogram, an
even smoother appearance is desirable. For this it is necessary to leave the semivariogram
cloud and to no longer look at the pairs of all data points but to summarise more points using
distance intervals. The squared differences of the element concentrations for all points within
the distance interval (20 km) are averaged. This procedure results in the few points shown in
Figure 5.7 lower right, which are then the base for fitting the semivariogram model.
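In formula form (the standard estimator, stated here for reference; the notation is not taken from the text), the experimental semivariogram value for a distance class $h$ containing $N(h)$ sample pairs is
$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \bigl(z(x_i) - z(x_i + h)\bigr)^2 ,$$
where $z(x_i)$ denotes the concentration measured at location $x_i$.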
In the above example distance intervals in any direction were considered (omnidirectional).
In practice, it may also be interesting to study the semivariance in certain directions because the
spatial dependency could change with direction. Thus for a two-dimensional grid, the distances
are calculated in a certain direction from each data point, e.g., north–south or east–west. If the
grid is irregular, a tolerance angle for the defined direction is needed to find enough points in
the selected direction. If a small angle is chosen only a few points may fall into the segment at
small distances. This will cause a noisy semivariogram. If the angle is chosen to be very large,
directional differences will be averaged and thus no longer be visible in the semivariogram.
An often-used default angle to define a segment is π/8 radians (22.5 degrees). For calculating
the semivariogram function for a segment, the average squared differences of the element
concentrations for all sample pairs within that angular relationship from one another are used.
Figure 5.8 left shows the semivariances for the variable As in the Kola C-horizon data for
four different directions in addition to the omnidirectional semivariogram. Although all curves
start at about the same point, they flatten out at different distances and show a different total
variance. For kriging a single model has to be fitted to the semivariances. In the example plot
(Figure 5.8, left) the omnidirectional semivariogram may be the best compromise for fitting
the data (Figure 5.8, right).

[Figure 5.8 panels: semivariance versus distance (0–300 km) for As in C-horizon; directional curves (omnidirectional, N–S, E–W, NW–SE, NE–SW) on the left, fitted model on the right with range = 187 km, sill = 0.7 and nugget variance = 0.59]

Figure 5.8 Directional semivariograms (left) for the variable As in the Kola Project C-horizon samples
and spherical semivariogram model for the omnidirectional semivariogram (right)

For fitting a model to the semivariogram there are different options. The curve can be
approximated by a spherical model (most common case) as shown in Figure 5.8 (right). Linear,
exponential, Gaussian, and several other models can also be used to fit the semivariogram
function. The final fitting should be based on a visual inspection of the semivariogram.
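With the gstat package, the empirical semivariogram and a fitted model can be obtained along the following lines; the data frame and column names are assumptions, coordinates are taken to be in km, and the starting values correspond roughly to the model shown in Figure 5.8 (right). This is a sketch, not the code used to produce the figures.

library(sp); library(gstat)
d <- kola.Chorizon                        # assumed data frame with columns XCOO, YCOO, As
coordinates(d) <- ~ XCOO + YCOO           # promote to a spatial object
v <- variogram(log10(As) ~ 1, d, cutoff = 300, width = 20)    # 20 km lags up to 300 km
# (cloud = TRUE returns the semivariogram cloud; alpha = c(0, 45, 90, 135) gives directional curves)
vm <- fit.variogram(v, vgm(psill = 0.7, model = "Sph", range = 187, nugget = 0.59))
plot(v, vm)                               # visual check of the fitted spherical model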

5.7.2 Quality criteria for semivariograms


There are a number of central terms used in describing a semivariogram. The range (187 km)
defines the distance where the samples can be considered as independent from each other.
This is the distance to the point where the semivariogram model flattens out in x-direction
(Figure 5.8 right). This is also the point where the model reaches the total variance of 1.2
(in y-direction). The range relative to the maximum distance is an indication of the data
quality. A very short range is an indication that the area is undersampled or that the data
quality is very poor. The prediction quality improves with an increasing range because the
dependency between the samples increases, and thus a larger number of samples are used in the
interpolation.
The nugget variance (often named the nugget effect) is the variance at the distance zero.
In theory it should be zero. However, even for two samples taken at very close distance there
will always be several causes for a certain variation, e.g., the sampling error and the analytical
error, but also that no two absolutely identical samples can in practice be taken. The name
“nugget effect” originates from the assays in gold deposits, where a very high variation between
neighbouring samples can be caused by the existence of single gold nuggets. If the range is
zero, the semivariogram consists only of the nugget effect, indicating completely independent
samples, and no sensible spatial prediction for generating a surface map is possible. In applied
geochemistry this is an indication that the sampling density was too low (sparse) or that the
analytical quality was very poor.
The sill (Figure 5.8 right) is the difference between total variance and nugget variance.
The proportion of the sill to the total variance should be large because this will indicate a
well-defined spatial dependency, leading to a high prediction quality.

5.7.3 Mapping based on the semivariogram (kriging)


The fitted semivariogram model can now be used for kriging, i.e. to either estimate the values
at the intersections of the grid lines, or any other point (point kriging) or for each block
defined by the grid (block kriging). Based on these estimated values, a smoothed surface map
is finally constructed. The variogram determines the weight of neighbouring samples used for
the prediction. More distant observations will in general obtain less and less weight out to the
range. For each element an individual function, the semivariogram, is estimated. All observed
values, weighted according to the semivariogram, are then used for the prediction. This is a
major advantage over other methods (e.g., moving medians), which consider only samples in
a defined search radius.
For constructing isoline maps point kriging is used, and the gridded point estimates are
passed to a contouring package. For the Kola Project data block kriging was used, permitting
the direct construction of a surface map. Figure 5.9 shows two kriged maps for As in the Kola
Project C-horizon samples. Depending on the block size chosen for kriging, the blocks will
be visible (20 × 20 km blocks, Figure 5.9, left), or the map display becomes smoother with
decreasing block size (e.g., 5 × 5 km blocks, Figure 5.9, right).

Figure 5.9 Block kriging used to construct surface maps of the distribution of As in the Kola Project
C-horizon. Left map, 20 × 20 km blocks; right map, 5 × 5 km blocks. The semivariogram in Figure 5.8,
right, is used for kriging
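Continuing the semivariogram sketch from Section 5.7.1, the kriging step itself could look as follows with gstat; the 5 km grid spacing is an assumed value, and adding a block = c(20, 20) argument to krige() would request block kriging instead of point kriging.

bb <- bbox(d)                                            # extent of the sample coordinates
grd <- expand.grid(XCOO = seq(bb[1, 1], bb[1, 2], by = 5),
                   YCOO = seq(bb[2, 1], bb[2, 2], by = 5))
coordinates(grd) <- ~ XCOO + YCOO
gridded(grd) <- TRUE
kr <- krige(log10(As) ~ 1, d, newdata = grd, model = vm)   # ordinary kriging
spplot(kr["var1.pred"])                                  # kriged surface
spplot(kr["var1.var"])                                   # kriging variance (prediction quality)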

The major advantage of this approach over all other available interpolation techniques
is that the semivariogram permits judgement of the data quality, or better, their suitability
for constructing a colour surface map. Via the estimated kriging variance, the reliabil-
ity of the obtained data estimate for each cell can be assessed. In terms of temporal
monitoring, it thus becomes possible to give a statistical estimate of how likely it is that
observed changes in element concentrations over time are significant. The kriging variance
can also be mapped. If regions within the survey area show an unusually high variance, they
would need to be sampled at a higher density if an improved map was required. However,
highly localised geochemical processes in an area could also be marked by an unusual local
variance. Although a combination of the geochemical map with a map of the kriging vari-
ance can clearly guide interpretation, many projects will not support such an intensive
activity.
Class selection and scaling are again important considerations when constructing kriged
maps. The data structure should become visible and thus the classes should have a direct
relation to the data structure. As discussed above (Sections 5.5.2 and 5.5.3), this is best
achieved via boxplot or percentile classes. The approach of “avoiding” classes via us-
ing a more or less continuous grey scale (see Section 5.6 and Figure 5.6, left) is also a
possibility.

5.7.4 Possible problems with semivariogram estimation and kriging


As mentioned above (Section 5.7.2), it is possible that the nugget effect is almost as large as the
total variance. Figure 5.10 shows this effect for Hg measured in the Kola O-horizon samples.
Only completely independent samples would allow for a pure (linear) nugget effect model.
In regional geochemistry it is very unlikely that the samples are completely independent. The
variogram thus suggests that the sample density was not high enough to detect the regional
data structure of Hg. It is of course also possible that such a result is caused by an insufficient
analytical quality (see Chapter 18) or that the sample material O-horizon is not suited to map

Figure 5.10 Semivariogram (left) and resulting kriged map (right) for Hg in Kola Project O-horizon
soils

the regional distribution of Hg. To be able to map, a spherical model with a small range was
fitted to the semivariogram. The resulting map (Figure 5.10, right) is very noisy. A map for
a variable with a very high nugget effect may even become less informative than a properly
prepared point source map.
A common assumption for constructing a semivariogram is that the variance in the se-
lected direction depends only on the distance between two data points but is independent
of the location of the data points. The variance at a fixed distance is thus assumed to be
the same throughout the survey area. This is not the case if there is a systematic direc-
tional trend of increasing or decreasing values in the data (see Figure 6.7 for some exam-
ples). The Kola data show such trends for some selected elements in some materials (moss
and O-horizon) with distance from coast. This would theoretically call for the use of more
advanced kriging methods (e.g., universal kriging, see Cressie, 1993). However, the stan-
dard methods can often still be successfully applied to the data. To visualise the existence of
such trends, it can be advantageous to show the directional and not only the omnidirectional
semivariograms.
Figure 5.11 demonstrates this using the variable Pb measured in Kola O-horizon soils.
The Pb concentrations show a strong north–south trend, as demonstrated in the directional
semivariogram. In the north–south direction, independence of the samples is not reached within
the survey area (Figure 5.11, upper left). In contrast, in the east–west direction the spatial
dependency is rather low and the nugget effect dominates (Figure 5.11, upper left). However,
it is possible to construct a kriged map based on the omnidirectional “compromise” (Figure
5.11, upper right) or on the east–west-model alone (Figure 5.11, lower left). The common
choice would probably be the omnidirectional compromise. For comparison the smoothed
surface map (Section 5.6) is shown in Figure 5.11 lower right. The maps suggest that the
omnidirectional compromise results in excessive smoothing, hiding important details of the
spatial variation of Pb.

Figure 5.11 Directional (east–west and north–south) and omnidirectional semivariograms for the
variable Pb measured in Kola Project O-horizon soils (upper left). A model was fitted to the east–west
and to the omnidirectional semivariogram (upper left). Kriged maps based on the omnidirectional model
(upper right), based on the east–west model (lower left) and a smoothed surface map (lower right) are
shown

5.8 Colour maps

What is the difference between colour and black and white maps? At first thought, there is
no real difference other than that most people will be attracted by colour and thus feel that
the map is “easier” to read. As a first approach, different black and white symbols may be
replaced by just one symbol that is displayed in different colours (e.g., a filled circle or a filled
square). Point source colour maps will result. However, surface maps are the true realm of
colour mapping.

Colour is usually employed to make a map visually attractive. Because there are many
different colours that can be easily distinguished, it is much less demanding to produce maps
with a lot of information in colour than in black and white. However, colour can both help and
hurt a map, and the most effective use of colour is not easy. Many colour maps look merely
pretty but are not particularly informative. In geochemistry black and white maps may be more
readily and reliably decoded than colour maps. It is often beneficial to investigate simple black
and white mapping as the first step of spatial investigation.
One of the first problems with using colour in geochemical maps is that the colours have
to be sorted somehow from low to high. There exists no logical, widely accepted colour scale.
If ten different people are given a stack of colour cards and asked to sort them from “low” to
“high” the chance is good that ten different scales will emerge. Some may order from green
to red, some from blue to red, again others may attempt to order according to the rainbow
sequence of colours, and some may use similar scales but in different directions.
In geochemistry a scale from deep blue for low to red and purple for high values is frequently
used (the “rainbow” scale), similar to a global topographic map spanning from ocean depths
to mountain peaks. Often variations arise in the “middle” with various choices for the use of
yellows and greens. When looking through a selection of geochemical atlases, it is obvious that
different colour scales have been employed; this is an unfortunate situation for the untrained
reader. Preparing maps of the same data first with the full rainbow scale ordered from blue
to red, and then with an inverse scale where the values are ordered from red (low) to blue
(high), a scale that is also sometimes used for pH measurements, demonstrates some of the
dangers that lie in colour mapping.
A simple, consistent grey scale, from light grey for low to black for high values may actually
be far better suited to displaying concentration differences in a contoured surface map than
a more pleasant colour scale (see Figure 5.6). Sometimes simple colour scales are used, e.g.,
from light yellow over light and dark brown to black which provides a consistent, logical
and readily comprehended ordering from light to dark or low to high. Such a map may be a
compromise between a “dull” black and white or grey-scale map and a “pretty” full colour
map. A further advantage of a grey-scale map is that it will have the same appearance to those
with colour blindness as to the rest of the population.
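A few such scales can be generated directly in R, as sketched below; the number of classes and the colour end points are purely illustrative.

n <- 7                                                       # number of classes
grey.scale  <- grey(seq(0.9, 0, length.out = n))             # light grey (low) to black (high)
brown.scale <- colorRampPalette(c("lightyellow", "brown", "black"))(n)
blue.red    <- rev(rainbow(n, start = 0, end = 0.7))         # blue (low) to red (high)
# Any of these colour vectors can be passed via col = to image(), filled.contour() or spplot().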
Dark red and dark blue on a colour map, a scheme often used in global topographic maps,
will look very similar when copied in black and white. The information in a colour map is
thus easily lost when copying or faxing a map. Colours are also bound to emotions, and colour
significance may vary culturally. Colours can thus be used to manipulate the reader and may
not be the optimal choice in an international world.
The colour red is a good case in point: in western cultures it automatically gets more
attention than all other colours, being a signal for warning or danger. However, in eastern
cultures red signals happiness. Thus colour maps are not universally objective and attempts
to draw most attention to the high values (which are usually plotted in red) may not always
succeed with an international audience.
If a rainbow scale is used for mapping, it is also very important where, in relation to the
data set, the main colour changes occur, and class selection becomes even more important than
in black and white mapping. Colours may increase the visibility of data structure in a map –
or destroy it completely. Usually a colour map based on the rainbow scale and with the major
colour changes carefully spread over the full range of percentile classes will provide the most
informative impression. When studying such maps the reader should at least always check
where in relation to the data the break from yellow to red occurs.

The “no classes” approach can also be transferred to colour mapping because so many
colours exist that virtually any single analytical value can have its own colour in a map (see,
e.g., Lahermo et al., 1996). This has been demonstrated above using a continuous grey scale
(Figure 5.6, left) and is shown for the rainbow scale in Figure 5.12 (lower left).
In a GIS environment the use of colour on maps is especially tempting. In that case the
alert reader has to watch for inadvertent camouflage. In some cases poor contrast between
the background and geochemical colours can result in severe difficulties in extracting useful
information from a pretty coloured map.

5.9 Some common mistakes in geochemical mapping

5.9.1 Map scale


As discussed in Section 5.2, one of the most widespread mistakes is to scale a geochemical
map according to some existing base map or to some nice-looking scale (e.g., 1:1 000 000).
A geochemical map needs to be scaled such that the inherent geochemical information
is optimally displayed. The geochemistry and not the base map information should govern
the map scale. Usually, the smallest possible map will be best suited for geochemical
interpretation.

5.9.2 Base map


A common mistake discussed in Section 5.3 is to use an existing topographical or geological
map that was already designed (and scaled) for a different purpose. This will lead to “cluttered” and hard
to read geochemical maps (see Figure 5.1). It may also lead to misinterpretations in instances
where the chosen background information is not the main factor determining the geochemical
distribution.

5.9.3 Symbol set


Two (three with proportional dots) “standard symbol sets” (EDA and GSC) have been intro-
duced above (Section 5.5.1, Figures 5.3 and 5.4). What can be read from a black and white
map depends crucially on the symbol set and unfortunately there are about as many different
symbol sets as there are geochemists. Figure 5.13 demonstrates how different maps can look
when the symbol set is changed and how information can actually be lost (or be manipulated)
by the choice of the symbol set (see also Figure 5.4 for different impressions gained from the
EDA and GSC symbol sets). A symbol set needs to include a certain easy to grasp inner order
as provided by both GSC and EDA symbol sets. Here readability of the map for different users
will in the end just be a question of personal taste and training. If there is no “inner order” in the
symbol set, it will be close to impossible to gain an impression of the regional data structure
in a map.

5.9.4 Scaling of symbol size


The size of the symbols in relation to the size of the map will have an important influence
on the appearance of the map (Figure 5.14). Even the relative proportions of the size of the

Figure 5.12 Colour smoothed surface maps for As in Kola C-horizon soils. Different colour scales are
used. Upper left, percentile scale using rainbow colours; upper right, inverse rainbow colours; middle
left, terrain colours; middle right, topographic colours; lower left, continuous percentile scale with
rainbow colours; and lower right, truncated percentile scale. A colour reproduction of this figure can
be seen in the colour section, positioned towards the centre of the book

[Figure 5.13 legends: percentile classes (0, 5, 25, 50, 75, 95, 98, 100) for As with boundaries at 0.05, 0.10, 0.30, 0.50, 1.17, 4.68, 8.69 and 30.7 mg/kg; scale bars 0–100 km]

Figure 5.13 Two maps showing the distribution of As in C-horizon soils of the Kola Project area.
EDA symbol set extended to seven classes to accommodate mapping with percentiles (left map) and
arbitrary, non-ordered symbol set (right map)

symbols will influence the appearance, and in turn the interpretability, of the map (Figure 5.14).
The EDA symbol set was carefully designed so that each symbol will have the same optical
weight in the resulting map. When using the EDA symbol set, great care must be taken that
the symbols do not get too small (Figure 5.14, upper right) or too large (Figure 5.14, lower
left) in relation to the map scale. It is also very important that the dot representing the inner 50
per cent of the data does not get too much weight in the map (Figure 5.14, lower right).
Because of the many problems with different symbol sets and because of the difficulties with
producing good-looking proportional dot maps when the dot size function (Section 5.4) is not
applied, many practitioners tend to combine proportional dots with percentile classes (Figure
5.15, right). When using the exponential dot size function (see Section 5.4), the high values
are accentuated, and a very clear map results (Figure 5.15, left) – as long as the main interest
is really the high values. It is in general a good idea to map using percentile classes. However,
when combining proportional dots with percentiles classes, the visual impression of the data
structure as well as the accentuation of the high values is lost, and a rather uninformative map
may result (Figure 5.15, right). The reason is that the eye is not able to distinguish between
dots of almost the same size if they do not occur right beside one another.

5.9.5 Class selection


Often geochemists use “tidy” but arbitrarily chosen class intervals, e.g., 0.5, 1, 1.5, . . . ; or 5,
10, 20, 30, . . . ; or 10, 20, 50, 100, 200, . . . mg/kg. This approach looks orderly in the legend,
but the legend itself is not the most important part of a geochemical map. Such arbitrarily
picked classes bear no relation to the spatial or statistical structure of the mapped data. Such
classes often result in uninformative, “dull” geochemical maps that fail to reveal the underlying
geochemical processes causing regional geochemical differences. Arbitrary classes should
therefore be avoided. As a historical note, arbitrary classes were extensively used when optical




Figure 5.14 Four maps showing the distribution of As in C-horizon soils of the Kola Project area.
Percentile classes and the EDA symbol set are used for mapping. To get a clear visual impression of the
data distribution, the size of the symbols needs to be adjusted to the scale of the map

spectroscopy was employed as a geochemical analysis tool and the plates or filmstrips were
read by eye with a comparator. This tended to lead to clustering of the reported values around the
standards used for comparison. Thus the class boundaries were set at the mid-points between
the standards, often on a logarithmic scale, e.g., with standards at 1, 5 and 10 units, class
boundaries were set at 1.5, 3 and 7.
Another approach that often fails to reveal meaningful patterns is to base the classes on the
mean plus/minus multiples of the standard deviation. This approach is based on the assump-
tion that the data follow a (log)normal distribution. In this case – and only in this case – the
approach provides an easy method of identifying the uppermost 2.5 per cent of the data as
“extreme values”. However, as has long been recognised, most applied geochemical survey

Figure 5.15 Maps showing the distribution of As in C-horizon soils of the Kola Project area. Propor-
tional dots according to the exponential dot size function (left) in direct comparison to growing dots
scaled according to percentile classes (right)

measurements do not follow a lognormal distribution. Some practitioners choose to assume
that geochemical data follow a lognormal distribution because this facilitates the use of classi-
cal statistics by taking the logarithms of the univariate values (Reimann and Filzmoser, 2000d).
Regrettably, modern (often “robust”) statistical procedures, that are better suited for geochem-
ical data, are often omitted from basic statistics courses, and this may explain why the “mean
plus standard deviation” approach is still so popular in geochemistry (Reimann et al., 2005c).
The problem is that for data (or their logarithms) that are not normally distributed, but reflect
a whole range of different processes and thus are multimodal, the mean, and especially the
standard deviation, are not good measures of central tendency and spread (variability). For
example, all class boundaries defined by this approach are strongly influenced by the outliers.
Yet, to identify statistical outliers is often one of the reasons why the samples were collected
at the outset (Reimann et al., 2005c). Geochemists should recognise the important distinction
between the extreme values of a normal (or lognormal) distribution (detected satisfactorily by
the mean ± standard deviation approach) and outliers due to multimodal distributions (e.g.,
several “overlapping” geochemical processes), where this approach is unfounded.
Even when using percentile-based classes, many geochemists accentuate differences among
the high numbers at the expense of variations at lower concentrations (Figure 5.12, lower right).
This results in a map focussing completely on the high values and neglecting the lower end
of the distribution. In some studies, e.g., for trace element deficiencies, this is completely
inappropriate. Focussing on the uppermost end of the distribution again neglects (as with
arbitrarily chosen classes) the fact that there exists something called the “data structure”.

5.10 Summary

Environmental data are characterised by their spatial nature. Mapping should thus be an integral
and early part of any data analysis. Once the statistical data distribution is documented (Chapters
3 and 4) the task is to provide a graphical impression of the “spatial data structure”. There
are many different procedures for the production of maps, but only a few provide a graphical
impression of the spatial data structure. As a first step the focus should not be on the high
values alone but rather on an objective map of the complete data distribution of all studied
variables. Classes need to be carefully spread over the whole data range, the log-boxplot or
percentiles are well suited for class selection. They need to be combined with a suitable symbol
set (e.g., EDA symbols).
If only “highs” and “lows” are of interest, proportional dot maps provide nice, and apparently
“easy”, maps. Just as the statistical data distribution should be documented in a series of
summary plots, the spatial data distribution needs to be documented in a series of maps of all
variables before more advanced data analysis techniques are applied. Black and white maps
are best suited for this purpose. Map scale and a north arrow should always accompany a map.
An easily legible legend is a further requirement.
More advanced mapping techniques will be used in a later stage, when surface, contour or
isopleth maps are designed for “presentation”. For this purpose point data need to be inter-
polated to surface data. Surface maps can be highly manipulative; the choice of parameters
and the choice of colour are important considerations. When kriging is used for constructing
surface maps a solid understanding of the method is required. Maps prepared by smoothing
are not based on statistical assumptions in the same way as kriged maps. If the additional
information from kriging (semivariogram, kriging variance) is not required, smoothing maps
may be the better choice.
6 Further Graphics for Exploratory Data Analysis

6.1 Scatterplots (xy-plots)

Once there is more than one variable, it is important to study the relationships between the
variables. The most frequently used diagram for this purpose is a two-dimensional scatterplot,
the xy-plot, where two variables are selected and plotted against one another, one along the
x-axis, the other one along the y-axis. Scatterplots can be used in two different ways. Firstly,
in a truly “exploratory” data analysis approach, variables can be plotted against one another
to identify any unusual structures in the data and to then try to determine the process(es) that
cause(s) these features. Secondly, the more “scientific” (and probably more frequently used)
way of using scatterplots is to have a hypothesis that a certain variable will influence the
behaviour of another variable and to demonstrate this in a plot of variable x versus variable y.
When working with compositional data the problem of data closure needs to be considered. It
will often be a serious issue whether simple scatterplots show the true relationships between
the variables (consult Section 10.5).
Both methods have their merits; the exploratory approach facilitates the identification of
unexpected data behaviour. In pre-computer times this approach was not widely used as it
was time-consuming with the risk of time and resources expended with no reward. Even with
computers many data analysis procedures require considerable effort to draw such displays.
The process must be simple and fast so that it becomes possible “to play with the data” and to
follow up ideas and hypotheses immediately without any tedious editing of data files before
the next plots are displayed.
Figure 6.1 shows a scatterplot for Cu versus Ni in the moss samples from the Kola Project.
These two elements are of special interest because approximately 2000 tons of Ni and 1000
tons of Cu are emitted to the atmosphere annually by the Russian nickel industry, situated in
Nikel, Zapoljarnij, and Monchegorsk (see Figure 1.1 for locations).
Figure 6.1 reveals that a positive, sympathetic relationship exists between Cu and Ni.
However, there are a substantial number of samples with high analytical results for both Cu
and Ni that mask the behaviour of the main body of data. Plotting the same diagram with
logarithmic axes (Figure 6.1, right) decreases the influence of the high values and increases
the visibility of the main body of data. It can now also be seen that the spread (variance) of

Figure 6.1 Scatterplot of copper (Cu) versus nickel (Ni) in moss samples from the Kola Project. Left
plot: original data scale; right plot: log-scale. In the right plot two unusual samples were identified

the data is relatively homogeneous over the whole range of the data. It is also apparent that Cu
and Ni are strongly related over the whole range of data. In addition, attention is drawn to a
number of unusual samples in both diagrams that deviate from the general trend displayed by
the majority of samples. It should be possible to identify these individuals by just “clicking”
on them on the screen. Samples like numbers 36 and 37 (Figure 6.1, right) should undergo an
additional “quality check” to ensure that they are truly different and that the discordant data
are not due to a sample mix-up or poor analytical quality.
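In R such a plot can be produced with plot(), and identify() allows unusual points to be labelled interactively by clicking on them. The following minimal sketch uses simulated data in a hypothetical data frame moss with columns Cu and Ni; it illustrates the technique only and is not the actual Kola Project code.

    ## Simulated stand-in for the Kola Moss data (hypothetical values)
    set.seed(1)
    moss <- data.frame(Cu = exp(rnorm(600, log(10), 1)),
                       Ni = exp(rnorm(600, log(8), 1)))

    ## Scatterplot on the original scale
    plot(moss$Cu, moss$Ni, xlab = "Cu in moss [mg/kg]", ylab = "Ni in moss [mg/kg]")

    ## The same plot with logarithmic axes reduces the dominance of the high values
    plot(moss$Cu, moss$Ni, log = "xy",
         xlab = "Cu in moss [mg/kg]", ylab = "Ni in moss [mg/kg]")

    ## Label unusual samples interactively by clicking on them (interactive session only)
    ## identify(moss$Cu, moss$Ni, labels = rownames(moss))
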

6.1.1 Scatterplots with user-defined lines or fields


It is often desirable to include a line or one or several fields in a scatterplot. The simplest
case is probably a line indicating a certain proportion between the data plotted along the two
axes. One example would be two variables where the same element was analysed with two
different analytical techniques. Here the results might be expected to follow a strict one to one
relationship. Plotting a 1:1 line onto the diagram allows for the easy detection of any deviations
from the expected relationship, and to see at a glance whether the results obtained with one of
the methods are higher than expected.
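With R's abline() a 1:1 line is added in one step; because abline() works in the (possibly log-transformed) user coordinates, abline(0, 1) also gives the correct y = x line when both axes are logarithmic. The data below are simulated duplicate determinations, and the object names are illustrative only.

    ## Simulated duplicate determinations of one element by two methods
    set.seed(2)
    method1 <- exp(rnorm(300, log(30), 0.8))
    method2 <- method1 * runif(300, 1.1, 1.6)   # second method reads systematically higher

    plot(method1, method2, log = "xy",
         xlab = "Method 1 [mg/kg]", ylab = "Method 2 [mg/kg]")
    abline(0, 1, lty = 2)   # 1:1 line; points above it indicate higher method-2 results
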
Figure 6.2 shows two such diagrams. In the left plot La was determined with two different
analytical techniques. It takes but a glance to notice that although the analyses are similar, by
and large, the analytical results obtained with instrumental neutron activation analysis (INAA)
are clearly higher than those from an acid extraction. Only INAA provides truly total La-
concentrations in the samples and thus this result is no surprise. In the right hand plot, results
for Ag in two different soil horizons are depicted (Figure 6.2). The analytical techniques are
comparable (both strong acid extractions), but the plot indicates that Ag is highly enriched in
the O-horizon and does not show any relation to the C-horizon results. This is surprising; many
applied geochemists or environmental scientists would argue that the C-horizon, the mineral
or parent soil material, will provide the “geochemical background values” (see Chapter 7) for

Figure 6.2 Scatterplots with 1:1 line added. Left plot: La determined with two different analytical
techniques; right plot: Ag determined with comparable techniques in two different sample materials.
Log-scales used for both graphics

the upper soil horizon. The diagram indicates that the relationship is not that simple. One could
be tempted to argue that then all the Ag in the O-horizon must be of anthropogenic origin.
There is, however, no likely major Ag-emission source in either the survey area or anywhere
near. Thus the conclusion must be that there exist one or more processes that can lead to a high
enrichment of Ag in the organic layer – independent of the observed concentrations in the soil
parent material.
Petrologists frequently use xy-plots where they mark predefined fields to discriminate
between different rock types. To include such “auxiliary” background information in a
scatterplot is desirable and feasible, but the actual definition of the field limits in the di-
agrams is a highly specialised task. One software package, based on R, that can plot
these lines and fields for many pre-defined plots used by petrologists is freely available at
http://www.gla.ac.uk/gcdkit/ (Janousek et al., 2006).

6.2 Linear regression lines

The discussion of Figure 6.2 demonstrated that it is informative to study dependencies between
two or more variables. In Figure 6.2 a 1:1 relation was a reasonable starting hypothesis and
thus the line could be drawn without any further considerations. However, many variables may
be closely related but do not follow a 1:1 relation (e.g., Cu and Ni as displayed in Figure 6.1).
In such cases it might aid interpretation if it were possible to visualise this relationship. In
most cases the first step will be to look for linear rather than non-linear relationships. To study
linear relationships between two or more variables, linear regression is a widely used statistical
method (see also Chapter 16). The best-known approach for fitting a line to two variables is the
least squares procedure. To fit a line, it is important to take a conscious decision which variable
is the y-variable, i.e. the variable to be predicted. The least squares procedure determines the
line where the sum of squared vertical distances from each y-value to this line is minimised.

Figure 6.3 Two scatterplots with least squares linear regression lines. Left plot: Cu versus Ni, Kola
Project Moss data set with log-scale as used in Figures 6.1 and 6.2; right plot: Sr versus Ca (log-scale),
Kola Project C-horizon, Russian samples

Thus the fitted line depends on which variable is chosen as the y-variable: regressing y on x
will generally give a different line than regressing x on y. This situation can be avoided by
using less common procedures for estimating the linear trend,
e.g., total least squares or orthogonal least squares regression (Van Huffel and Vandewalle,
1991) which minimise a function of the orthogonal distances of each data point to the line.
Figure 6.3 (left) shows the relationship, with a least squares regression line, between Cu and Ni in moss (Figure
6.1). When plotting a regression line onto the diagram using the log-scale it is visible that the
relation is not really linear but slightly curvilinear. However, a straight line is quite a good
approximation of the relationship between the two variables. The other diagram in Figure 6.3
shows the relation between Ca and Sr for the C-horizon samples collected in Russia (log-scale).
A number of samples with unusually high Sr-concentrations disturb the linear trend visible for
the main body of samples. These samples exert a strong influence (leverage) on the regression
line, such that it is drawn towards the relatively few high Sr values. This example demonstrates
that the least squares regression is sensitive to unusual data points – in the example the line
follows neither the main body of data nor the trend indicated for the outliers (Figure 6.3, right).
It is thus necessary to be able to fit a regression line that is robust against a certain number
of outliers to the data. This can be achieved by down-weighting data points that are far away
from the linear trend as indicated by the main body of data. The resulting line is called a robust
regression line.
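In R a least squares line is obtained with lm(), and a robust alternative can be fitted, for example, with rlm() from the MASS package (lmrob() from the package robustbase is another option). The sketch below, with simulated data and illustrative names, fits both lines to the log-transformed values; because the coefficients then refer to log10 units, they can be passed directly to abline() on a plot with logarithmic axes.

    library(MASS)   # provides rlm() for robust regression (M-estimation)

    ## Simulated data: a linear trend in log-space plus a group of outliers
    set.seed(3)
    x <- exp(rnorm(400, log(20), 0.7))
    y <- exp(0.9 * log(x) + rnorm(400, 0, 0.3))
    y[1:20] <- y[1:20] * 20                       # outliers that attract the LS line

    plot(x, y, log = "xy", xlab = "x [mg/kg]", ylab = "y [mg/kg]")

    fit_ls  <- lm(log10(y) ~ log10(x))            # classical least squares fit
    fit_rob <- rlm(log10(y) ~ log10(x))           # robust fit, down-weights outliers

    abline(fit_ls,  lty = 2)                      # coefficients are in log10 units,
    abline(fit_rob, lty = 1)                      # matching the log10 user coordinates
    legend("topleft", legend = c("LS line", "Robust line"), lty = c(2, 1))
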
Figure 6.4 shows the two plots from Figure 6.3 with both the least squares and a robust
regression line. For the left hand plot, Cu versus Ni in moss, both methods deliver almost
the same regression line, the difference in the right hand plot (Ca versus Sr in the C-horizon
of the Russian samples) is striking. The robust regression line follows the main body of data
and is practically undisturbed by the outlying values. For environmental (geochemical) data
it is advisable to always use a robust regression line, just like MEDIAN and MAD are better
measures of central value and spread than MEAN and SD.
Regression analysis can be used not only to visualise relationships between two variables but
also to predict one variable based on the values of one or more other variables (see Chapter 16).

Figure 6.4 Two scatterplots with least squares (LS) and robust regression lines. Left plot: Cu versus Ni,
Kola Project Moss data set with log-scale as used in Figure 6.3 (left); right plot: Sr versus Ca (log-scale),
Kola Project C-horizon, Russian samples

Relationships between two variables do not need to follow a linear trend. In fact, any non-
linear function could be fitted to the data, or the data points could be simply connected by a line.
However, to be able to discern the trend, the function will often need to be smoothed. Many
different data-smoothing algorithms exist. Usually a window of specified width is moved
along the x-axis. In the simplest case one central value for the points within the window is
calculated (e.g., the MEAN or MEDIAN). When connected, these averages provide a smoothed
line revealing the overall data behaviour. Depending on the window width, the resulting line
will be more or less smoothed. More refined methods use locally weighted regression within
the window.
In environmental sciences the most usual application of smoothed lines will be in studying
time or spatial trends as described below (Sections 6.3, 6.4).

6.3 Time trends

In environmental sciences “monitoring” is often an important task; it is necessary to determine
whether analytical results change with time. The large test data set used here comes from
the regional part of the Kola Project. It permits the study of the distribution of the measured
elements in space via mapping (Chapter 5) but not to study changes in element concentration
over time. It would become possible if the regional mapping exercise was repeated at certain
time intervals. This would be a very expensive undertaking, and it would be more effective to
collect the required data in a monitoring exercise at selected locations more frequently than to
repeatedly re-map the whole area.
For interpreting the regional Kola data set, more detailed information on local variability
versus regional variability and on changes of element concentrations with the season of the year
was needed. These data were obtained in a catchment study, which took place one year before
the regional mapping program (see, e.g., de Caritat et al., 1996a,b; Boyd et al., 1997; Räisänen
et al., 1997; Reimann et al., 1997c). For the catchment study eight small catchments were
sampled at a density of one site per km2, and most sample materials were collected at different
times through all seasons of one year. Stream water was one of the most intensely studied
sample materials, and at some streams water samples were taken once a week throughout a
hydrological year (de Caritat et al., 1996a,b).
These data are used here to demonstrate the investigation of time trends in a special form
of the scatterplot where the x-axis is the time scale and the measured concentration of diverse
variables is plotted along the y-axis. This kind of work requires careful planning as to how the
time of sampling is recorded so that the computer software can process it. A simple trick to
overcome such problems could be to give all samples a consecutive number according to the
date of collection and select the number as the variable for plotting the time axis. If the field
sampling has not occurred at regular intervals, this procedure will lead to distortions along the
x-axis. Another procedure is to record the date as the “Julian day number”, which numbers days
consecutively from January 1, 4713 BC and is an internationally agreed-upon system employed
by astronomers. Fortunately, R is also able to handle “usual” dates in a variety of formats. To
easily discern the trends, a line should be used to link results for a particular variable. Often it is
advantageous to be able to estimate a smoothed line running through the results by a variety of
different algorithms (see Section 6.4). In the simple case of about 50 water measurements per
catchment, spread almost evenly over the year, a simple line directly connecting the results will
suffice. However, where multi-year data are acquired, some systematic periodicity, seasonal
effect, may be expected and smoothing may help identify such effects.
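A minimal sketch of such a time-trend plot is given below; as.Date() converts date strings, and R then labels the time axis with calendar dates automatically. The weekly measurements are simulated and do not reproduce the actual catchment data.

    ## Simulated weekly pH measurements over one hydrological year
    set.seed(4)
    sampling_date <- seq(as.Date("1994-10-01"), by = "week", length.out = 52)
    ph <- 6.3 + 0.3 * sin(seq(0, 2 * pi, length.out = 52)) + rnorm(52, 0, 0.05)
    ph[30:33] <- ph[30:33] - 0.8      # a temporary drop, e.g. during snow melt

    ## Date objects can be plotted directly; the x-axis is labelled with dates
    plot(sampling_date, ph, type = "b", xlab = "Date", ylab = "pH")
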
In Figure 6.5 it is demonstrated how pH and EC change in the course of one year in catchment
two, which is in close proximity to the Monchegorsk smelter. A drop in pH and conductivity
during May is obvious (Figure 6.5). This indicates the influence of the snow melt period on
the stream water samples.
If an abundance of data exists that exhibits high variability, it can also be informative to
summarise days, weeks, or months (maybe even years) in the form of boxplot comparisons (see
Section 8.4). It is then possible to study differences between the MEDIANS and differences in

Figure 6.5 Time trends of acidity (pH) and electrical conductivity (EC) in stream water collected on
a weekly base in catchment two, Monchegorsk

Figure 6.6 Right plot: east–west transect through Monchegorsk, showing the decrease of the concen-
tration of Cu in moss (y-axis, log-scale) with distance from Monchegorsk (x-axis). The map (left plot)
shows the location of the samples used to plot the transect

variation (and skewness) at the same time, and trends may be easier to discern when looking
at such a graphical summary instead of looking at all the data points and a trend line.
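Such a summary can be produced with boxplot() and a grouping factor derived from the sampling date; the sketch below again uses simulated weekly measurements rather than the catchment data.

    ## Simulated weekly electrical conductivity (EC) values grouped by month
    set.seed(5)
    sampling_date <- seq(as.Date("1994-10-01"), by = "week", length.out = 52)
    ec <- exp(rnorm(52, log(30), 0.3))
    ec[30:33] <- ec[30:33] * 0.4                 # dilution during the snow melt period

    month <- factor(format(sampling_date, "%m")) # grouping factor "01" ... "12"
    boxplot(ec ~ month, log = "y", xlab = "Month", ylab = "EC [mS/m]")
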

6.4 Spatial trends

Instead of looking at a geochemical map (see Chapter 5), it may also be informative to study
geochemical transects, another special case of the xy-plot, where the x-variable is linked to
the coordinates of the sample site locations. With the Kola Project data it is, for example,
informative to look at an east–west-transect running through Monchegorsk from the eastern
project border to the western project border. This facilitates a study of the impact of contami-
nation and the decrease of element concentrations with distance from Monchegorsk. Another
interesting study would be a north–south-transect along the western project boundary, in the
“background” area, to investigate the impact of the input of marine aerosols at the coast and of
the influence of the change of vegetation zones from north to south on element concentrations
in moss or soils.
For constructing such a transect, it is necessary to have a tool to define the samples belonging
to the transect subset. It is of great advantage if this can be done interactively directly in
the map. Such samples could then, for example, be identified as belonging to the subset
“EW Monchegorsk” and “NS Background”. Once the subset is selected, it is easy to study
the distribution of all the measured variables along such a transect via selecting the x- (for
the east–west-transect) or y-coordinate (for the north–south-transect) for the x-axis of the plot
and any other variable for the y-axis and plotting these as xy-plots. In instances where the
coordinates of the coastline or of Monchegorsk are known, new variables for the x-axis, such as
“distance from coast” and “distance from Monchegorsk”, can be defined via a simple subtraction.
If the transects are not parallel to the coordinate system, distances can be estimated by plane
geometry.

A trend line is then fitted to the data along the x-axis. Different smoothing techniques exist
for constructing such a line. The basic principle is to choose a window of a certain bandwidth,
which is moved along the x-axis and relevant statistics computed sequentially. Within the
window the points are either averaged or a weighted regression is performed to obtain an
average value. All average values are connected by a line. The smoothness of the line is mainly
controlled by the chosen bandwidth. A very simple method for smoothing is Tukey’s moving
median (Tukey, 1977). A more modern and prominent technique, which has been used here,
is based on local polynomial regression fitting and known as the “loess” method (Cleveland
et al., 1992).
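Such a trend line can be added with lowess() (or loess()); the bandwidth argument (f in lowess(), span in loess()) controls the smoothness of the curve. The sketch below illustrates the principle with simulated transect data; the variable names are hypothetical.

    ## Simulated transect: concentrations peak near a source at distance 0
    set.seed(6)
    dist_km <- runif(300, -300, 100)
    cu <- exp(log(5) + 3 * exp(-abs(dist_km) / 60) + rnorm(300, 0, 0.5))

    plot(dist_km, cu, log = "y",
         xlab = "Distance from source [km]", ylab = "Cu in moss [mg/kg]")

    ## Smooth the log-concentrations and back-transform for the logarithmic y-axis
    sm <- lowess(dist_km, log10(cu), f = 1/3)    # smaller f gives a wigglier line
    lines(sm$x, 10^sm$y, lwd = 2)
    ## runmed() on the x-sorted values would give a simple running median instead
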
In Figure 6.6 it can easily be seen that Cu values in moss increase by two orders of mag-
nitude when approaching the location of the Monchegorsk smelter. At the same time it is also
apparent that the values decrease rapidly with distance from the smelter. At about 200 km from
Monchegorsk the contamination signal “disappears” into the natural background variation. At
this distance anthropogenic contamination and natural background can no longer be separated
using direct chemical analysis. Although the example data display a clear trend even without
a smoothed line, the line provides a more informative picture of the overall trend without
distortion by local variation.
A north–south transect at the western project border in a pristine area without influence from
industry shows other processes (Figure 6.7). The steady input of marine aerosols from the coast
of the Barents Sea has an important influence on moss and O-horizon chemistry. Elements like
Na and Mg, which are enriched in sea water, show a steady decrease from the coast to a distance
of about 200 to 300 km inland (Figure 6.7). This indicates the important influence of a natural
process, the steady input of marine aerosols along the coast, on the element concentrations
observed in moss and O-horizon samples on a regional scale. An “exploratory” surprise was
the opposite trend as detected for Bi (and several other elements – see Reimann et al., 2000b)
(Figure 6.7). This trend follows the distribution of the major vegetation zones in the survey area
(tundra–subarctic birch forest–boreal forest) and is interpreted as a “bio-productivity” signal:
the higher the bio-productivity, the higher the enrichment of certain elements in the O-horizon.
In the examples shown so far, the trends were visible when studying the distribution of
the sample points in the figures (Figure 6.6 and 6.7) by eye. Trend lines will be even more
interesting when the data are far noisier and trends are not visible at first glance. The dis-
tribution of the pH values in the O-horizon samples along the transects provides such an
example (Figure 6.8). Along the north–south transect, pH is clearly influenced by distance
from the coast. The steady input of marine aerosols along the coast (see Figure 6.8) leads
to a replacement of protons in the O-horizon and results in a steadily increasing pH to-
wards the coast. The replaced protons end up in the lakes, resulting in lake water acidi-
fication along the coast (Reimann et al., 1999a, 2000a). Along the east–west transect one
would expect clear signs of acidification (lower pH) when approaching Monchegorsk with
its enormous SO2 emissions. Figure 6.8 (right) suggests the opposite: pH increases slightly
when approaching Monchegorsk, the anthropogenic effect of one of the largest SO2 -emission
sources on Earth on the pH in the O-horizon is clearly lower than the natural effect of
input of marine aerosols along the coastline. The fact that pH actually increases towards
Monchegorsk is explained by co-emissions of alkaline elements like Ca and Mg which more
than counteract the acidic effects of the SO2 emissions. The diagram suggests that it would
be a severe mistake to reduce the dust emissions of the smelter before the SO2 emissions
are reduced because then the positive environmental effects of these co-emissions would be
lost.

Figure 6.7 North-south transects along the western project border, the most pristine part of the Kola
Project survey area. The map (upper left) indicates the samples selected for plotting as transects. Upper
right: Na in moss; lower left: Bi in O-horizon soil; and lower right: Mg in O-horizon soil

6.5 Spatial distance plot

There may be cases where it is difficult to define a sensible transect. In such instances it could
still be interesting to study systematic changes of a variable with distance from a defined point.
This can be done in the spatial distance plot where a subarea (of course the whole survey area
can also be used) and a reference point are defined.
The calculated distance of each sample site from the reference point is plotted along the
x-axis. Along the y-axis the corresponding concentrations of the variable at the sample sites
are plotted. To better recognise a possible trend a smoothed line is fitted to the points.
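A minimal sketch of such a spatial distance plot is given below, assuming a data frame with UTM coordinates in metres and a Cu column; the names XCOO, YCOO and Cu as well as the reference point are illustrative placeholders, not necessarily those used in the Kola data files.

    ## Hypothetical survey data with UTM coordinates [m] and Cu concentrations
    set.seed(7)
    d <- data.frame(XCOO = runif(500, 4e5, 8e5),
                    YCOO = runif(500, 74e5, 78e5))
    ref <- c(x = 5.5e5, y = 77e5)                        # reference point, e.g. a smelter

    ## Euclidean distance of every sample site from the reference point, in km
    dist_km <- sqrt((d$XCOO - ref["x"])^2 + (d$YCOO - ref["y"])^2) / 1000
    d$Cu <- exp(log(8) + 2.5 * exp(-dist_km / 80) + rnorm(500, 0, 0.4))

    plot(dist_km, d$Cu, log = "y",
         xlab = "Distance from reference point [km]", ylab = "Cu in moss [mg/kg]")
    sm <- lowess(dist_km, log10(d$Cu), f = 2/3)          # smoothed trend line
    lines(sm$x, 10^sm$y, lwd = 2)
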
Figure 6.9 shows an example of such a spatial distance plot using the variable Cu in moss
and the location of Nikel/Zapoljarnij as the reference point. The plot clearly demonstrates the

Figure 6.8 pH in the O-horizon. Left plot: north–south transect. Right plot: east–west transect
through Monchegorsk

decreasing Cu concentrations with distance from the industrial centre, with background levels
being reached at a distance of about 200 km from Nikel/Zapoljarnij.
The spatial distance plot can accommodate any form and any direction of the chosen subarea.
It is, for example, easily possible to study the effects of Nikel/Zapoljarnij and Monchegorsk
at the same time. When choosing a north–west to south–east running subarea that includes
both sites, the reference point is set in the north-western part of the survey area; this results
in the plot shown in Figure 6.10, which displays two peaks of Cu concentrations: the first
(left) peak is related to Nikel/Zapoljarnij, the second (right) peak to Monchegorsk. In this plot

Figure 6.9 Right plot: spatial distance plot for Cu in moss (log-scale). The location of Nikel/Zapoljarnij
(dot on left map) is chosen as reference point, the outline indicates the sample sites included in the
spatial distance plot (right)

Figure 6.10 Right plot: spatial distance plot for Cu in moss (log-scale). The subarea (left map) includes
Nikel/Zapoljarnij and Monchegorsk, and the reference point (dot) is placed in the north-western part
of the survey area; the outline indicates the sample sites included in the spatial distance plot (right)

it is clearly visible that the Monchegorsk Cu emissions travel further than the emissions from
Nikel/Zapoljarnij, as indicated by the width of the peaks. One likely explanation could be that
the emissions at Nikel/Zapoljarnij have a higher proportion of particulates.

6.6 Spiderplots (normalised multi-element diagrams)

Spiderplots, where a selection of elements are plotted along the x-axis against the ratio of the
element to the concentration of the same element in some reference value (e.g., “chondrite”,
“mantle”, “shale”, “upper continental crust”) on the y-axis, are popular in petrology (see, e.g.,
Rollinson, 1993). The technique was originally developed to compare the distribution of the
Rare Earth Elements (REEs) between different rock types. REEs with an even atomic number
generally show higher concentrations in rocks than REEs with odd atomic numbers (Taylor
and McClennan, 1985). Through normalising the REEs to a reference value, the saw-blade
pattern can be brought to a more or less straight line when the ratios are plotted against the
elements. Deviations from the line are then immediately visible. Usually the elements are sorted
according to atomic number, but other sorting criteria (e.g., ionic potential or bulk partition
coefficient) are also used.
Only a few samples can be plotted before the graphic becomes increasingly unreadable; it is
thus not really suited for large data sets. The appearance of the plot will strongly depend on the
sequence of elements along the x-axis and, of course, on the selection of the set of reference
values used for normalisation. Except for the REEs, there are no standardised display orders,
a fact that limits the usefulness of these diagrams.
To display the Kola Project results in such a spider plot, the MEDIAN concentration of
selected variables per layer (moss, O-, B-, or C-horizon) could be shown. It could be argued
at length as to what reference value the results should be compared against. “Upper crust” or
“world soil” might be the most appropriate candidates at first glance. However, the average

Figure 6.11 Spiderplot showing element concentrations in the B-horizon, O-horizon and moss sample
materials relative to the element concentration in the C-horizon from the Kola Project plotted in chemical
abbreviation order (from Ag to Zn)

values quoted for “upper crust” or “world soil” are true total concentrations, which are not
really comparable to the aqua regia and concentrated HNO3-extraction data of the Kola Project.
Thus there are problems in selecting the most appropriate reference values for the display. The
upper soil horizons could be referenced to the C-horizon results from the Kola Project as the
best estimate of parent material geochemistry for the survey area (bearing the “cost” of being
“only” able to directly compare B-horizon with O-horizon and moss). It would also be possible
to use another large-scale data set where comparable analytical techniques were used (e.g.,
the Baltic Soil Survey data – Reimann et al., 2003). For the example plot the MEDIAN of
each variable in moss, O- and B-horizon was divided by the corresponding MEDIAN in the
C-horizon. The line at “1” in the plot is the reference line of the C-horizon against which the
other sample materials are compared. In the first example (Figure 6.11) the elements have been
ordered alphabetically (according to chemical abbreviation from Ag to Zn).
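A basic version of such a normalised multi-element plot can be assembled from column-wise MEDIANs, a division by the C-horizon MEDIANs and matplot() with a logarithmic y-axis. The matrices below are simulated stand-ins for the Kola sample materials; only the plotting principle is shown.

    ## Simulated concentration matrices (rows = samples, columns = elements)
    set.seed(8)
    elements <- c("Ag", "As", "Cu", "Ni", "Pb", "Zn")
    make_layer <- function(m) exp(matrix(rnorm(100 * 6, m, 0.5), ncol = 6,
                                         dimnames = list(NULL, elements)))
    moss <- make_layer(1); ohor <- make_layer(2.3); chor <- make_layer(2)

    ## MEDIAN of each element per material, normalised to the C-horizon MEDIAN
    med_c  <- apply(chor, 2, median)
    ratios <- rbind(Moss        = apply(moss, 2, median) / med_c,
                    "O-horizon" = apply(ohor, 2, median) / med_c)

    matplot(t(ratios), type = "b", log = "y", pch = 1:2, lty = 1:2, col = 1,
            xaxt = "n", xlab = "Element", ylab = "MEDIAN material / MEDIAN C-horizon")
    axis(1, at = seq_along(elements), labels = elements)
    abline(h = 1, lty = 3)          # reference line of the C-horizon
    legend("topright", legend = rownames(ratios), pch = 1:2, lty = 1:2)
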
The second example (Figure 6.12) shows the same plot where the elements are ordered
according to a steadily decreasing ratio in moss. Although the same data are plotted, the
resulting diagram looks very different. Here it is easier to compare the differences of the other
two sample media (B- and O-horizon) from the two extremes, C-horizon and moss.
It is immediately apparent that the average composition of the B-horizon – with the exception
of very few elements – closely follows that of the C-horizon. Furthermore, it is clear that a
number of elements show unusually high concentrations in the moss, while others have much
lower values in the moss than in the C-horizon. It is also quite easy to identify those elements
that show a different behaviour in the O-horizon than they do for the moss or B-horizon.

6.7 Scatterplot matrix

The scatterplot matrix is a collection of scatterplots of all selected variables against each other
(Figures 6.13 and 6.14). The scatterplot matrix is one of the most powerful tools in graphical

Figure 6.12 The same spiderplot as Figure 6.11, but the elements are sorted according to the con-
tinuously decreasing ratio for the moss samples relative to the underlying C-horizon

data analysis, and is an entrance point to multivariate data analysis. It can, for example, be
used to identify those pairs of elements where a more detailed study in xy-plots appears to be
necessary.
When using the scatterplot matrix with a multi-element data set, one of the major problems
is screen and paper size. The individual plots become very small on the screen if more than
10 to 12 variables are displayed. If an A0 (or 36 inch) printer or plotter is available, up to
a 60 × 60 matrix may be plotted. An important consideration before plotting a scatterplot
matrix is the handling of outliers. Extreme outliers in several variables will greatly disturb
the plot (Figure 6.13). Thus transformations need to be considered prior to plotting in order to
down-weight the influence of outliers (Figure 6.14). Alternatively samples that exhibit extreme
outliers can be removed (trimmed) prior to plotting a scatterplot matrix. Although the scatterplot
matrix is a graphic and not a “formal” correlation analysis, the special problems of the closure
of geochemical data still remain (see Section 10.5), especially for major and minor elements.
The scatterplot matrix may help, however, to visualise some of these problems right away.
Furthermore, by additive or centred logratio transformation of the major and minor element
data (see Sections 10.5.1 and 10.5.2), the “opened” data may be displayed.
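In base R the scatterplot matrix is obtained with pairs(), and a log-transformation prior to plotting reduces the influence of extreme values, as in Figure 6.14. The data frame below is simulated for illustration.

    ## Simulated multi-element data with a few extreme values
    set.seed(9)
    x <- data.frame(Ca = exp(rnorm(300, 7, 0.6)), Cu = exp(rnorm(300, 2.5, 0.8)),
                    Mg = exp(rnorm(300, 7.5, 0.5)), Sr = exp(rnorm(300, 3, 0.7)))
    x$Cu[1:5] <- x$Cu[1:5] * 50    # extreme outliers dominating the raw plot

    pairs(x)                       # raw data: outliers compress the point clouds
    pairs(log10(x))                # log-transformed data reveal the structure
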

6.8 Ternary plots

Ternary plots show the relative proportions of three variables in one diagram (see, e.g., Cannings
and Edwards, 1968). They have been used in petrology since the early 1900s to study the
evolution of magmas or to differentiate between certain rock types. When three variables for
a single observation sum to 100 per cent, they can be plotted as a point in a ternary plot. To
construct an informative ternary plot, the three variables should have analytical results within
the same order of magnitude, or all points will plot into the corner of the variable with the
highest values. If there are large differences between the variables, it is necessary to multiply
one or two of the variables by a factor to bring the values into the same data range as that of
the other variables. The values for all three variables are then re-calculated to 100 per cent and

Figure 6.13 Scatterplot matrix of some selected elements, Kola Project C-horizon data. Axes could
be labelled, but that will result in even smaller scatterplots, which contain the important information
of this graphic. Some extreme values dominate the plot

plotted into the ternary space. In a ternary plot only two variables are independent, since the value
of the third variable is then automatically known; the data are closed (Aitchison, 1986), and thus
these diagrams should not be used to infer “correlations”.
However, they can be used in environmental and other applied geochemical studies to aid
“pattern recognition” and comparison with other studies.
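A ternary plot can be sketched in base R by re-calculating the three variables to proportions and converting these to triangular (barycentric) plotting coordinates; dedicated plotting functions are also available in several contributed packages. The sketch below uses simulated data and generic component names.

    ## Simulated values of three components per sample
    set.seed(10)
    abc <- cbind(A = rexp(200, 1), B = rexp(200, 0.7), C = rexp(200, 1.3))
    p <- abc / rowSums(abc)                       # closed data: proportions summing to 1

    ## Barycentric coordinates: corner A bottom left, B bottom right, C on top
    px <- p[, "B"] + 0.5 * p[, "C"]
    py <- sqrt(3) / 2 * p[, "C"]

    plot(px, py, asp = 1, axes = FALSE, xlab = "", ylab = "",
         xlim = c(0, 1), ylim = c(0, sqrt(3) / 2))
    polygon(c(0, 1, 0.5), c(0, 0, sqrt(3) / 2))   # triangle outline
    text(c(0, 1, 0.5), c(0, 0, sqrt(3) / 2), labels = c("A", "B", "C"),
         pos = c(1, 1, 3), xpd = NA)
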
Figure 6.15 shows two ternary plots constructed with the Kola Moss data set. The scale of
the three axes varies between 0 and 100 per cent for each element (100 per cent in the corner
of the element). The grey dashed lines in Figure 6.15 show the percentages. Just as in the
scatterplot, it should be possible to identify interesting samples (usually outliers) via a simple
click of the mouse. The Cu-Ni-Pb plot demonstrates that Pb is not an important component
of the smelter emissions, which are dominated by Cu and Ni. Sample 907 was taken close to
the Monchegorsk smelter. The Mn-Al-Fe ternary plot was constructed because both Al and
Fe concentrations will be strongly influenced by the input of dust to the moss, while Mn is a
plant nutrient. It is apparent that there are samples where Al dominates the dust component
and others where Fe is more important. In addition, it appears that the majority of samples
tend to the Mn corner of the diagram.

Figure 6.14 The same scatterplot matrix as above (Figure 6.13) but with log-transformed data to
decrease the influence of extreme values


Figure 6.15 Two ternary plots for selected elements of the Kola Project Moss data set

For petrological applications it should again be possible to display pre-defined fields
(or lines) in these diagrams that outline where rocks with particular names plot. Just
as for xy-plots, one software package, based on R, which can plot these fields and
lines for many pre-defined ternary plots (Janousek et al., 2006), is freely available at
http://www.gla.ac.uk/gcdkit/.

6.9 Summary

Once the statistical and spatial data structure is documented, the relationships between different
variables are an important avenue of study. While up to this point it was no major problem to
study and document variable by variable, now almost countless possibilities are encountered.
The number of possible plots showing bivariate data relationships increases quadratically with
the number of variables. It may thus be tempting to commence right away with multivariate
data analysis. However, all multivariate methods are built on many statistical assumptions
about the data. It is thus advisable to use scatterplots first as the basis for studying bivariate
data behaviour. Such bivariate plots used in an exploratory manner can be very informative,
especially when adding user-defined lines, regression lines and trend lines. Plotting spatial and
time trends are special cases of bivariate data analysis. The scatterplot matrix can be used to
gain a graphical impression of the relationships between all possible pairs of variables. Three
variables can be displayed in ternary plots. All of these simple plots can be used to great
advantage to gain an improved knowledge of the data, its structure and interrelationships,
before starting any formal statistical data analysis.
7 Defining Background and Threshold, Identification of Data Outliers and Element Sources

The reliable detection of data outliers and unusual data behaviour is one of the key tasks in
the statistical analysis of applied geochemical data, and has remained a core problem since the
early days of regional geochemistry (see, e.g., Hawkes and Webb, 1962; Matschullat et al.,
2000; Reimann and Garrett, 2005; Reimann et al., 2005).
In Chapters 3 and 4 several methods to visualise the data distribution and to calculate dis-
tribution measures have been introduced. Statistically, it should now be no problem to find the
univariate extreme values (for multivariate outlier detection see Chapter 13) for any determined
variable. To identify the extreme values in a normal distribution, statisticians usually calculate
boundaries via the formula MEAN ± 2 · SD. The calculation will result in about 4.6 percent of
the data belonging to a normal distribution being identified beyond the boundaries as extreme
values, 2.3 percent at the lower end and 2.3 percent at the upper end of the distribution. The
inner 95.4 percent of the data will represent the “background” variation in the data set. In the
early days of regional geochemistry, this formula was frequently used to identify a certain
proportion of “unusually high” values in a data set for further inspection (Hawkes and Webb,
1962). The upper border calculated by the “MEAN ± 2 · SD” rule was defined as the “thresh-
old”, the boundary between background variation and extreme values. The terms “background”
and “threshold” have often been intermingled as identifying the boundary between “normal”
and “extreme” values, but should be kept strictly separate because “background” identifies the
main range of variation in the da