Geostatistical Analysis

1. Introduction to Geostatistical Analysis

1.1. What is geostatistics?

Geostatistics is a class of statistics used to analyze and predict the values associated with
spatial or spatiotemporal phenomena. It incorporates the spatial (and in some cases
temporal) coordinates of the data within the analyses. Many geostatistical tools were
originally developed as a practical means to describe spatial patterns and interpolate
values for locations where samples were not taken. Those tools and methods have since
evolved to not only provide interpolated values, but also measures of uncertainty for those
values. The measurement of uncertainty is critical to informed decision making, as it
provides information on the possible values (outcomes) for each location rather than just
one interpolated value. Geostatistical analysis has also evolved from uni- to multivariate
and offers mechanisms to incorporate secondary datasets that complement a (possibly
sparse) primary variable of interest, thus allowing the construction of more accurate
interpolation and uncertainty models.

Geostatistics is widely used in many areas of science and engineering, for example:

 The mining industry uses geostatistics for several aspects of a project: initially to
quantify mineral resources and evaluate the project's economic feasibility, then on
a daily basis in order to decide which material is routed to the plant and which is
waste, using updated information as it becomes available.

 In the environmental sciences, geostatistics is used to estimate pollutant levels in
order to decide if they pose a threat to environmental or human health and warrant
remediation.

 Relatively new applications in the field of soil science focus on mapping soil
nutrient levels (nitrogen, phosphorus, potassium, and so on) and other indicators
(such as electrical conductivity) in order to study their relationships to crop yield
and prescribe precise amounts of fertilizer for each location in the field.

 Meteorological applications include prediction of temperatures, rainfall, and
associated variables (such as acid rain).
 Most recently, there have been several applications of geostatistics in the area of
public health, for example, the prediction of environmental contaminant levels and
their relation to the incidence rates of cancer.

In all of these examples, the general context is that there is some phenomenon of interest
occurring in the landscape (the level of contamination of soil, water, or air by a pollutant;
the content of gold or some other metal in a mine; and so forth). Exhaustive studies are
expensive and time consuming, so the phenomenon is usually characterized by taking
samples at different locations. Geostatistics is then used to produce predictions (and
related measures of uncertainty of the predictions) for the unsampled locations. A
generalized workflow for geostatistical studies is described in The geostatistical workflow.

1.2. The geostatistical workflow

In this topic, a generalized workflow for geostatistical studies is presented, and the main
steps are explained. As mentioned in What is geostatistics, geostatistics is a class of
statistics used to analyze and predict the values associated with spatial or spatiotemporal
phenomena. ArcGIS Geostatistical Analyst provides a set of tools that allow models that
use spatial (and temporal) coordinates to be constructed. These models can be applied to
a wide variety of scenarios and are typically used to generate predictions for unsampled
locations, as well as measures of uncertainty for those predictions.
The first step, as in almost any data-driven study, is to closely examine the data. This
typically starts by mapping the dataset, using a classification and color scheme that allow
clear visualization of important characteristics that the dataset might present, for example,
a strong increase in values from north to south (Trend—see Trend_analysis); a mix of high
and low values in no particular arrangement (possibly a sign that the data was taken at a
scale that does not show spatial correlation—see
Examining_spatial_structure_and_directional_variation); or zones that are more densely
sampled (preferential sampling) and may lead to the decision to use declustering weights
in the analysis of the data—see Implementing_declustering_to_adjust_for_preferential
sampling. See Map the data for a more detailed discussion on mapping and classification
schemes.

The second stage is to build the geostatistical model. This process can entail several
steps, depending on the objectives of the study (that is, the type(s) of information the
model is supposed to provide) and the features of the dataset that have been deemed
important enough to incorporate. At this stage, information collected during a rigorous
exploration of the dataset and prior knowledge of the phenomenon determine how
complex the model is and how good the interpolated values and measures of uncertainty
will be. In the figure above, building the model can involve preprocessing the data to
remove spatial trends, which are modeled separately and added back in the final step of
the interpolation process (see Trend_analysis); transforming the data so that it follows a
Gaussian distribution more closely (required by some methods and model outputs—see
About_examining_the_distribution_of_the_data); and declustering the dataset to
compensate for preferential sampling. While a lot of information can be derived by
examining the dataset, it is important to incorporate any knowledge you might have of the
phenomenon. The modeler cannot rely solely on the dataset to show all the important
features; those that do not appear can still be incorporated into the model by adjusting
parameter values to reflect an expected outcome. It is important that the model be as
realistic as possible in order for the interpolated values and associated uncertainties to be
accurate representations of the real phenomenon.

In addition to preprocessing the data, it may be necessary to model the spatial structure
(spatial correlation) in the dataset. Some methods, like kriging, require this to be explicitly
modeled using semivariogram or covariance functions (see
Semivariograms_and_covariance_functions); whereas other methods, like Inverse
Distance Weighting, rely on an assumed degree of spatial structure, which the modeler
must provide based on prior knowledge of the phenomenon.

A final component of the model is the search strategy. This defines how many data points
are used to generate a value for an unsampled location. Their spatial configuration
(location with respect to one another and to the unsampled location) can also be defined.
Both factors affect the interpolated value and its associated uncertainty. For many
methods, a search ellipse is defined, along with the number of sectors the ellipse is split
into and how many points are taken from each sector to make a prediction (see
Search_neighborhood).
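
To make the idea of a search strategy concrete, the sketch below selects, for a single
unsampled location, up to a fixed number of the nearest samples that fall within a circular
search radius. The data, radius, and neighbor count are hypothetical, and the sketch ignores
the ellipse and sector options that Geostatistical Analyst provides.

# Minimal sketch of a circular search neighborhood (hypothetical data and settings;
# not the Geostatistical Analyst implementation, which also supports ellipses and sectors).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(200, 2))    # sample locations
values = rng.normal(10, 2, size=200)       # measured values at those locations

tree = cKDTree(xy)
target = np.array([50.0, 50.0])            # unsampled location to predict

max_neighbors = 15                         # points taken from the neighborhood
search_radius = 20.0                       # radius of the (circular) search window

dist, idx = tree.query(target, k=max_neighbors, distance_upper_bound=search_radius)
found = np.isfinite(dist)                  # query pads missing neighbors with infinity
neighbor_values = values[idx[found]]
neighbor_distances = dist[found]
print(len(neighbor_values), "neighbors used for the prediction at", target)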

Once the model has been completely defined, it can be used in conjunction with the
dataset to generate interpolated values for all unsampled locations within an area of
interest. The output is usually a map showing values of the variable being modeled. The
effect of outliers can be investigated at this stage, as they will probably change the model's
parameter values and thus the interpolated map. Depending on the interpolation method,
the same model can also be used to generate measures of uncertainty for the interpolated
values. Not all models have this capability, so it is important to define at the start if
measures of uncertainty are needed. This determines which of the models are suitable
(see Classification trees).

As with all modeling endeavors, the model's output should be checked, that is, make sure
that the interpolated values and associated measures of uncertainty are reasonable and
match your expectations.

Once the model has been satisfactorily built, adjusted, and its output checked, the results
can be used in risk analyses and decision making.

1.3. What is the ArcGIS Geostatistical Analyst extension?

The ArcGIS Geostatistical Analyst extension provides the capability for surface modeling
using deterministic and geostatistical methods. The tools it provides are fully integrated
with the GIS modeling environments and allow GIS professionals to generate interpolation
models and assess their quality before using them in any further analysis. Surfaces (model
output) can subsequently be used in models (both in the ModelBuilder and Python
environments), visualized, and analyzed using other ArcGIS extensions, such as
ArcGIS Spatial Analyst and ArcGIS 3D Analyst.
The tools provided in the ArcGIS Geostatistical Analyst extension are grouped into three
categories:

 The Geostatistical Analyst toolbar gives access to a series of Exploratory Spatial
Data Analysis (ESDA) graphs.

 The Geostatistical Wizard (also accessed through the toolbar) leads analysts
through the process of creating and evaluating an interpolation model.

 A set of geoprocessing tools that are specially designed to work with the outputs of
the models and extend the capabilities of the Geostatistical Wizard.

The following topics provide information on how to access and configure the ArcGIS
Geostatistical Analyst extension and a quick tour of its main components and features:

 How to access and configure the extension: Enabling the ArcGIS Geostatistical
Analyst extension and Adding the Geostatistical Analyst toolbar to ArcMap

 A quick tour of the main components and features: A quick tour of Geostatistical
Analyst

1.4. Essential vocabulary for Geostatistical Analyst

The following terms and concepts arise repeatedly in geostatistics and within
Geostatistical Analyst.

Cross validation: A technique used to assess how accurate an interpolation model is. In
Geostatistical Analyst, cross validation leaves one point out and uses the rest to predict a
value at that location. The point is then added back into the dataset, and a different one is
removed. This is done for all samples in the dataset and provides pairs of predicted and
known values that can be compared to assess the model's performance. Results are
usually summarized as Mean and Root Mean Squared errors.

Deterministic methods: In Geostatistical Analyst, deterministic methods are those that
create surfaces from measured points, based on either an extent of similarity (for example,
inverse distance weighted) or a degree of smoothing (for example, radial basis functions).
They do not provide a measure of uncertainty (error) of the predictions.

Geostatistical layer: Results produced by the Geostatistical Wizard and many of the
geoprocessing tools in the Geostatistical Analyst toolbox are stored in a surface called a
geostatistical layer. Geostatistical layers can be used to make maps of the results, view
and revise the interpolation method's parameter values (by opening them in the
Geostatistical Wizard), create other types of geostatistical layers (such as prediction error
maps), and export the results to raster or vector (contour, filled contour, and points)
formats.

Geostatistical methods: In Geostatistical Analyst, geostatistical methods are those that are
based on statistical models that include autocorrelation (the statistical relationships among
the measured points). These techniques can produce prediction surfaces as well as some
measure of the uncertainty (error) associated with the predictions.

Interpolation: A process that uses measured values taken at known sample locations to
predict (estimate) values for unsampled locations. Geostatistical Analyst offers several
interpolation methods, which differ in their underlying assumptions, data requirements, and
ability to generate different types of output (for example, predicted values as well as the
errors [uncertainties] associated with them).

Kernel: A weighting function used in several of the interpolation methods offered in
Geostatistical Analyst. Typically, higher weights are assigned to sample values that are
close to the location where a prediction is being made, and lower weights are assigned to
sample values that are farther away.

Kriging: A collection of interpolation methods that rely on semivariogram models of spatial
autocorrelation to generate predicted values, errors associated with the predictions, and
other information regarding the distribution of possible values for each location in the study
area (through quantile and probability maps, or via geostatistical simulation, which provides
a set of possible values for each location).

Search neighborhood: Most of the interpolation methods use a local subset of the data to
make predictions. Imagine a moving window: only data within the window is used to make
a prediction at the center of the window. This is done because samples far from the
prediction location contribute largely redundant information and because using a local
subset reduces the computing time required to generate predicted values for the entire
study area. The choice of neighborhood (the number of nearby samples and their spatial
configuration within the window) affects the prediction surface and should be made with
care.

Semivariogram: A function that describes the differences (variance) between samples
separated by varying distances. Typically, the semivariogram shows low variance at small
separation distances and larger variance at greater separation distances, indicating that
the data is spatially autocorrelated. Semivariograms estimated from sample data are
empirical semivariograms; they are represented as a set of points on a graph. A function
fitted to these points is known as a semivariogram model. The semivariogram model is a
key component of kriging (a powerful interpolation method that can provide predicted
values, errors associated with the predictions, and information about the distribution of
possible values for each location in the study area).

Simulation: In geostatistics, simulation refers to a technique that extends kriging by
producing many possible versions of a predicted surface (in contrast to kriging, which
produces a single surface). The set of predicted surfaces provides a wealth of information
that can be used to describe the uncertainty in a predicted value for a particular location,
the uncertainty for a set of predicted values in an area of interest, or a set of predicted
values that can be used as input to a second model (physical, economic, and so forth) to
assess risk and make better informed decisions.

Spatial autocorrelation: Natural phenomena often present spatial autocorrelation: sample
values taken close to one another are more alike than samples taken far away from each
other. Some interpolation methods require an explicit model of spatial autocorrelation (for
example, kriging), others rely on an assumed degree of spatial autocorrelation without
providing a means to measure it (for example, inverse distance weighting), and others do
not require any notion of spatial autocorrelation at all. Note that when spatial
autocorrelation exists, traditional statistical methods (which rely on independence among
observations) cannot be used reliably.

Transformation: A data transformation applies a function (log, Box-Cox, arcsine, normal
score) to the data to change the shape of its distribution and/or stabilize the variance (that
is, reduce the relationship between the mean and variance, for example, when data
variability increases as the mean value increases).

Validation: Similar to cross validation, but instead of using the same dataset to build and
evaluate the model, two datasets are used: one to build the model and the other as an
independent test of its performance. If only one dataset is available, the Subset Features
tool can be used to randomly split it into training and test subsets.
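
As a concrete illustration of the cross-validation idea described above (leave one sample out,
predict it from the rest, repeat for every sample, then summarize the errors), here is a minimal
sketch that applies the procedure to a simple inverse distance weighted predictor on synthetic
data. It is not the Geostatistical Analyst implementation, and the dataset and power parameter
are hypothetical.

# Leave-one-out cross validation with a simple IDW predictor (illustrative sketch on
# synthetic data; Geostatistical Analyst applies the same idea to its own models).
import numpy as np

rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(100, 2))                      # sample locations
z = np.sin(xy[:, 0] / 15.0) + 0.1 * rng.normal(size=100)     # hypothetical measurements

def idw_predict(xy_known, z_known, target, power=2.0):
    """Inverse distance weighted prediction at a single target location."""
    d = np.linalg.norm(xy_known - target, axis=1)
    if np.any(d == 0):                                       # exact at data locations
        return z_known[d == 0][0]
    w = 1.0 / d**power
    return np.sum(w * z_known) / np.sum(w)

errors = []
for i in range(len(z)):                                      # leave each sample out in turn
    keep = np.arange(len(z)) != i
    errors.append(idw_predict(xy[keep], z[keep], xy[i]) - z[i])

errors = np.asarray(errors)
print("Mean error:", errors.mean())
print("Root mean squared error:", np.sqrt(np.mean(errors**2)))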

1.5. A quick tour of Geostatistical Analyst

There are three main components of Geostatistical Analyst:

 A set of exploratory spatial data analysis (ESDA) graphs

 The Geostatistical Wizard

 The Geostatistical Analyst toolbox, which houses geoprocessing tools specifically
designed to extend the capabilities of the Geostatistical Wizard and allow further
analysis of the surfaces it generates

The ESDA graphs and the Geostatistical Wizard are accessed through the Geostatistical
Analyst toolbar, which must be added to the ArcMap display once the Geostatistical
Analyst extension has been enabled (see Enabling the Geostatistical Analyst extension
and Adding the Geostatistical Analyst toolbar to ArcMap).

Exploratory spatial data analysis graphs

Before using the interpolation techniques, you should explore your data using the
exploratory spatial data analysis tools. These tools allow you to gain insights into your data
and to select the most appropriate method and parameters for the interpolation model. For
example, when using ordinary kriging to produce a quantile map, you should examine the
distribution of the input data because this particular method assumes that the data is
normally distributed. If your data is not normally distributed, you should include a data
transformation as part of the interpolation model. A second example is that you might
detect a spatial trend in your data using the ESDA tools and want to include a step to
model it independently as part of the prediction process.

The ESDA tools are accessed through the Geostatistical Analyst toolbar (shown below)
and are composed of the following:

 Histogram—Examine the distribution and summary statistics of a dataset.
 Normal QQ Plot and General QQ Plot—Assess whether a dataset is normally
distributed and explore whether two datasets have similar distributions,
respectively.
 Voronoi Maps—Visually examine the spatial variability and stationarity of a dataset.
 Trend Analysis—Visualize and examine spatial trends in a dataset.
 Semivariogram/Covariance Cloud —Evaluate the spatial dependence
(semivariogram and covariance) in a dataset.
 Crosscovariance Cloud—Assess the spatial dependence (covariance) between two
datasets.

The ESDA graphs are shown below.

Tools for exploring a single dataset

The following graphic illustrates the ESDA tools used for analyzing one dataset at a time:

Tools for exploring relationships between datasets

The following graphic depicts the two tools that are designed to examine relationships
between two datasets:

Geostatistical Wizard

The Geostatistical Wizard is accessed through the Geostatistical Analyst toolbar, as
shown below:

The Geostatistical Wizard is a dynamic set of pages that is designed to guide you through
the process of constructing and evaluating the performance of an interpolation model.
Choices made on one page determine which options will be available on the following
pages and how you interact with the data to develop a suitable model. The wizard guides
you from the point when you choose an interpolation method all the way to viewing
summary measures of the model's expected performance. A simple version of this
workflow (for inverse distance weighted interpolation) is represented graphically below:
During construction of an interpolation model, the wizard allows changes in parameter
values, suggests or provides optimized parameter values, and allows you to move forward
or backward in the process to assess the cross-validation results to see whether the
current model is satisfactory or some of the parameter values should be modified. This
flexibility, in addition to dynamic data and surface previews, makes the wizard a powerful
environment in which to build interpolation models.

The Geostatistical Wizard provides access to a number of interpolation techniques, which
are divided into two main types: deterministic and geostatistical.

Deterministic methods

Deterministic techniques have parameters that control either (1) the extent of similarity (for
example, inverse distance weighted) of the values or (2) the degree of smoothing (for
example, radial basis functions) in the surface. These techniques are not based on a
random spatial process model, and there is no explicit measurement or modeling of spatial
autocorrelation in the data. Deterministic methods include the following:

 Global polynomial interpolation
 Local polynomial interpolation
 Inverse distance weighted
 Radial basis functions
 Interpolation with barriers (using impermeable or semipermeable barriers in the
interpolation process)
o Diffusion kernel
o Kernel smoothing

Geostatistical methods

Geostatistical techniques assume that at least some of the spatial variation observed in
natural phenomena can be modeled by random processes with spatial autocorrelation and
require that the spatial autocorrelation be explicitly modeled. Geostatistical techniques can
be used to describe and model spatial patterns (variography), predict values at
unmeasured locations (kriging), and assess the uncertainty associated with a predicted
value at the unmeasured locations (kriging).

The Geostatistical Wizard offers several types of kriging, which are suitable for different
types of data and have different underlying assumptions:

 Ordinary
 Simple
 Universal
 Indicator
 Probability
 Disjunctive
 Areal interpolation
 Empirical Bayesian

These methods can be used to produce the following surfaces:

 Maps of kriging predicted values
 Maps of kriging standard errors associated with predicted values
 Maps of probability, indicating whether a predefined critical level was exceeded
 Maps of quantiles for a predetermined probability level

There are exceptions to this:

 Indicator and probability kriging produce the following:
o Maps of probability, indicating whether a predefined critical level was
exceeded
o Maps of standard errors of indicators
 Areal interpolation produces the following:
o Maps of predicted values
o Maps of standard errors associated with predicted values

Geostatistical Analyst toolbox

The Geostatistical Analyst toolbox includes tools for analyzing data, producing a variety of
output surfaces, examining and transforming geostatistical layers to other formats,
performing geostatistical simulation and sensitivity analysis, and aiding in designing
sampling networks. The tools have been grouped into five toolsets:

 Interpolation—Contains geoprocessing tools that perform interpolation (as does the
Geostatistical Wizard) and that can be used as stand-alone tools or in ModelBuilder
and Python
 Sampling Network Design—Has tools that aid in designing or modifying an existing
sampling design/monitoring network
 Simulation—Extends kriging by performing geostatistical simulation and permits
extraction of the simulated results for points or polygonal areas
 Utilities—General use tools to extract subsets of a dataset, perform cross-
validation to assess model performance, examine sensitivity to variation in
semivariogram parameters, and visually represent the neighborhoods used by the
interpolation tools
 Working with Geostatistical Layers—Has tools that generate predictions for point
locations, export geostatistical layers to raster and vector formats, retrieve and set
interpolation model parameters (in an XML parameter file), and generate new
geostatistical layers (based on an XML parameter file and datasets)
Subset Features

While cross-validation is provided for all methods available in the Geostatistical Wizard
and can also be run for any geostatistical layer using the Cross Validation geoprocessing
tool, a more rigorous way to assess the quality of an output surface is to compare
predicted values with measurements that were not used to construct the interpolation
model. As it is not always possible to go back to the study area to collect an independent
validation dataset, one solution is to divide the original dataset into two parts. One part can
be used to construct the model and produce a surface. The other part can be used to
compare and validate the output surface. The Subset Features tool enables you to split a
dataset into training and test datasets. The Subset Features tool is a geoprocessing tool
(housed in the Geostatistical Analyst toolbox shown in the section above). For
convenience, this tool is also available from the Geostatistical Analyst toolbar, as shown in
the following figure:
For further information on this tool and how to use it, see How Subset Features works in
Geostatistical Analyst and Using validation to assess models.
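
The random split itself can be illustrated in a few lines; the fraction and the sample count below
are hypothetical, and the Subset Features tool performs the equivalent operation directly on a
point feature class, writing the training and test subsets out for you.

# Randomly split a dataset into training and test subsets (conceptual sketch;
# the Subset Features tool does the equivalent for point feature classes).
import numpy as np

rng = np.random.default_rng(2)
n_samples = 250                            # hypothetical number of samples
train_fraction = 0.7                       # for example, 70% training and 30% test

indices = rng.permutation(n_samples)
n_train = int(round(train_fraction * n_samples))
train_idx, test_idx = indices[:n_train], indices[n_train:]
print(len(train_idx), "training samples,", len(test_idx), "test samples")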

1.5.1. The process of building an interpolation model

Geostatistical Analyst includes many tools for analyzing data and producing a variety of
output surfaces. While the reasons for your investigations might vary, you're encouraged
to adopt the approach described in The geostatistical workflow when analyzing and
mapping spatial processes:

 Represent the data—Create layers and display them in ArcMap.
 Explore the data—Examine the statistical and spatial properties of your datasets.
 Choose an appropriate interpolation method—The choice should be driven by the
objectives of the study, your understanding of the phenomenon, and what you
require the model to provide (as output).
 Fit the model—To create a surface. The Geostatistical Wizard is used in the
definition and refinement of an appropriate model.
 Perform diagnostics—Check that the results are reasonable (expected), and
evaluate the output surface using cross-validation and validation. This helps you
understand how well the model predicts the values at unsampled locations.

Both the Geostatistical Wizard and Geostatistical Analyst toolbox offer many interpolation
methods. You should always have a clear understanding of the objectives of your study
and how the predicted values (and other associated information) will help you make more
informed decisions when choosing a method. To provide some guidance, see
Classification trees for a set of classification trees of the diverse methods.

(Data shown in the figures was provided courtesy of the Alaska Fisheries Science Center.)

2. Choosing the right method

2.1. An introduction to interpolation methods

Geostatistics, as mentioned in the introductory topic What is geostatistics?, is a collection
of methods that allow you to estimate values for locations where no samples have been
taken and also to assess the uncertainty of these estimates. These functions are critical in
many decision-making processes, as it is impossible in practice to take samples at every
location in an area of interest.

It is important to remember, however, that these methods are a means that allows you to
construct models of reality (that is, of the phenomenon you are interested in). It is up to
you, the practitioner, to build models that suit your specific needs and provide the
information necessary to make informed and defensible decisions. A big part of building a
good model is your understanding of the phenomenon, how the sample data was obtained
and what it represents, and what you expect the model to provide. General steps in the
process of building a model are described in The geostatistical workflow.

Many interpolation methods exist. Some are quite flexible and can accommodate different
aspects of the sample data. Others are more restrictive and require that the data meet
specific conditions. Kriging methods, for example, are quite flexible, but within the kriging
family there are varying degrees of conditions that must be met for the output to be valid.
Geostatistical Analyst offers the following interpolation methods:

 Global polynomial

 Local polynomial

 Inverse distance weighted

 Radial basis functions

 Diffusion kernel

 Kernel smoothing

 Ordinary kriging

 Simple kriging

 Universal kriging

 Indicator kriging

 Probability kriging

 Disjunctive kriging

 Gaussian geostatistical simulation

 Areal interpolation
 Empirical Bayesian kriging

Each of these methods has its own set of parameters, allowing it to be customized for a
particular dataset and requirements on the output that it generates. To provide some
guidance in selecting which to use, the methods have been classified according to several
different criteria, as shown in Classification trees of the interpolation methods offered in
Geostatistical Analyst. After you clearly define the goal of developing an interpolation
model and fully examine the sample data, these classification trees may be able to guide
you to an appropriate method.

2.2. Classification trees of the interpolation methods offered in Geostatistical Analyst

One of the most important decisions you will have to make is to define what your
objective(s) is in developing an interpolation model. In other words, what information do
you need the model to provide so that you can make a decision? For example, in the
public health arena, interpolation models are used to predict levels of contaminants that
can be statistically associated with disease rates. Based on that information, further
sampling studies can be designed, public health policies can be developed, and so on.

Geostatistical Analyst offers many different interpolation methods. Each has unique
qualities and provides different information (in some cases, methods provide similar
information; in other cases, the information may be quite different). The following diagrams
show these methods classified according to different criteria. Choose a criterion that is
important for your particular situation and a branch in the corresponding tree that
represents the option that you are interested in. This will lead you to one or more
interpolation methods that may be appropriate for your situation. Most likely, you will have
several important criteria to meet and will use several of the classification trees. Compare
the interpolation methods suggested by each tree branch you follow and pick a few
methods to contrast before deciding on a final model.

The first tree suggests methods based on their ability to generate predictions or
predictions and associated errors.
Some methods require a model of spatial autocorrelation to generate predicted values, but
others do not. Modeling spatial autocorrelation requires defining extra parameter values
and interactively fitting a model to the data.

Different methods generate different types of output, which is why you must decide what
type of information you need to generate prior to building the interpolation model.
Interpolation methods vary in their levels of complexity, which can be measured by the
number of assumptions that must be met for the model to be valid.

Some interpolators are exact (at each input data location, the surface will have exactly the
same value as the input data value), while others are not. Exact replication of the input
data may be important in some situations.
Some methods produce surfaces that are smoother than others. Radial basis functions are
smooth by construction, for example. The use of a smooth search neighborhood will
produce smoother surfaces than a standard search neighborhood.

For some decisions, it is important to consider not only the predicted value at a location
but also the uncertainty (variability) associated with that prediction. Some methods provide
measures of uncertainty, while others do not.
Finally, processing speed may be a factor in your analysis. In general, most of the
interpolation methods are relatively fast, except when barriers are used to control the
interpolation process.

The classification trees use the following abbreviations for the interpolation methods:

Abbreviation Method name

GPI Global Polynomial Interpolation
LPI Local Polynomial Interpolation
IDW Inverse Distance Weighted
RBF Radial Basis Functions
KSB Kernel Smoothing with Barriers
DKB Diffusion Kernel with Barriers
Kriging Ordinary, simple, universal, indicator, probability, disjunctive, and
empirical Bayesian kriging
Simulation Gaussian geostatistical simulation, based on a simple kriging model

2.3. Examining and understanding your data

2.3.1. The importance of knowing your data

As mentioned in The geostatistical workflow, there are many stages involved in creating a
surface. The first is to fully explore the data and identify important features that will be
incorporated in the model. These features must be identified at the beginning of the
process because a number of choices have to be made and parameter values have to be
specified in each stage of building the model. Note that, in the Geostatistical Wizard, the
choices you make determine the options that are available in the following steps of the
process, so it is important to identify the main features of the model before starting to build
it. While the Geostatistical Wizard provides reliable default values (some of which are
calculated specifically for your data), it cannot interpret the context of your study or the
objectives you have in creating the model. It is critical that you create and refine the model
based on additional insights gained from prior knowledge of the phenomenon and data
exploration in order to generate a more accurate surface.

The following topics provide more detail regarding data exploration and information on how
to use the findings when building an interpolation model:

 Map the data—Covers the first step in data exploration: mapping the data using a
classification scheme that shows the important features.

 Exploratory Spatial Data Analysis—Provides an overview of the Exploratory Spatial
Data Analysis (ESDA) tools and their uses.

 Data distributions and transformations—Covers the Histogram, Normal QQ Plot,
and General QQ Plot tools, as well as data transformations.

 Looking for global and local outliers—Presents techniques for identifying global and
local outliers using the Histogram, Semivariogram/Covariance Cloud, and Voronoi
Map tools.

 Trend analysis—Examines how to identify global trends in the data using the Trend
Plot tool.
 Examining local variation—Indicates how to use the Voronoi Map tool to show
whether the local mean and local standard deviation are relatively constant over
the study area (a visualization of stationarity). The tool also provides other local
factors (including clustering) that can be useful in identifying outliers.

 Examining spatial autocorrelation—Demonstrates how the semivariogram and
covariance and cross-covariance clouds are built and how they are used to explore
spatial autocorrelation and spatial cross-covariance in the data.

2.4. Map the data

The first step for any analysis is to map and examine the data. This provides you with a
first look at the spatial components of the dataset and may give indications of outliers and
erroneous data values, global trends, and the dominant directions of spatial
autocorrelation, among other factors, all of which are important in the development of an
interpolation model that accurately reflects the phenomenon you are interested in.

ArcGIS offers many ways to visualize data: ArcMap provides access to many classification
schemes and color ramps, which can be used to highlight different aspects of the data,
whereas ArcScene allows the data to be rendered in 3D space, which is useful when
looking for local outliers and global trends. While there is no one correct way to display the
data, the following figures illustrate different renderings of the same data that allow
different aspects of interest to be seen. For more detailed information on classification
schemes available in ArcGIS, see Classifying numerical fields for graduated symbols.

The initial view of the data provided by ArcMap uses the same symbol for all the sample
points. This view provides information on the spatial extent of the samples, coverage of the
study area (if a boundary is available), and indicates whether there were areas that were
more heavily or intensely sampled than others (called preferential sampling). In some
interpolation models (specifically simple kriging models built as a basis for geostatistical
simulation and disjunctive kriging models), it is important to use a declustering technique
(see Implementing declustering to adjust for preferential sampling) to obtain a dataset that
is representative of the phenomenon and is not affected by oversampling in high- or low-
valued regions of the study area.
A second step in mapping the data is to use a classification scheme and color ramp that
show data values and their spatial relationship. By default, ArcMap will apply a natural
breaks classification to the data. This is shown in the following figure, which uses five
classes and a color scheme with blue for cold water temperatures and red for warmer
water temperatures.
Natural breaks looks for statistically large differences between adjacent pairs of data (the
data is sorted by value, not by location). In this case, warmer temperatures occur on the
westernmost samples, while those in the center of the study area are colder. Samples
closest to mainland Alaska show warmer temperatures. The map also shows that
temperatures are fairly constant along lines going from the northwest to the southeast.
These two findings can be interpreted as a trough of colder water in the center of the
sampled area, which runs from the northwest toward the southeast. This is a global trend
in the data and can be modeled as a second order polynomial using global polynomial
interpolation or local polynomial interpolation or as a trend in kriging.

Other methods that can be used to classify the data are equal interval (which uses classes
of equal width) and quantile (which breaks the data into classes that all have the same
number of data values). Both of these classifications are shown below and essentially
show the same spatial features as the natural breaks classification for this dataset.
A different view of the data is provided by a classification based on the statistical
distribution of the data values. This rendering can be helpful in identifying outliers and
erroneous data. The following figure uses the standard deviation classification and a color
ramp that shows positive deviations from the mean in red and negative deviations from the
mean in blue.

This classification refines the preliminary assessment: positive deviations from the mean
occur in the westernmost samples, while in the center of the sampled area, there is a zone
of colder temperatures (negative deviations from the mean) running from the northwest to
the southeast. Samples closest to the Alaskan mainland do not deviate much from the
mean (shown in yellow). The standard deviation classification can be adjusted manually to
represent a more common approach to finding outlying values: the class breaks are
adjusted to show values that deviate more than one standard deviation from the mean.
The central portion of the data (that is, values that fall between the mean minus one
standard deviation and the mean plus one standard deviation) will contain about 68 percent of
the data values if the data is normally (Gaussian) distributed. This adjusted classification is
shown below and shows more clearly those values that deviate significantly from the
mean. In this case, the standard deviation classification confirms what was observed in
using the natural breaks, equal interval, and quantile classifications.
In visually exploring the data, it may be worthwhile to investigate how the number of
classes affects the rendering of the data. The number of classes should be sufficient to
show local detail in the data values but not so many that general features would be hidden.
For the data used in these examples, five classes were adequate. Nine classes did not
add much to the maps and made interpreting the main spatial features less
straightforward.
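
For readers who want to see how some of these class breaks are derived, the following sketch
computes equal interval, quantile, and standard deviation breaks for a hypothetical set of water
temperature values (natural breaks requires an optimization step and is left to ArcMap).

# Compute equal interval, quantile, and standard deviation class breaks
# (conceptual sketch on hypothetical values; ArcMap provides these classifications,
# plus natural breaks, interactively).
import numpy as np

rng = np.random.default_rng(3)
temps = rng.normal(5.0, 1.5, size=300)     # hypothetical water temperatures
n_classes = 5

# Equal interval: classes of equal width between the minimum and maximum.
equal_interval_breaks = np.linspace(temps.min(), temps.max(), n_classes + 1)

# Quantile: classes that each contain the same number of data values.
quantile_breaks = np.quantile(temps, np.linspace(0, 1, n_classes + 1))

# Standard deviation: breaks at the mean plus or minus multiples of the standard deviation.
mean, std = temps.mean(), temps.std()
std_breaks = mean + np.array([-2, -1, 0, 1, 2]) * std

print("Equal interval:", np.round(equal_interval_breaks, 2))
print("Quantile:      ", np.round(quantile_breaks, 2))
print("Std. deviation:", np.round(std_breaks, 2))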

2.5. Exploratory Spatial Data Analysis

2.5.1. Quantitative data exploration

After mapping the data, a second stage of data exploration should be performed using the
Exploratory Spatial Data Analysis (ESDA) tools. These tools allow you to examine the data
in more quantitative ways than mapping it and let you gain a deeper understanding of the
phenomena you are investigating so that you can make more informed decisions on how
the interpolation model should be constructed. The most common tasks you should
perform to explore your data are the following:

 Examine the distribution of your data

 Look for global and local outliers

 Look for global trends

 Examine local variation

 Examine spatial autocorrelation

Not all these steps are necessary in all cases. For example, if you decide to use an
interpolation method that does not require a measure of spatial autocorrelation (GPI, LPI,
or RBF), then it is not necessary to explore spatial autocorrelation in the data. It may,
however, be a good idea to explore it anyway, as a significant amount of spatial
autocorrelation can lead to using a different interpolation method (kriging, for example)
than the one you had originally intended to use.

2.5.2. The ESDA tools

To help you accomplish these tasks, the ESDA tools allow different views into the data.
These views can be manipulated and explored, and all are interconnected among
themselves and with the data displayed in ArcMap through brushing and linking.

The ESDA tools are:

 Histogram

 Normal QQ Plot and General QQ Plot

 Trend Analysis

 Voronoi Map

 Semivariogram/Covariance Cloud

 Crosscovariance Cloud

2.5.3. Working with the ESDA tools: brushing and linking

The views in ESDA are interconnected by selecting (brushing) and highlighting the
selected points on all maps and graphs (linking). Brushing is a graphic way to perform a
selection in either the ArcMap data view or in an ESDA tool. Any selection that occurs in
an ESDA view or in the ArcMap data view is selected in all the ESDA windows as well as
in ArcMap, which is linking.

For the Histogram, Voronoi Map, QQ Plot, and Trend Analysis tools, the graph bars,
points, or polygons that are selected in the tool view are linked to points in the ArcMap
data view, which are also highlighted. For the Semivariogram/Covariance tools, points in
the plots represent pairs of locations, and when some points are selected in the tool, the
corresponding pairs of points are highlighted in the ArcMap data view, with a line
connecting each pair. When pairs of points in the ArcMap data view are selected, the
corresponding points are highlighted in the Semivariogram/Covariance plot.
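
To show what the Semivariogram/Covariance Cloud plots, the sketch below computes the cloud
directly from synthetic data: every pair of sample locations contributes one point, with half the
squared difference between the two values plotted against the distance separating them. The
tool itself adds the brushing and linking described above.

# Build a semivariogram cloud: one point per pair of sample locations, pairing half the
# squared difference in value with the separation distance (conceptual sketch on
# synthetic data).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
xy = rng.uniform(0, 100, size=(150, 2))                      # sample locations
z = np.sin(xy[:, 0] / 20.0) + 0.1 * rng.normal(size=150)     # hypothetical measurements

distances = pdist(xy)                                        # distance for every pair
semivariances = 0.5 * pdist(z[:, None], metric="sqeuclidean")  # 0.5 * (z_i - z_j)**2

# Each (distance, semivariance) pair is one point in the semivariogram cloud; binning
# these points by distance and averaging gives the empirical semivariogram.
print(distances[:5])
print(semivariances[:5])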

2.6. Data distributions and transformations

2.6.1. Examine the distribution of your data

Most of the interpolation methods provided by ArcGIS Geostatistical Analyst do not require
the data to be normally distributed, although if it is not, the prediction map may not be
optimal. That is, data transformations that change the shape (distribution) of the data are
not required as part of the interpolation model. However, certain kriging methods require
the data to be approximately normally distributed (close to a bell-shaped curve). In
particular, quantile and probability maps created using ordinary, simple, or universal
kriging assume that the data comes from a multivariate normal distribution. In addition,
simple kriging models, which are used as a basis for geostatistical simulation (that is,
models used as input to the Gaussian Geostatistical Simulation tool—refer to
Geostatistical simulation concepts and How Gaussian geostatistical simulations work),
should use data that is normally distributed or include a normal score transformation as
part of the model to ensure this.

Normally distributed data has a probability density function that looks like the one shown in
the following diagram:
The Histogram and Normal QQ plot are designed to help you explore the distribution of
your data, and they include different data transformations (Box–Cox, logarithmic, and
arcsine) so that you can assess the effects they have on the data. To learn more about the
transformations that are available in these tools, see Box-Cox, Arcsine, and Log
transformations.

All kriging methods rely on the assumption of stationarity. This assumption requires, in
part, that all data values come from distributions that have the same variability. Data
transformations can also be used to satisfy this assumption of equal variability. For more
information on stationarity, see Random processes with dependence.

2.6.2. Examining the distribution of your data using histograms and normal
QQ plots

The ESDA tools (refer to Exploratory Spatial Data Analysis) help you examine the
distribution of your data.

When checking whether your data is normally distributed (close to a bell-shaped curve),
the Histogram and Normal QQ Plots will help you. In the summary statistics provided by
the Histogram, the mean and median will be similar, the skewness should be near zero,
and the kurtosis should be near 3 if the data is normally distributed. If the data is highly
skewed, you may choose to transform it to see if you can make it more normally
distributed. Note that the back-transformation process generates approximately unbiased
predictions with approximate kriging standard errors when you use Universal or Ordinary
Kriging.

The Normal QQ plot provides a visual comparison of your dataset to a standard normal
distribution, and you can investigate points that cause departures from a normal
distribution by selecting them in the plot and examining their locations on a map. For an
example, refer to Normal QQ and general QQ plots. Data transformations can also be
used in the Normal QQ Plot.

Steps:

 Click the point feature layer in the ArcMap table of contents that you want to
examine.
 Click the Geostatistical Analyst toolbar, point to Explore Data, then click either
Histogram or Normal QQ Plot.
Tip: In the Histogram, make sure that the Statistics box is checked to see
summary statistics for the data.

Tip: In the Normal QQ Plot, the points will fall close to the 45-degree reference line
if the data is normally distributed.

2.6.3. Histograms

The Histogram tool provides a univariate (one-variable) description of your data. The tool
dialog box displays the frequency distribution for the dataset of interest and calculates
summary statistics.

Frequency distribution

The frequency distribution is a bar graph that displays how often observed values fall
within certain intervals or classes. You can specify the number of classes of equal width
that are used in the histogram. The relative proportion of data that falls in each class is
represented by the height of each bar. For example, the histogram below shows the
frequency distribution (10 classes) for a dataset.

Summary statistics
The important features of a distribution can be summarized by statistics that describe its
location, spread, and shape.

Measures of location

Measures of location provide you with an idea of where the center and other parts of the
distribution lie.

 The mean is the arithmetic average of the data. The mean provides a measure of
the center of the distribution.

 The median value corresponds to a cumulative proportion of 0.5. If the data was
arranged in increasing order, 50 percent of the values would lie below the median,
and 50 percent of the values would lie above the median. The median provides
another measure of the center of the distribution.

 The first and third quartiles correspond to the cumulative proportion of 0.25 and
0.75, respectively. If the data was arranged in increasing order, 25 percent of the
values would lie below the first quartile, and 25 percent of the values would lie
above the third quartile. The first and third quartiles are special cases of quantiles.
The quantiles are calculated as follows:

quantile = (i - 0.5) / N

where i is the rank of the ith ordered data value and N is the total number of values (for
example, with N = 10, the fifth ordered value is assigned the quantile (5 - 0.5)/10 = 0.45).

Measures of spread

The spread of points around the mean value is another characteristic of the displayed
frequency distribution.

 The variance of the data is the average squared deviation of all values from the
mean. Because it involves squared differences, the calculated variance is sensitive
to unusually high or low values. The variance is estimated by summing the squared
deviations from the mean and dividing the sum by (N-1).

 The standard deviation is the square root of the variance, and it describes the
spread of the data about the mean. The smaller the variance and standard
deviation, the tighter the cluster of measurements about the mean value.

The diagram below shows two distributions with different standard deviations. The
frequency distribution represented by the black line is more variable (wider spread) than
the frequency distribution represented by the red line. The variance and standard deviation
for the black frequency distribution are greater than those for the red frequency
distribution.

Measures of shape

The frequency distribution is also characterized by its shape.

The coefficient of skewness is a measure of the symmetry of a distribution. For symmetric
distributions, the coefficient of skewness is zero. If a distribution has a long right tail of
large values, it is positively skewed, and if it has a long left tail of small values, it is
negatively skewed. The mean is larger than the median for positively skewed distributions
and vice versa for negatively skewed distributions. The image below shows a positively
skewed distribution.
Kurtosis is based on the size of the tails of a distribution and provides a measure of how
likely it is that the distribution will produce outliers. The kurtosis of a normal distribution is
equal to three. Distributions with relatively thick tails are termed leptokurtic and have
kurtosis greater than three. Distributions with relatively thin tails are termed platykurtic and
have a kurtosis less than three. In the following diagram, a normal distribution is given in
red, and a leptokurtic (thick-tailed) distribution is given in black.
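
The location, spread, and shape statistics described above can be computed directly, as in this
sketch on a synthetic, positively skewed dataset. Note that scipy reports excess kurtosis by
default, so fisher=False is passed to match the convention used here, in which a normal
distribution has a kurtosis of 3.

# Summary statistics for a dataset: location, spread, and shape (conceptual sketch on
# synthetic data; the Histogram tool reports the same quantities).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.lognormal(mean=1.0, sigma=0.5, size=500)   # hypothetical, positively skewed data

print("Mean:               ", data.mean())
print("Median:             ", np.median(data))
print("1st / 3rd quartiles:", np.quantile(data, [0.25, 0.75]))
print("Variance (N - 1):   ", data.var(ddof=1))       # sum of squared deviations / (N - 1)
print("Std. deviation:     ", data.std(ddof=1))
print("Skewness:           ", stats.skew(data))       # near zero for symmetric data
print("Kurtosis:           ", stats.kurtosis(data, fisher=False))  # 3 for a normal distribution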

Examples

With the Histogram tool, you can examine the shape of the distribution by direct
observation. By reviewing the mean and median statistics, you can determine the center
location of the distribution. Notice that in the figure below the distribution is bell-shaped,
and since the mean and median values are very close, this distribution is close to normal.
You can also highlight the extreme values in the tail of the histogram and see how they are
spatially located in the displayed map.
If your data is highly skewed, you can test the effects of a transformation on your data.
This figure shows a skewed distribution before a transformation is applied.
A log transformation is applied to the skewed data, and in this case, the transformation
makes the distribution close to normal.
2.6.4. Normal QQ plot and general QQ plot

Quantile-quantile (QQ) plots are graphs on which quantiles from two distributions are
plotted relative to each other.

How the Normal QQ plot is constructed

First, the data values are ordered and cumulative distribution values are calculated as
(i - 0.5)/n for the ith ordered value out of n total values (this gives the proportion of the data
that falls below a certain value). A cumulative distribution graph is produced by plotting the
ordered data versus the cumulative distribution values (graph on the top left in the figure
below). The same process is done for a standard normal distribution (a Gaussian
distribution with a mean of 0 and a standard deviation of 1, shown in the graph on the top
right of the figure below). Once these two cumulative distribution graphs have been
generated, data values corresponding to specific quantiles are paired and plotted in a QQ
plot (bottom graph in the figure below).

How the general QQ plot is constructed

General QQ plots are used to assess the similarity of the distributions of two datasets.
These plots are created following a similar procedure as described for the Normal QQ plot,
but instead of using a standard normal distribution as the second dataset, any dataset can
be used. If the two datasets have identical distributions, points in the general QQ plot will
fall on a straight (45-degree) line.
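
A minimal sketch of both constructions on synthetic data follows: the normal QQ plot pairs the
ordered data values with standard normal quantiles at the cumulative proportions (i - 0.5)/n, and
the general QQ plot pairs the ordered values of two datasets of the same size (with datasets of
different sizes, quantiles are typically interpolated first).

# Construct the point pairs for a normal QQ plot and a general QQ plot
# (conceptual sketch on synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a = np.sort(rng.lognormal(0.0, 0.5, size=200))        # hypothetical dataset 1, ordered
b = np.sort(rng.lognormal(0.1, 0.5, size=200))        # hypothetical dataset 2, ordered

n = len(a)
p = (np.arange(1, n + 1) - 0.5) / n                   # cumulative proportions (i - 0.5)/n

# Normal QQ plot: dataset quantiles versus standard normal quantiles.
normal_quantiles = stats.norm.ppf(p)                  # mean 0, standard deviation 1
normal_qq_pairs = np.column_stack([normal_quantiles, a])

# General QQ plot: quantiles of one dataset versus quantiles of the other.
general_qq_pairs = np.column_stack([a, b])            # equal sizes, so quantiles align directly

# If the data is normal (or the two distributions are identical), the plotted points fall
# on a straight 45-degree reference line.
print(normal_qq_pairs[:3])
print(general_qq_pairs[:3])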
Examining data distributions using QQ plots

Points on the Normal QQ plot provide an indication of univariate normality of the dataset. If
the data is normally distributed, the points will fall on the 45-degree reference line. If the
data is not normally distributed, the points will deviate from the reference line.

In the diagram below, the quantile values of the standard normal distribution are plotted on
the x-axis in the Normal QQ plot, and the corresponding quantile values of the dataset are
plotted on the y-axis. You can see that the points fall close to the 45-degree reference line.
The main departure from this line occurs at high values of ozone concentration.

The Normal QQ Plot tool allows you to select the points that do not fall close to the
reference line. The locations of the selected points are then highlighted in the ArcMap data
view. As seen below, they are concentrated around the San Francisco Bay area (points
shaded in pink on the map below).
An example of using data transformations

A Normal QQ plot of an example dataset is presented here:


Notice how the points stray from the straight line.

However, as can be seen in the figure below, when a log transformation is applied to the
dataset, the points lie closer to the 45-degree reference line.
Box-Cox and arcsine transformations can also be applied to the data within the Normal
QQ Plot tool to assess their effect on the normality of the distribution.

2.6.5. Data transformations

Box-Cox, arcsine, and log transformations

Some methods in Geostatistical Analyst require that the data be normally distributed.
When the data is skewed (the distribution is lopsided), you might want to transform the
data to make it normal. The Histogram and Normal QQ Plot allow you to explore the
effects of different transformations on the distribution of the dataset. If the interpolation
model you build uses one of the kriging methods, and you choose to transform the data as
one of the steps, the predictions will be transformed back to the original scale in the
interpolated surface.

Geostatistical Analyst allows the use of several transformations including Box-Cox (also
known as power transformations), arcsine, and logarithmic. Suppose you observe data
Z(s), and apply some transformation Y(s) = t(Z(s)). Usually, you want to find the
transformation so that Y(s) is normally distributed. What often happens is that the
transformation also yields data that has constant variance through the study area.

Box-Cox transformation

The Box-Cox transformation is

Y(s) = (Z(s)^λ - 1)/λ,

for λ ≠ 0.

For example, suppose that your data is composed of counts of some phenomenon. For
these types of data, the variance is often related to the mean. That is, if you have small
counts in part of your study area, the variability in that local region will be smaller than the
variability in another region where the counts are larger. In this case, the square-root
transformation may help to make the variances more constant throughout the study area
and often makes the data appear normally distributed as well. The square-root
transformation is a special case of the Box-Cox transformation when λ = ½.
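
As a quick illustration of the formula above, here is a minimal Box-Cox sketch in Python (numpy only). It is not the Geostatistical Analyst implementation; the function name is just for the example:

```python
# Illustrative Box-Cox transformation, following the formula above.
import numpy as np

def box_cox(z, lam):
    """Y = (Z**lam - 1) / lam for lam != 0; lam = 0 corresponds to the log transform."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.log(z)
    return (z ** lam - 1.0) / lam

counts = np.array([1.0, 4.0, 9.0, 16.0, 100.0])
y = box_cox(counts, 0.5)   # lambda = 1/2 acts like a square-root transformation
```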

Log transformation

The log transformation is actually a special case of the Box-Cox transformation when λ =
0; the transformation is as follows:

Y(s) = ln(Z(s)),

for Z(s) > 0, and ln is the natural logarithm.

The log transformation is often used where the data has a positively skewed distribution
(shown below) and there are a few very large values. If these large values are located in
your study area, the log transformation will help make the variances more constant and
normalize your data. Concerning terminology, when a log transformation is implemented
with kriging, the prediction method is known as lognormal kriging, whereas for all other
values of λ, the associated kriging method is known as trans-Gaussian kriging.
Arcsine transformation

The arcsine transformation is shown below:

Y(s) = sin⁻¹(Z(s)),

for Z(s) between 0 and 1.

The arcsine transformation can be used for data that represents proportions or
percentages. Often, when the data is proportions, the variance is smallest near 0 and 1
and largest near 0.5. The arcsine transformation will help make the variances more
constant throughout your study area and often makes the data appear normally distributed
as well.
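
For completeness, the log and arcsine transformations described above can be sketched the same way (numpy only, illustrative; the arcsine form follows the formula given in this section):

```python
# Illustrative log and arcsine transformations, as defined above.
import numpy as np

def log_transform(z):
    """Y = ln(Z), for Z > 0."""
    return np.log(np.asarray(z, dtype=float))

def arcsine_transform(z):
    """Y = arcsin(Z), for Z between 0 and 1 (proportions)."""
    return np.arcsin(np.asarray(z, dtype=float))

proportions = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
y = arcsine_transform(proportions)   # helps stabilize variance near 0 and 1
```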

2.6.6. Using Box-Cox, arcsine, and log transformations

Data transformations can be used to make the variances constant throughout your study
area and make the data distribution closer to normal. Understanding transformations and
trends provides a more detailed discussion of transformations in Geostatistical Analyst.

Use the Histogram, Normal QQ Plot, and Voronoi Map tools in Exploratory Spatial Data Analysis to try different transformations and assess their effects on the distribution of the data. Box-Cox, arcsine, and log transformations discusses each of these transformations in more detail.

Keep in mind that some geostatistical methods assume and require normally distributed data. These include quantile and probability maps generated using ordinary, simple, and universal kriging, as well as any map generated using disjunctive kriging or the Gaussian Geostatistical Simulations geoprocessing tool.
Steps:

• Choose Kriging/CoKriging under Geostatistical methods in the Geostatistical Wizard method selection window.

• Click the Next button.

• Choose the desired transformation from the Transformation type drop-down menu for each dataset to which you would like to apply a transformation.

• Click Next.

• Complete the remaining steps in the Geostatistical Wizard to create a surface.

Tip: In creating the surface, the Geostatistical Wizard automatically back-transforms the predicted values into their original units. There is no need to back-transform the values yourself.

To decide which type of transformation might be the most appropriate for your data, go to
Explore Data of the Geostatistical Toolbar and choose the Histogram option. Use the
Transformation section of the interface to display the histogram after different
transformations have been applied to the data. Choose the transformation that makes your
data have a distribution that is closest to a normal distribution.

2.6.7. Normal score transformation

Some interpolation and simulation methods require the input data to be normally
distributed (refer to Examine the distribution of your data for a list of these methods). The
normal score transformation (NST) is designed to transform your dataset so that it closely
resembles a standard normal distribution. It does this by ranking the values in your dataset
from lowest to highest and matching these ranks to equivalent ranks generated from a
normal distribution. Steps in the transformation are as follows: your dataset is sorted and
ranked, an equivalent rank from a standard normal distribution is found for each rank from
your dataset, and the normal distribution values associated with those ranks make up the
transformed dataset. The ranking process can be done using the frequency distribution or
the cumulative distribution of the datasets.
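
The ranking-and-matching procedure just described can be sketched in a few lines (numpy/scipy; illustrative only, not the approximation methods actually used by Geostatistical Analyst, which are described below):

```python
# Illustrative rank-based normal score transformation.
import numpy as np
from scipy import stats

def normal_score_transform(z):
    """Replace each value by the standard normal quantile of its rank."""
    z = np.asarray(z, dtype=float)
    ranks = stats.rankdata(z, method="average")   # ranks 1..n, ties averaged
    p = ranks / (z.size + 1.0)                    # cumulative probabilities in (0, 1)
    return stats.norm.ppf(p)                      # standard normal scores

skewed = np.random.default_rng(1).lognormal(size=500)
scores = normal_score_transform(skewed)           # approximately N(0, 1)
```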

Examples showing histograms and cumulative distributions before and after a normal
score transformation was applied are shown below:
Histograms before and after a normal score transformation

Cumulative distributions before and after a normal score transformation

Approximation methods

In Geostatistical Analyst, there are four approximation methods: direct, linear, Gaussian
kernels, and multiplicative skewing. The direct method uses the observed cumulative
distribution, the linear method fits lines between each step of the cumulative distribution,
and the Gaussian kernels method approximates the cumulative distribution by fitting a
linear combination of component cumulative normal distributions. Multiplicative skewing
approximates the cumulative distribution by fitting a base distribution (Student's t,
lognormal, gamma, empirical, and log empirical) that is then skewed by a fitted linear
combination of beta distributions (the skewing is done with the inverse probability integral
transformation). Lognormal, gamma, and log empirical base distributions can only be used
for positive data, and the predictions are guaranteed to be positive. Akaike's Information
Criterion (AIC) is provided to judge the quality of the fitted model.

After the Geostatistical Wizard makes predictions on the transformed scale, it automatically transforms them back to the original scale. The choice of approximation
method depends on the assumptions you are willing to make and the smoothness of the
approximation. The direct method is the least smooth and has the fewest assumptions,
and the linear method is intermediate. The Gaussian kernels and multiplicative skewing
methods have smooth reverse transformations but assume that the data distribution can
be approximated by a finite combination of known distributions.

2.6.8. Using normal score transformations

Normal score transformation (NST) changes your data so that it follows a univariate
standard normal distribution. This is an important step when you create quantile or
probability maps using simple, probability, or disjunctive kriging. Also, simple kriging
models that will be used as input to the Gaussian Geostatistical Simulations tool should be
based on normally distributed data. The steps below describe how to apply a normal score
transformation to the data when using simple kriging.

Steps:

1. Choose Kriging/CoKriging under Geostatistical Methods on the Geostatistical Wizard method selection window.

2. Click Next.

3. Choose Simple from the list of kriging methods. Transformation type is automatically set to Normal Score.

4. Click Next.

5. Choose the number of bars you want to display in the Density chart by setting the slider in the upper right part of the interface. Click Cumulative to switch the display to a cumulative distribution of the data, or click Normal QQPlot to display the normal quantile-quantile plot of the data after the transformation.

6. Choose Direct, Linear, Gaussian Kernels, or Multiplicative Skewing from the Approximation method list.
   o If Multiplicative Skewing is chosen, choose a base distribution: Student's t, Lognormal, Gamma, Empirical, or Log Empirical.

7. If you are using cokriging and need to transform the other dataset(s), switch datasets by clicking the Dataset Selection arrow. Define a normal score transformation for the other datasets following steps 5 and 6.

8. Click Next and continue with the other steps in Geostatistical Wizard to create a surface using a normal score transformation on your data.

2.6.9. Comparing normal score transformations to other transformations

The Normal score transformation (NST) is different from the Box-Cox, arcsine, and log
transformations (BAL) in several ways:

• The NST function adapts to each particular dataset, whereas BAL transformations
do not (for example, the log transformation function always takes the natural
logarithm of the data).

• The goal of the NST is to make the random errors of the whole population (not only
the sample) normally distributed. Due to this, it is important that the cumulative
distribution of the sample accurately reflects the true cumulative distribution of the
whole population (this requires correct sampling of the population and possibly
declustering to account for preferential sampling in some locations of the study
area). BAL, on the other hand, affects the sample data and can have goals of
stabilizing the variance, correcting skewness, or making the distribution closer to
normally distributed.

• The NST must occur after detrending the data so that covariance and
semivariograms are calculated on residuals after trend correction. In contrast, BAL
transformations are used to attempt to remove any relationship between the
variance and the trend. Because of this, after the BAL transformation has been
applied to the data, you can optionally remove the trend and model spatial
autocorrelation. A consequence of this process is that you often get residuals that
are approximately normally distributed, but this is not a specific goal of BAL
transformations like it is for the NST transformation.

2.7. • Global and Local outliers


2.7.1. Looking for global and local outliers

A global outlier is a measured sample point that has a very high or a very low value
relative to all the values in a dataset. For example, if 99 out of 100 points have values
between 300 and 400, but the 100th point has a value of 750, the 100th point may be a
global outlier.

A local outlier is a measured sample point that has a value within the normal range for the
entire dataset, but if you look at the surrounding points, it is unusually high or low. For
example, the diagram below is a cross section of a valley in a landscape. However, there
is one point in the center of the valley that has an unusually high value relative to its
surroundings, but it is not unusual compared to the entire dataset.

Local outliers

It is important to identify outliers for two reasons: they may be real abnormalities in the
phenomenon, or the value might have been measured or recorded incorrectly.

If an outlier is an actual abnormality in the phenomenon, this may be the most significant
point of the study and for understanding the phenomenon. For instance, a sample on the
vein of a mineral ore might be an outlier and the location that is most important to a mining
company.

If outliers are caused by errors during data entry that are clearly incorrect, they should
either be corrected or removed before creating a surface. Outliers can have several
detrimental effects on your prediction surface because of effects on semivariogram
modeling and the influence of neighboring values.

Looking for outliers through the Histogram tool

The Histogram tool enables you to select points on the tail of the distribution. The selected
points are displayed in the ArcMap data view. If the extreme values are isolated locations
(for instance, surrounded by very different values), they may require further investigation
and be removed if necessary.
Histogram and QQ Plot Map

In the example above, the high ozone values are not outliers and should not be removed
from the dataset.

Identifying outliers through Semivariogram/Covariance cloud

If you have a global outlier with an unusually high value in your dataset, all pairings of
points with that outlier will have high values in the Semivariogram cloud, no matter what
the distance is. This can be seen in the semivariogram cloud and in the histogram shown
below. Notice that there are two main strata of points in the semivariogram. If you brush
points in the upper stratum, as demonstrated in the image, you can see in the ArcMap view
that all these high values come from pairings with a single location— a global outlier. Thus,
the upper stratum of points has been created by all the locations pairing with the single
outlier, and the lower stratum is composed of the pairings among the rest of the locations.
When you look at the histogram, you can see one high value on the right tail of the
histogram, again identifying the global outlier. This value was probably entered incorrectly
and should be removed or corrected.

Global outlier

When there is a local outlier, the value will not be out of the range of the entire distribution
but will be unusual relative to the surrounding values. In the local outlier example shown
below, you can see that pairs of locations that are close together have high semivariogram
values (these points are on the far left on the x-axis, indicating that they are close together,
and have high values on the y-axis, indicating that the semivariogram values are high).
When these points are brushed, you can see that all these points are pairings to a single
location. When you look at the histogram, you can see that there is no single value that is
unusual. The location in question is highlighted in the lower tail of the histogram and is
paired with higher surrounding values (see the highlighted points in the histogram). This
location may be a local outlier. Further investigation should be made before deciding if the
value at that point is erroneous or in fact reflects a true characteristic of the phenomenon
and should be included as part of the model.

Local outlier

Looking for outliers through Voronoi mapping


Voronoi maps based on the cluster and entropy methods can be used to identify possible
outliers.

Entropy values provide a measure of dissimilarity between neighboring cells. In nature, you would expect that things closer together are more likely to be similar than things farther apart. Therefore, local outliers may be identified by areas of high entropy.

The cluster method identifies those cells that are dissimilar to their surrounding neighbors.
You would expect the value recorded in a particular cell to be similar to at least one of its
neighbors. Therefore, this tool may be used to identify possible outliers.

Identifying global outliers using the Histogram tool

Steps:

• Click the point or polygon feature layer in the ArcMap table of contents that you want to explore.

• Click the Geostatistical Analyst arrow on the Geostatistical Analyst toolbar, click Explore Data, then click Histogram.

If you see an isolated bar at the very left (the extreme minimum) or the very right (the extreme maximum) of the histogram, the point that the bar represents may be an outlier. The more isolated such a bar is from the main group of bars in the histogram, the more likely it is that the point is indeed an outlier.

Identifying local outliers using the Semivariogram/Covariance Cloud tool

The Semivariogram/Covariance Cloud tool is useful for detecting local outliers. They
appear as points that are close together (low values on the x-axis) but are high on the y-
axis, indicating that the two points making up that pair have very different values. This is
contrary to what you would expect—namely, that points that are close together have
similar values.

Steps:

• Click the point or polygon feature layer in the ArcMap table of contents that you want to explore.

• Click the Geostatistical Analyst drop-down menu on the Geostatistical Analyst toolbar, click Explore Data, then click Semivariogram/Covariance Cloud.
Identifying local outliers using the Voronoi Map tool

Steps:

• Click the point or polygon feature layer in the ArcMap table of contents that you want to explore.

• Click the Geostatistical Analyst drop-down menu on the Geostatistical Analyst toolbar, click Explore Data, then click Voronoi Map.

When viewing the Voronoi map, check whether any neighborhood contains polygons whose colors symbolize very different categories of values. This is data dependent, but in general, a polygon whose class is separated by two or more classes from those of its surrounding polygons should catch your attention as a potential outlier.

Since all Voronoi map types except Simple are based on neighborhood calculations, the Simple Voronoi map should be examined when searching for outliers.

2.8. • Trend analysis

You may be interested in mapping a trend, or you might want to remove a trend from the
dataset when using kriging. The Trend Analysis tool can help identify trends in the input
dataset.

The Trend Analysis tool provides a three-dimensional perspective of the data. The
locations of sample points are plotted on the x,y plane. Above each sample point, the
value is given by the height of a stick in the z-dimension. A unique feature of the Trend
Analysis tool is that the values are then projected onto the x,z plane and the y,z plane as
scatterplots. This can be thought of as sideways views through the three-dimensional data.
Polynomials are then fit through the scatterplots on the projected planes. An additional
feature is that you can rotate the data to isolate directional trends. The tool also includes
other features that allow you to rotate and vary the perspective of the whole image,
change size and color of points and lines, remove planes and points, and select the order
of the polynomial that is to fit the scatterplots. By default, the tool will select second-order
polynomials to show trends in the data, but you may want to investigate polynomials of
order one and three to assess how well they fit the data.
Trend Analysis tool
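
The projection-and-fit idea behind the tool can be sketched with numpy. This is only an illustration of the concept; the function name and sample data are invented for the example:

```python
# Illustrative sketch: project the attribute onto the x,z and y,z planes and fit
# a polynomial of a chosen order to each projection, as the Trend Analysis tool does.
import numpy as np

def projected_trends(x, y, z, order=2):
    """Return polynomial fits for the (x, z) and (y, z) projections."""
    trend_xz = np.poly1d(np.polyfit(x, z, order))   # projection onto the x,z plane
    trend_yz = np.poly1d(np.polyfit(y, z, order))   # projection onto the y,z plane
    return trend_xz, trend_yz

rng = np.random.default_rng(2)
x, y = rng.uniform(0, 100, 200), rng.uniform(0, 100, 200)
z = 50 - 0.01 * (x - 50) ** 2 + rng.normal(0, 2, 200)   # inverted-U trend along x
trend_xz, trend_yz = projected_trends(x, y, z, order=2)
# A nearly flat fitted curve suggests no trend in that direction; a clear U or
# inverted-U shape suggests a second-order trend.
```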

2.8.1. Examining global trends through trend analysis

The Trend Analysis tool raises the points above a plot of the study site to the height of the
values of the attribute of interest in a three-dimensional plot of the study area. The points
are then projected in two directions (by default, north and west) onto planes that are
perpendicular to the map plane. A polynomial curve is fit to each projection. The entire
map surface can be rotated in any direction, which also changes the direction represented
by the projected planes. If the curve through the projected points is flat, no trend exists, as
shown by the blue line in the illustration below.
Trend analysis flat

If there is a definite pattern to the polynomial, such as an upward curve as shown by the
green line in the illustration above, this suggests that there is a trend in the data.

In the example below, the trend is accentuated, and it demonstrates a strong upside-down
U shape. This suggests that a second-order polynomial can be fit to the data. Through the
refinement allowed in the Trend Analysis tool, the true direction of the trend can be
identified. In this case, its strongest influence is from the center of the region toward all the
borders (that is, the highest values occur in the center of the region, and lower values
occur near the edges).
Trend analysis upside-down U shape

2.8.2. Looking for global trends

To identify a global trend in your data, look for a curve that is not flat on the projected
plane.

If you have a global trend in your data, you may want to create a surface using one of the
deterministic interpolation methods (for example, global or local polynomial), or you may
wish to remove the trend when using kriging.

Steps:

• Click the point or polygon feature layer in the ArcMap table of contents that you want to explore.

• Click the Geostatistical Analyst drop-down menu on the Geostatistical Analyst toolbar, click Explore Data, then click Trend Analysis.

• On the Trend Analysis interface, click the Trend and Projections choice under the Graph Options.

• Explore the bold lines on the vertical walls of the graph. These lines indicate trends. One trend line goes along the x-axis (typically showing the longitudinal trend), while the other shows the trend along the y-axis (typically the latitudinal trend). It is very useful to change the Order of Polynomial while examining the trends.

Tip: It can be very helpful to check for trends in directions that vary from the
standard N–S and E–W. To enable such a view, rotate the trend axes by scrolling the
upper wheel on the right-hand side of the tool, just under the main display window.

2.8.3. Modeling global trends

A surface may be made up of two main components: a fixed global trend and random
short-range variation. The global trend is sometimes referred to as the fixed mean
structure. Random short-range variation (sometimes referred to as random error) can be
modeled in two parts: spatial autocorrelation and the nugget effect.

If you decide a global trend exists in your data, you must decide how to model it. Whether
you use a deterministic method or a geostatistical method to create a surface usually
depends on your objective. If you want to model just the global trend and create a smooth
surface, you may use a global or local polynomial interpolation method to create a final
surface. However, you may want to incorporate the trend in a geostatistical method (for
instance, remove the trend and model the remaining component as random short-range
variation). The main reason to remove a trend in geostatistics is to satisfy stationarity
assumptions. Trends should only be removed if there is justification for doing so.

If you remove the global trend in a geostatistical method, you will be modeling the random
short-range variation in the residuals. However, before making an actual prediction, the
trend will be automatically added back so that you obtain reasonable results.
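
A minimal sketch of that decomposition, assuming a simple first-order (planar) trend fitted by least squares (numpy; illustrative only — the actual trend models in Geostatistical Analyst are configured in the wizard):

```python
# Illustrative first-order trend removal: fit z ~ a + b*x + c*y, keep the residuals.
import numpy as np

def fit_planar_trend(x, y, z):
    """Least-squares coefficients (a, b, c) of the planar trend a + b*x + c*y."""
    A = np.column_stack([np.ones_like(x), x, y])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

def detrend(x, y, z, coeffs):
    """Return (residuals, trend); predictions would add the trend back at the end."""
    trend = coeffs[0] + coeffs[1] * x + coeffs[2] * y
    return z - trend, trend

# Usage: residuals, trend = detrend(x, y, z, fit_planar_trend(x, y, z))
```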

If you deconstruct your data into a global trend plus short-range variation, you are
assuming that the trend is fixed and the short-range variation is random. Here, random
does not mean unpredictable, but rather that it is governed by rules of probability that
include dependence on neighboring values called autocorrelation. The final surface is the
sum of the fixed and random surfaces. That is, think of adding two layers, one that never
changes and another that changes randomly. For example, suppose you are studying
biomass. If you were to go back in time 1,000 years and start over to the present day, the
global trend of the biomass surface would be unchanged. However, the short-range
variation of the biomass surface would change. The unchanging global trend could be due
to fixed effects such as topography. Short-range variation could be caused by less
permanent features that could not be observed through time, such as precipitation, so it is
assumed it is random and likely to be autocorrelated.

If you can identify and quantify the trend, you will gain a deeper understanding of your data
and make better decisions. If you remove the trend, you will be able to model the random short-range variation more accurately, because the global trend will no longer interfere with the assumption of data stationarity on which kriging relies.

2.9. Examining local variation

Voronoi maps are constructed from a series of polygons formed around the location of a
sample point.

Voronoi polygons are created so that every location within a polygon is closer to the
sample point in that polygon than to any other sample point. After the polygons are created,
neighbors of a sample point are defined as any other sample point whose polygon shares
a border with the chosen sample point. For example, in the following figure, the bright
green sample point is enclosed by a polygon, which has been highlighted in red. Every
location within the red polygon is closer to the bright green sample point than to any other
sample point (given as small dark blue dots). The blue polygons all share a border with the
red polygon, so the sample points within the blue polygons are neighbors of the bright
green sample point.
Using this definition of neighbors, a variety of local statistics can be computed. For
example, a local mean is computed by taking the average of the sample points in the red
and blue polygons. This average is then assigned to the red polygon. This process is
repeated for all polygons and their neighbors, and the results are shown using a color
ramp to help visualize regions of high and low local values.

The Voronoi Map tool provides the following methods to assign or calculate values for the
polygons.

• Simple: The value assigned to a polygon is the value recorded at the sample point
within that polygon.

• Mean: The value assigned to a polygon is the mean value that is calculated from
the polygon and its neighbors.

• Mode: All polygons are categorized using five class intervals. The value assigned
to a polygon is the mode (most frequently occurring class) of the polygon and its
neighbors.

• Cluster: All polygons are categorized using five class intervals. If the class interval
of a polygon is different from each of its neighbors, the polygon is colored gray and
put into a sixth class to distinguish it from its neighbors.

• Entropy: All polygons are categorized using five classes based on a natural
grouping of data values (smart quantiles). The value assigned to a polygon is the
entropy that is calculated from the polygon and its neighbors—that is,

Entropy = - Σ (p_i * log2 p_i),

where p_i is the proportion of polygons that are assigned to each class. For
example, consider a polygon surrounded by four neighbors (a total of five
polygons). The values are placed into the corresponding classes:

Class   Frequency   p_i
1       3           3/5
2       0           0
3       1           1/5
4       0           0
5       1           1/5
Entropy class/frequency

The entropy assigned to the polygon will be

E = -[0.6 * log2(0.6) + 0.2 * log2(0.2) + 0.2 * log2(0.2)] = 1.371

Minimum entropy occurs when the polygon values are all located in the same class. Then,

E_min = -[1 * log2(1)] = 0

Maximum entropy occurs when each polygon value is located in a different class interval. Then,

E_max = -[0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2)] = 2.322

(A code sketch of this entropy calculation appears after this list.)

• Median: The value assigned to a polygon is the median value calculated from the
frequency distribution of the polygon and its neighbors.

• Standard Deviation: The value assigned to a polygon is the standard deviation that is calculated from the polygon and its neighbors.

• Interquartile Range: The first and third quartiles are calculated from the frequency
distribution of a polygon and its neighbors. The value assigned to the polygon is the
interquartile range calculated by subtracting the value of the first quartile from the
value of the third quartile.
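
Here is the entropy calculation from the worked example above as a short Python sketch (numpy; illustrative only):

```python
# Illustrative entropy of the class memberships of a polygon and its neighbors.
import numpy as np

def voronoi_entropy(neighbor_classes):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(neighbor_classes, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# The five polygons from the example: three in class 1, one in class 3, one in class 5.
print(voronoi_entropy([1, 1, 1, 3, 5]))   # about 1.371, matching the value above
```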

The Voronoi statistics can be used for different purposes and can be grouped into the
following general functional categories:

Functional category      Voronoi statistics
Local Smoothing          Mean, Mode, Median
Local Variation          Standard Deviation, Interquartile Range, Entropy
Local Outliers           Cluster
Local Influence          Simple

2.10. Examining spatial autocorrelation

2.10.1. Examining spatial autocorrelation and directional variation

By exploring your data, you'll gain a better understanding of the spatial autocorrelation
among the measured values. This understanding can be used to make better decisions
when choosing models for spatial prediction.

Spatial autocorrelation
You can explore the spatial autocorrelation in your data by examining the different pairs of
sample locations. By measuring the distance between two locations and plotting the
difference squared between the values at the locations, a semivariogram cloud is created.
On the x-axis is the distance between the locations, and on the y-axis is the difference of
their values squared. Each dot in the semivariogram represents a pair of locations, not the
individual locations on the map.

If spatial correlation exists, pairs of points that are close together (on the far left of the x-
axis) should have less difference (be low on the y-axis). As points become farther away
from each other (moving right on the x-axis), in general, the difference squared should be
greater (moving up on the y-axis). Often there is a certain distance beyond which the
squared difference levels out. Pairs of locations beyond this distance are considered to be
uncorrelated.
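
A minimal sketch of how such a cloud can be computed from coordinates and values (numpy; illustrative only — the squared difference is plotted here exactly as described above; the conventional semivariance is half that quantity):

```python
# Illustrative semivariogram cloud: squared value differences versus separation distance.
import numpy as np

def semivariogram_cloud(coords, values):
    """Return (distance, squared difference) for every unique pair of points."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(values.size, k=1)               # all unique pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)   # pair separation
    sqdiff = (values[i] - values[j]) ** 2                  # halve this for semivariance
    return dist, sqdiff
```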

A fundamental assumption for geostatistical methods is that any two locations that have a
similar distance and direction from each other should have a similar difference squared.
This relationship is called stationarity.

Spatial autocorrelation may depend only on the distance between two locations, which is
called isotropy. However, it is possible that the same autocorrelation value may occur at
different distances when considering different directions. Another way to think of this is that
things are more alike for longer distances in some directions than in other directions. This
directional influence is seen in semivariograms and covariances and is called anisotropy.

It is important to look for anisotropy so that if you detect directional differences in the
autocorrelation, you can account for them in the semivariogram or covariance models.
This in turn has an effect on the geostatistical prediction.

Exploring spatial structure through the Semivariogram/Covariance Cloud tool

The Semivariogram/Covariance Cloud tool can be used to investigate autocorrelation in your dataset. Consider the ozone dataset. Notice in the figure below that you can select all
pairs of locations that are a certain distance apart by brushing all points at that distance in
the semivariogram cloud.
Autocorrelation

Looking for directional influences with the Semivariogram/Covariance Cloud tool

In the previous example, you used the Semivariogram/Covariance Cloud tool to look at the
general autocorrelation of the data. However, looking at the semivariogram surface, it
appears that there might be directional differences in the semivariogram values. When you
click Show search direction and set the angles and bandwidths as in the following figure,
you can see that the locations linked together have very similar values because the
semivariogram values are relatively low.
Directional variation

If you change the direction of the links as in the following figure, you can see that some
linked locations have values that are quite different, which result in the higher
semivariogram values. This indicates that locations separated by a distance of about
125,000 meters in the northeast direction are, on average, more different than locations in
the northwest direction. Recall that when variation changes more rapidly in one direction
than another, it is termed anisotropy. When interpolating a surface using the Geostatistical
Analyst wizard, you can use semivariogram models that account for anisotropy.
Modified directional variation

2.10.2. The Semivariogram/Covariance Cloud tool

The Semivariogram/Covariance Cloud tool shows the empirical semivariogram and covariance values for all pairs of locations within a dataset and plots them as a function of
the distance that separates the two locations, as in the example shown below:
Semivariogram covariance cloud illustration

The Semivariogram/Covariance Cloud tool can be used to examine the local characteristics of spatial autocorrelation within a dataset and look for local outliers. The
semivariogram cloud looks something like this:
Semivariogram cloud example

In the illustration above, each red dot shows the empirical semivariogram value (the
squared difference between the values of two data points making up a pair) plotted against
the distance separating the two points. You can brush the dots and see the linked pairs of
points in ArcMap.

A Semivariogram Surface with Search Direction capabilities is shown below. The values in
the semivariogram cloud are put into bins based on the direction and distance between a
pair of locations. These binned values are then averaged and smoothed to produce the
semivariogram surface. In the figure below, the legend shows the values between color
transitions. In this tool, you can input a lag size to control the size of the bins, and the
number of bins is determined by the number of lags you specify. The extent of the
semivariogram surface is controlled by lag size and number of lags.

Semivariogram Surface with Search Direction
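
The binning step can be sketched as follows for the distance dimension alone (numpy; illustrative only — the tool also bins by direction to build the surface):

```python
# Illustrative lag binning of semivariogram-cloud values by distance.
import numpy as np

def binned_semivariogram(dist, sqdiff, lag_size, n_lags):
    """Average cloud values within n_lags distance bins of width lag_size."""
    edges = lag_size * np.arange(n_lags + 1)
    idx = np.digitize(dist, edges) - 1           # bin index for each pair
    centers, means = [], []
    for k in range(n_lags):
        mask = idx == k
        if mask.any():
            centers.append(edges[k] + lag_size / 2.0)
            means.append(sqdiff[mask].mean())
    return np.array(centers), np.array(means)
```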

You can view subsets of values in the semivariogram cloud by checking the Show search
direction box and clicking the direction controller to resize it or change its orientation.
You select the dataset and attribute using the following controls:

2.10.3. Examining spatial autocorrelation and directional variation using the Semivariogram/Covariance Cloud tool

Examining spatial structure allows you to investigate the existence of spatial autocorrelation of the sample data and explore whether there are any directional
influences.

Pairs of points that are close together (to the left on the x-axis in the semivariogram)
should be more alike (low on the y-axis) than those that are farther apart (moving to the
right on the x-axis).

If the pairs of points in the semivariogram produce a horizontal straight line, there may be
no spatial correlation in the data and it would be meaningless to interpolate the data.

Steps:

• Click the point or polygon feature layer in the ArcMap table of contents that you want to explore.

• Click the Geostatistical Analyst arrow on the Geostatistical Analyst toolbar, click Explore Data, then click Semivariogram/Covariance Cloud.

2.10.4. The Crosscovariance Cloud tool

The Crosscovariance cloud shows the empirical crosscovariance for all pairs of locations
between two datasets and plots them as a function of the distance between the two
locations, as in the example shown below:
Crosscovariance cloud illustration

The Crosscovariance cloud can be used to examine the local characteristics of spatial
correlation between two datasets, and it can be used to look for spatial shifts in correlation
between two datasets. A crosscovariance cloud looks something like this:
Crosscovariance cloud example

In the illustration above, each red dot shows the empirical crosscovariance between a pair
of locations, with the attribute of one point taken from the first dataset and the attribute of
the second point taken from the second dataset. You can brush dots and see the linked
pairs in ArcMap. (To differentiate which of the pairs came from which dataset, set a
different selection color in the Properties dialog box of each dataset.)
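
A minimal sketch of the quantity plotted in the cloud, assuming it is the product of mean-centered values from the two datasets for each cross-pair (numpy; illustrative only, not the Geostatistical Analyst implementation):

```python
# Illustrative cross-covariance cloud between two datasets.
import numpy as np

def crosscovariance_cloud(coords1, z1, coords2, z2):
    """Distance and product of mean-centered values for every cross-dataset pair."""
    coords1 = np.asarray(coords1, dtype=float)
    coords2 = np.asarray(coords2, dtype=float)
    d1 = np.asarray(z1, dtype=float) - np.mean(z1)
    d2 = np.asarray(z2, dtype=float) - np.mean(z2)
    dist = np.linalg.norm(coords1[:, None, :] - coords2[None, :, :], axis=2).ravel()
    cross = np.outer(d1, d2).ravel()
    return dist, cross
```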

A covariance surface with search direction capabilities is also provided in the tool. The
values in the crosscovariance cloud are put into bins based on the direction and distance
separating a pair of locations. These binned values are then averaged and smoothed to
produce a crosscovariance surface. The legend shows the colors and values separating
classes of covariance value. The extent of the crosscovariance surface is controlled by lag
size and number of lags that you specify.
Crosscovariance surface

Directional controller

You can view subsets of values in the Crosscovariance cloud by checking the Show
search direction option and dragging the directional controller to resize it or change its
orientation.

You select datasets and attributes using the following controls:

Datasets and attributes fields

In this case, OZONE is the attribute field storing the ozone concentrations, and NO2AAM
is the attribute field storing the level of nitrogen dioxide.

You can click the arrow button Hide beside Data Sources to temporarily hide this part of
the tool.

2.10.5. Examining covariation among multiple datasets

The Crosscovariance Cloud tool can be used to investigate cross-correlation between two
datasets. Consider ozone (dataset 1) and NO2 (dataset 2). Notice that the cross-
correlation between NO2 and ozone seems to be asymmetric. The red area shows that the
highest correlation between both datasets occurs when taking NO2 values that are shifted
to the west of the ozone values. The Search Direction tool will help identify the reasons for
this. When it is pointed toward the west, this is shown:
Search Direction tool (west)

When it is pointed toward the east, this is shown:

Search Direction tool (east)

It is clear that there are higher covariance values when the Search Direction tool is pointed
toward the west. You can use the Crosscovariance Cloud and Histogram tools to examine
which pairs contribute the highest cross-covariance values. If you use the Search Direction
tool pointed in the west direction and brush some of the high cross-covariance points in
the cloud, you can see that most of the corresponding data points are located in central
California. You can also see that the NO2 values are shifted to the west of the ozone
values. The histograms show that the high covariance values occur because both NO2 (blue bars in the NO2 histogram) and ozone (orange bars in the ozone histogram) values for the selected points are above the mean NO2 and ozone values, respectively. From this
analysis, you have learned that much of the asymmetry in the cross-covariance is due to a
shift occurring because high NO2 values occur to the west of the high ozone values.

You could also obtain high cross-covariance values whenever the pairs selected from both
datasets have values that are below their respective means. In fact, you would expect to
see high cross-covariance values from pairs of locations that are both above and below
their respective means, and these would occur in several regions within the study area. By
exploring the data, you can identify that the cross-covariance in central California seems to
be different from that in the rest of the state. Based on this information, you might decide
that the results from Crosscovariance Cloud are due to a nonconstant mean in the data
and try to remove trends from both NO2 and ozone.

Crosscovariance analysis results

2.10.6. Examining covariation among multiple datasets using the Crosscovariance Cloud tool

The Crosscovariance Cloud tool shows the covariance between two datasets. Use the
Search Direction tool to examine the cloud. Check if the covariance surface is symmetric
and the cross-covariance values are similar in all directions.

If you see that there is a spatial shift in the values of two datasets or unusually high cross-
covariance values, you can investigate where these occur. If you note that unusual cross-
covariance values occur for isolated locations or within restricted areas of your study site,
you may want to take some action, such as investigating those data values, detrending
data, or splitting the data into different strata before interpolating it.
Steps:

1. Right-click the point feature layer identifying the first layer in the cross-covariance analysis in the ArcMap table of contents and click Properties.

2. Click the Selection tab.

3. Click the symbol button (located under the "with this symbol" option).

4. Choose a color and size for the selection.

5. Click OK and then OK again.

6. Repeat steps 1–5 for the second layer to be used in the cross-covariance analysis, but choose different selection sizes and colors.

7. Highlight the layers in the ArcMap table of contents by holding down CTRL while clicking the two layers.

8. Click Geostatistical Analyst > Explore Data > Crosscovariance Cloud.

9. Choose the appropriate attribute for each layer in the Attribute list.

10. Check the Show search direction option.

11. Click the center line of the Search Direction tool and drag it until it points to the angle where you believe there is a shift.

12. Brush some points in the covariance cloud by clicking and dragging over some of the red points. Examine where the pairs of points are on the map in ArcMap.

3. Creating surfaces

3.1. Geostatistical Analyst example applications

The Geostatistical Analyst addresses a wide range of different application areas. The
following is a small sampling of applications in which Geostatistical Analyst was used.

3.1.1. Exploratory spatial data analysis

Using measured sample points from a study area, Geostatistical Analyst was used to
create accurate predictions for other unmeasured locations within the same area.
Exploratory spatial data analysis tools included with Geostatistical Analyst were used to
assess the statistical properties of data such as spatial data variability, spatial data
dependence, and global trends.

A number of exploratory spatial data analysis tools were used in the example below to
investigate the properties of ozone measurements taken at monitoring stations in the
Carpathian Mountains.

Application of Geostatistical Analyst for the Carpathian Mountains

3.1.2. Semivariogram modeling

Geostatistical analysis of data occurs in the following phases:

• Modeling the semivariogram or covariance to analyze surface properties

• Kriging

A number of kriging methods are available for surface creation in Geostatistical Analyst,
including ordinary, simple, universal, indicator, probability, and disjunctive kriging.

The following illustrates the two phases of geostatistical analysis of data. First, the
Semivariogram/Covariance wizard was used to fit a model to winter temperature data for
the United States. This model was then used to create the temperature distribution map.
Geostatistical Analyst application for winter temperature

3.1.3. Surface prediction and error modeling

Various types of map layers can be produced using Geostatistical Analyst, including
prediction maps, quantile maps, probability maps, and prediction standard error maps.

The following shows Geostatistical Analyst used to produce a prediction map of radiocesium soil contamination levels in the country of Belarus after the Chernobyl nuclear power plant accident.
Geostatistical Analyst application for radiocesium contamination

3.1.4. Threshold mapping

Probability maps can be generated to predict where values exceed a critical threshold.

In the example below, locations shown in dark orange and red indicate a probability
greater than 62.5 percent that radiocesium contamination exceeds the upper permissible
level (critical threshold) in forest berries.

Geostatistical Analyst application for radiocesium contamination threshold


3.1.5. Model validation and diagnostics

Input data can be split into two subsets. The first subset of the available data can be used
to develop a model for prediction. The predicted values are then compared with the known
values at the remaining locations using the Validation tool.

The following shows the Validation wizard used to assess a model developed to predict
organic matter for a farm in Illinois.

Geostatistical Analyst application for organic matter in Illinois

3.1.6. Surface prediction using cokriging

Cokriging, an advanced surface modeling method included in Geostatistical Analyst, can be used to improve surface prediction of a primary variable by taking into account
secondary variables, provided that the primary and secondary variables are spatially
correlated.

In the following example, exploratory spatial data analysis tools are used to explore spatial
correlation between ozone (primary variable) and nitrogen dioxide (secondary variable) in
California. Because the variables are spatially correlated, cokriging can use the nitrogen
dioxide data to improve predictions when mapping ozone.
Geostatistical Analyst application for ozone in California

3.2. Key concepts for all interpolation methods

3.2.1. Analyzing the surface properties of nearby locations

Generally speaking, things that are closer together tend to be more alike than things that
are farther apart. This is a fundamental geographic principle. Suppose you are a town
planner and need to build a scenic park in your town. You have several candidate sites,
and you may want to model the viewsheds at each location. This will require a more
detailed elevation surface dataset for your study area. Suppose you have preexisting
elevation data for 1,000 locations throughout the town. You can use this to build a new
elevation surface.

When trying to build the elevation surface, you can assume that the sample values closest
to the prediction location will be similar. But how many sample locations should you
consider? Should all of the sample values be considered equally? As you move farther
away from the prediction location, the influence of the points will decrease. Considering a
point too far away may actually be detrimental because the point may be located in an
area that is dramatically different from the prediction location.

One solution is to consider enough points to give a good prediction, but few enough points
to be practical. The number will vary with the amount and distribution of the sample points
and the character of the surface. If the elevation samples are relatively evenly distributed
and the surface characteristics do not change significantly across your landscape, you can
predict surface values from nearby points with reasonable accuracy. To account for the
distance relationship, the values of closer points are usually weighted more heavily than
those farther away. This principle is common to all the interpolation methods offered in
Geostatistical Analyst (except for global polynomial interpolation, which assigns equal
weights to all points).
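
A compact sketch of this distance-decay idea, using inverse distance weighting as the simplest example (numpy; illustrative only — kriging derives its weights from the semivariogram instead, as discussed later):

```python
# Illustrative inverse distance weighted prediction from the nearest samples.
import numpy as np

def idw_predict(coords, values, target, power=2, n_neighbors=12):
    """Weighted average of the n nearest samples, weights proportional to 1/d**power."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    d = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    nearest = np.argsort(d)[:n_neighbors]
    d = np.maximum(d[nearest], 1e-12)        # guard against a sample at the target
    w = 1.0 / d ** power
    return float(np.sum(w * values[nearest]) / np.sum(w))
```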

3.2.2. Search neighborhoods

You can assume that as locations get farther from the prediction location, the measured
values have less spatial autocorrelation with the prediction location. As these points have
little or no effect on the predicted value, they can be eliminated from the calculation of that
particular prediction point by defining a search neighborhood. It is also possible that distant
locations may have a detrimental influence on the predicted value if they are located in an
area that has different characteristics than those of the prediction location. A third reason
to use search neighborhoods is computational speed. If you have 2,000 data locations and use all of them for every prediction, the matrix that must be solved becomes very large, and the computation slows down considerably. The smaller the search neighborhood, the faster the predicted values can be
generated. As a result, it is common practice to limit the number of points used in a
prediction by specifying a search neighborhood.

The specified shape of the neighborhood restricts how far and where to look for the
measured values to be used in the prediction. Additional parameters restrict the locations
that are used within the search neighborhood. The search neighborhood can be altered by
changing its size and shape or by changing the number of neighbors it includes.

The shape of the neighborhood is influenced by the input data and the surface that you are
trying to create. If there are no directional influences in the spatial autocorrelation of your
data (see Accounting for directional influences for more information), you will want to use
points equally in all directions, and the shape of the search neighborhood is a circle.
However, if there is directional autocorrelation or a trend in the data, you may want the
shape of your neighborhood to be an ellipse oriented with the major axis parallel to the
direction of long-range autocorrelation (the direction in which the data values are most
similar).

The search neighborhood can be specified in the Geostatistical Wizard, as shown in the
following example:

• Neighborhood type: Standard

• Maximum neighbors = 4

• Minimum neighbors = 2

• Sector type (search strategy): Circle with four quadrants with 45° offset; radius = 182955.6

• Coordinates of test point (x = -2084032, y = 89604.57)

• Predicted value = 0.08593987


Searching neighborhood step

The Weights section lists the weights that are used to estimate the value at the location
marked by the crosshair on the preview surface. The data points with the largest weights
are highlighted in red.

Once a neighborhood shape is specified, you can restrict which locations within the shape
should be used. You can define the maximum and minimum number of neighbors to
include and divide the neighborhood into sectors to ensure that you include values from all
directions. If you divide the neighborhood into sectors, the specified maximum and
minimum number of neighbors is applied to each sector.

There are several different sector types that can be used:


• One sector

• Ellipse with four sectors

• Ellipse with four sectors and a 45-degree offset (selected)

• Eight sectors
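
The sector logic can be sketched as follows, simplified to a circular neighborhood (numpy; illustrative only — the parameter names are invented and this is not the Geostatistical Analyst implementation):

```python
# Illustrative sectored search: keep at most max_per_sector nearest points per sector.
import numpy as np

def sector_neighbors(coords, target, radius, n_sectors=4, offset_deg=45.0, max_per_sector=4):
    """Indices of the neighbors retained in each angular sector around the target."""
    delta = np.asarray(coords, dtype=float) - np.asarray(target, dtype=float)
    dist = np.linalg.norm(delta, axis=1)
    angle = (np.degrees(np.arctan2(delta[:, 1], delta[:, 0])) - offset_deg) % 360.0
    sector = (angle // (360.0 / n_sectors)).astype(int)
    keep = []
    for s in range(n_sectors):
        idx = np.where((sector == s) & (dist <= radius))[0]
        keep.extend(idx[np.argsort(dist[idx])][:max_per_sector])
    return np.array(keep, dtype=int)
```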

Using the data configuration specified by the search neighborhood in conjunction with the fitted semivariogram model, kriging determines the weights for the measured locations. Using the weights and the measured values, a prediction can be made for the
prediction location. This process is performed for each location within the study area to
create a continuous surface. Other interpolation methods follow the same process, but the
weights are determined using techniques that do not involve a semivariogram model.

The Smooth Interpolation option creates three ellipses. The central ellipse uses the Major
semiaxis and Minor semiaxis values. The inner ellipse uses these semiaxis values
multiplied by 1 minus the value for Smoothing factor, whereas the outer ellipse uses the
semiaxis values multiplied by 1 plus the smoothing factor. All the points within these three
ellipses are used in the interpolation. Points inside the smallest ellipse have weights
assigned to them in the same way as for standard interpolation (for example, if the
method being used is inverse distance weighted interpolation, the points within the
smallest ellipse are weighted based on their distance from the prediction location). The
points that fall between the smallest ellipse and the largest ellipse get weights as
described for the points falling inside the smallest ellipse, but then the weights are
multiplied by a sigmoidal value that decreases from 1 (for points located just outside the
smallest ellipse) to 0 (for points located just outside the largest ellipse). Data points outside
the largest ellipse have zero weight in the interpolation. An example of this is shown
below:
Geostatistical wizard showing weights for data points
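
A simplified sketch of that weight taper, using circles instead of ellipses and a smooth cosine falloff as a stand-in for the sigmoidal multiplier (numpy; illustrative only, not the exact function used by the software):

```python
# Illustrative smooth-interpolation taper: 1 inside the inner radius, 0 beyond the
# outer radius, smooth decrease in between.
import numpy as np

def smoothing_taper(dist, radius, smoothing_factor=0.5):
    """Multiplier applied to a point's weight, based on its distance from the target."""
    inner = radius * (1.0 - smoothing_factor)
    outer = radius * (1.0 + smoothing_factor)
    t = np.clip((np.asarray(dist, dtype=float) - inner) / (outer - inner), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))   # decreases smoothly from 1 to 0
```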

The exceptions to the above descriptions are as follows:

• Areal interpolation, which only supports one sector.

• Empirical Bayesian kriging, which requires a circular search neighborhood; therefore, Major semiaxis and Minor semiaxis have been replaced with Radius. The value of the radius represents the length of the radius of the searching circle.

In Geostatistical Analyst, the weights for all nonkriging models are defined by a priori
analytic functions based on the distance from the prediction location. Most kriging models
predict a value using the weighted sum of the values of the nearby locations. Kriging uses
the semivariogram to define the weights that determine the contribution of each data point
to the prediction of new values at unsampled locations. Because of this, the default search
neighborhood used in kriging is constructed using the major and minor ranges of the
semivariogram model.
It is expected that a continuous surface is made from continuous data, such as
temperature observations, for example. However, all interpolators with a local searching
neighborhood generate predictions (and prediction standard errors) that can be
substantially different for nearby locations if the local neighborhoods are different. To see a
graphical representation of why this occurs, see Smooth interpolation.

3.2.3. Altering the search neighborhood by changing the number of neighbors

The neighborhood search size defines the neighborhood shape and the constraints of the
points within the neighborhood that will be used in the prediction of an unmeasured
location.

You set the neighborhood parameters by looking at the locations of the points in the data
view window and using prior knowledge gained in ESDA and semivariogram/covariance
modeling.

The following are tips for altering the search neighborhood by changing the number of
neighbors:

• Each sector will be projected outward if the minimum number of points is not found inside the sector.

• If there are no points within the searching neighborhood, then for most of the interpolation methods, it will mean that a prediction cannot be made at that location.

• Although some interpolators, such as simple and disjunctive kriging, predict values in areas without data points using the mean value of the dataset, a common practice is to change the searching neighborhood so that some points are located in the searching neighborhood.

Use the step below as a guide to changing the search neighborhood for any of the
interpolation methods offered in Geostatistical Wizard (except for global polynomial
interpolation). The step applies once you have defined the interpolation method and data
you want to use and have advanced through the wizard until you have reached the
Searching Neighborhood window.

Steps:

• Limit the number of neighbors to use for the prediction by changing the Maximum neighbors and Minimum neighbors parameters.

These parameters control the number of neighbors included in each sector of the search neighborhood. The number and orientation of the sectors can be changed by altering the Sector type parameter.

Tip: The impact of the search neighborhood can be assessed using the cross-
validation and comparison tools that are available in Geostatistical Analyst. If necessary,
the search neighborhood can be redefined and another surface created.

3.2.4. Altering the search neighborhood by changing its size and shape

The neighborhood search size defines the neighborhood shape and the constraints of the
points within the neighborhood that will be used in the prediction of an unmeasured
location.

You set the neighborhood parameters by looking at the locations of the points in the data
view window and using prior knowledge gained in ESDA and semivariogram/covariance
modeling.
Use the steps below as a guide to changing the search neighborhood for any of the
interpolation methods offered in Geostatistical Wizard (except for global polynomial
interpolation).

Steps:

• Sector type is used to alter the type of searching neighborhood by choosing from a predefined list.

• Type an Angle value between 0 and 360 degrees or choose from the drop-down list. This is the orientation of the major semiaxis, measured in degrees from north.

• The Major semiaxis and Minor semiaxis parameters are used to alter the shape of the ellipse. The desired shape appears in the display window once these values are entered.

3.2.5. Altering the map view in Geostatistical Wizard

The preview map is provided for all interpolation methods in the wizard, and it can be
manipulated using the controls in the toolbar above it.

Steps:

• Click the Zoom In button above the map view, then drag a box around the area of the map on which the zoom will occur.

• Click the Zoom Out button above the map view, then drag a box around the area of the map on which the zoom will occur.

• Click the Pan button and move the pointer into the map display, click and hold, then move the pointer to pan around the map display. The map moves in coordination with the pointer.

• Click the Full Extent button to display the map using the full extent.

• Click the Show Layers button and choose which features to display.

• Click the Back arrow or the Forward arrow to display the previous or next extent.

• Click the Change points size button to change the symbology of the input points.

• Click the Show Legend button to toggle between showing and hiding the map legend.

• Click the Identify value button and click a location on the map to make a prediction and to highlight which points are used to make this prediction.

3.2.6. Determining the prediction for a specific location


Tabla de contenido

1. Introduction to Geostatistical Analisis........................................................................................1


1.1. What is geostatistics?.........................................................................................................1
1.2. The geostatistical workflow................................................................................................2
1.3. What is the Extensión ArcGIS Geostatistical Analyst?.........................................................4
1.4. Essential vocabulary for Geostatistical Analyst...................................................................5
1.5. A quick tour of Geostatistical Analyst.................................................................................7
Exploratory spatial data analysis graphs..................................................................................8
Tools for exploring a single dataset................................................................................................9
Tools for exploring relationships between datasets.......................................................................9
Geostatistical Wizard...................................................................................................................10
Deterministic methods.................................................................................................................11
Geostatistical methods.................................................................................................................12
Geostatistical Analyst toolbox......................................................................................................13
Subset Features............................................................................................................................14
The process of building an interpolation model...........................................................................15
2. Choosing the right method.......................................................................................................15
2.1. An introduction to interpolation methods........................................................................15
2.2. Classification trees of the interpolation methods offered in Geostatistical Analyst.........17
2.3. Examining and understanding your data..........................................................................22
The importance of knowing your data.........................................................................................22
2.4. Map the data....................................................................................................................23
2.5. Exploratory Spatial Data Analysis......................................................................................28
Quantitative data exploration......................................................................................................28
The ESDA tools.............................................................................................................................29
Working with the ESDA tools: brushing and linking.....................................................................29
2.6. Data distributions and transformations.........................................................30
Examine the distribution of your data..........................................................................................30
Examining the distribution of your data using histograms and normal QQ plots.........................31
Histograms...................................................................................................................................32
Normal QQ plot and general QQ plot...........................................................................................38
Data transformations...................................................................................................................43
Using Box-Cox, arcsine, and log transformations.........................................................................45
Normal score transformation.......................................................................................................46
Using normal score transformations............................................................................................48
Comparing normal score transformations to other transformations...........................................49
2.7. Global and Local outliers................................................49
Looking for global and local outliers.............................................................................................50
2.8. Trend analysis................................................55
Examining global trends through trend analysis..........................................................................56
Looking for global trends..............................................................................................................58
Modeling global trends................................................................................................................59
2.9. Examining local variation..................................................................................................60
2.10. Examining spatial autocorrelation....................................................................................62
Examining spatial autocorrelation and directional variation........................................................62
The Semivariogram/Covariance Cloud tool..................................................................................66
Examining spatial autocorrelation and directional variation using the Semivariogram/Covariance
Cloud tool.....................................................................................................................................69
The Crosscovariance Cloud tool...................................................................................................69
Examining covariation among multiple datasets..........................................................................72
Examining covariation among multiple datasets using the Crosscovariance Cloud tool..............75
3. Creating surfaces......................................................................................................................76
3.1. Geostatistical Analyst example applications.....................................................................76
Exploratory spatial data analysis..................................................................................................76
Semivariogram modeling.............................................................................................................77
Surface prediction and error modeling........................................................................................78
Threshold mapping......................................................................................................................79
Model validation and diagnostics.................................................................................................80
Surface prediction using cokriging...............................................................................................80
3.2. Key concepts for all interpolation methods......................................................................81
Analyzing the surface properties of nearby locations..................................................................81
Search neighborhoods.................................................................................................................82
Altering the search neighborhood by changing the number of neighbors...................................87
Altering the search neighborhood by changing its size and shape...............................................88
Altering the map view in Geostatistical Wizard............................................................................89
Determining the prediction for a specific location.......................................................................90
1. Introducción al Análisis Geoestadístico....................................................................................94
1.1. ¿Qué es la Geoestadística?...............................................................................................94
1.2. El flujo de trabajo geoestadístico......................................................................................95
1.3. ¿Qué es la extensión del análisis geoestadístico de ArcGIS?.............................................96
1.4. Vocabulario esencial para el análisis geoestadístico.........................................................96
1.5. Un tour rápido del análisis geoestadístico........................................................................97
Gráficos de análisis exploratorios de datos espaciales.................................................................97
Herramientas para la exploración de un conjunto de datos........................................................97
Herramientas para la exploración de las relaciones entre conjuntos de datos............................97
Asistente geoestadístico...............................................................................................................97
Métodos deterministas................................................................................................................97
Métodos geoestadísticos.............................................................................................................97
Caja de herramientas de análisis geoestadístico..........................................................................97
Características de los subconjuntos.............................................................................................97
Análisis Geoestadístico
1. Introducción al Análisis Geoestadístico

1.1. ¿Qué es la Geoestadística?

La Geoestadística es una clase de estadística utilizada para analizar y predecir los valores asociados a fenómenos espaciales o espaciotemporales. Incorpora las coordenadas espaciales (y en algunos casos temporales) de los datos en el análisis. Muchas herramientas geoestadísticas fueron desarrolladas inicialmente como medios prácticos para describir patrones espaciales e interpolar valores en aquellas localizaciones donde no se tomaron muestras. Esas herramientas y métodos han evolucionado desde entonces no solo para proveer valores interpolados, sino también para entregar medidas de incertidumbre de dichos valores. La medición de la incertidumbre es crítica para la toma informada de decisiones, ya que proporciona información sobre los posibles valores (resultados) para cada ubicación en lugar de solo un valor interpolado. El análisis geoestadístico también ha evolucionado de lo univariado a lo multivariado y ofrece mecanismos para incorporar conjuntos de datos secundarios que complementan la variable primaria de interés (posiblemente escasa), permitiendo así la construcción de modelos de interpolación y de incertidumbre más exactos.

La geoestadística es ampliamente utilizada en diversas áreas de la ciencia y la ingeniería, como por ejemplo:

 La industria minera utiliza la geoestadística para diversos aspectos de un proyecto: inicialmente para cuantificar los recursos minerales y evaluar la factibilidad económica del proyecto y luego, a diario, para decidir qué material se envía a la planta y cuál se desecha, utilizando información actualizada a medida que esté disponible.

 En las ciencias medioambientales, la geoestadística es utilizada para estimar niveles de polución con el fin de decidir si suponen una amenaza para la salud ambiental o humana y garantizar su mitigación.

 Aplicaciones relativamente nuevas en el campo de la ciencia de los suelos se enfocan en el mapeo de los niveles de nutrientes presentes en éstos (nitrógeno, fósforo, potasio, entre otros) así como de otros indicadores (como la conductividad eléctrica) con el fin de estudiar sus relaciones, ajustar los rendimientos y prescribir la cantidad precisa de fertilizantes requerida en cada ubicación dentro del campo.

 Las aplicaciones meteorológicas incluyen la predicción de temperaturas, precipitaciones y variables asociadas (como la lluvia ácida).

 Recientemente, ha habido numerosas aplicaciones de la geoestadística en el área de la salud pública, por ejemplo, prediciendo niveles de contaminación ambiental y su relación con las tasas de incidencia de cáncer.

En todos estos ejemplos, el contexto general es que existe algún fenómeno de interés en el territorio (el nivel de contaminación del suelo, del agua o del aire por un contaminante; el contenido de oro u otro mineral en una mina, entre otros). Realizar estudios exhaustivos es caro y requiere mucho tiempo, por lo que el fenómeno se caracteriza generalmente mediante la toma de muestras en diferentes ubicaciones. La geoestadística se utiliza, por lo tanto, para generar predicciones (y las medidas de incertidumbre asociadas a dichas predicciones) para los lugares no muestreados. Un flujo de trabajo general para los estudios geoestadísticos se describe en El flujo de trabajo geoestadístico (sección 1.2).

1.2. El flujo de trabajo geoestadístico

En esta sección se presenta un flujo de trabajo generalizado para los estudios geoestadísticos y se explican sus pasos principales. Como se menciona en ¿Qué es la Geoestadística?, la geoestadística es una clase de estadística utilizada para analizar y predecir los valores asociados a fenómenos espaciales o espaciotemporales. La extensión de análisis geoestadístico de ArcGIS provee un conjunto de herramientas que permiten construir modelos que utilizan coordenadas espaciales (y temporales). Estos modelos pueden aplicarse a una amplia variedad de situaciones y se usan habitualmente para generar predicciones en ubicaciones no muestreadas, así como para obtener medidas de incertidumbre de dichas predicciones.
[Figura: Modelo geoestadístico — flujo de trabajo general]
1. Mapear y analizar los datos; preprocesarlos si es necesario (transformar, quitar la tendencia, desagrupar).
2. Modelar la estructura espacial y definir la estrategia de búsqueda.
3. Predecir valores en las zonas no muestreadas y cuantificar la incertidumbre de las predicciones.
4. Revisar que el modelo genere resultados razonables tanto para las predicciones como para los niveles de incertidumbre; si no es así, ajustar el modelo y repetir.
5. Utilizar la información en análisis de riesgos y toma de decisiones.
El primer paso, como en casi cualquier estudio basado en datos, es examinar de cerca dichos datos. Esto generalmente comienza con el mapeo del conjunto de datos, utilizando un esquema de clasificación y de color que permita una clara visualización de las características más importantes presentes en los datos, como por ejemplo: un fuerte incremento de los valores de norte a sur (ver 2.8, Análisis de tendencias); una mezcla de valores altos y bajos sin una disposición particular (posiblemente un signo de que los datos fueron tomados a una escala que no permite observar la correlación espacial; ver 2.10.1, Examinando la autocorrelación espacial y la variación direccional); o zonas muestreadas con mayor densidad (muestreo preferencial), lo que puede llevar a la decisión de usar ponderaciones de desagrupamiento en el análisis de los datos (ver XXXXXXXX para una discusión más detallada sobre los esquemas de mapeo y clasificación).

La segunda etapa consiste en construir el modelo geoestadístico. Este proceso puede implicar varios subprocesos, dependiendo de los objetivos del estudio (es decir, del tipo o los tipos de información que debiese proveer el modelo) y de las características del conjunto de datos que se consideren lo suficientemente importantes como para ser incluidas en el estudio. En esta etapa, la información recolectada durante una exploración rigurosa del conjunto de datos y el conocimiento previo del fenómeno a estudiar determinarán cuán complejo será el modelo y cuán buenos serán los valores interpolados y las medidas de incertidumbre.

Como muestra la figura anterior, la construcción del modelo puede involucrar el preprocesamiento de los datos para remover tendencias espaciales, las que se modelan por separado y se agregan nuevamente en el paso final del proceso de interpolación (ver Análisis de tendencias); la transformación de los datos para que sigan más de cerca una distribución gaussiana (requerida por algunos métodos y modelos; ver Examinando la distribución de sus datos); y el desagrupamiento del conjunto de datos para compensar el muestreo preferencial. Si bien mucha información puede derivarse del examen del conjunto de datos, es importante incorporar cualquier conocimiento previo que se tenga respecto del fenómeno.

1.3. ¿Qué es la extensión del análisis geoestadístico de ArcGIS?

1.4. Vocabulario esencial para el análisis geoestadístico

1.5. Un tour rápido del análisis geoestadístico

1.5.1. Gráficos de análisis exploratorios de datos espaciales

1.5.2. Herramientas para la exploración de un conjunto de datos

1.5.3. Herramientas para la exploración de las relaciones entre conjuntos de datos

1.5.4. Asistente geoestadístico

1.5.5. Métodos deterministas

1.5.6. Métodos geoestadísticos

1.5.7. Caja de herramientas de análisis geoestadístico

1.5.8. Características de los subconjuntos

1.5.9. El proceso de construcción de un modelo de interpolación

2. Eligiendo el método correcto

2.1. Introducción a los métodos de interpolación

2.2. Árboles de clasificación de los métodos de interpolación ofrecidos en el análisis geoestadístico

2.3. Examinando y entendiendo sus datos

2.3.1. La importancia de conocer sus datos

2.4. Mapeando los datos

2.5. Análisis exploratorio de datos espaciales

2.5.1. Exploración cuantitativa de datos

2.5.2. Las herramientas ESDA

2.5.3. Trabajando con las herramientas ESDA: cepillado (brushing) y vinculación (linking)

2.6. Distribución y transformación de datos

2.6.1. Examinando la distribución de sus datos

2.6.2. Examinando la distribución de sus datos mediante histogramas y gráficos QQ normales

2.6.3. Histogramas

2.6.4. Gráficos QQ normales y generales

2.6.5. Transformación de datos

2.6.6. Usando las transformaciones Box-Cox, arcoseno y logarítmica

2.6.7. Transformación de puntuación normal (normal score)

2.6.8. Usando transformaciones de puntuación normal

2.6.9. Comparando las transformaciones de puntuación normal con otras transformaciones

2.7. Valores atípicos locales y globales

2.7.1. Buscando valores atípicos locales y globales

2.8. Análisis de tendencias

2.8.1. Examinando tendencias globales mediante el análisis de tendencias

2.8.2. Buscando tendencias globales

2.8.3. Modelando tendencias globales

2.9. Examinando las variaciones locales

2.10. Examinando la autocorrelación espacial

2.10.1. Examinando la autocorrelación espacial y la variación direccional

2.10.2. La herramienta Nube de semivariograma/covarianza
