Geostatistical Analysis
Geostatistics is a class of statistics used to analyze and predict the values associated with
spatial or spatiotemporal phenomena. It incorporates the spatial (and in some cases
temporal) coordinates of the data within the analyses. Many geostatistical tools were
originally developed as a practical means to describe spatial patterns and interpolate
values for locations where samples were not taken. Those tools and methods have since
evolved to not only provide interpolated values, but also measures of uncertainty for those
values. The measurement of uncertainty is critical to informed decision making, as it
provides information on the possible values (outcomes) for each location rather than just
one interpolated value. Geostatistical analysis has also evolved from uni- to multivariate
and offers mechanisms to incorporate secondary datasets that complement a (possibly
sparse) primary variable of interest, thus allowing the construction of more accurate
interpolation and uncertainty models.
Geostatistics is widely used in many areas of science and engineering, for example:
The mining industry uses geostatistics for several aspects of a project: initially to
quantify mineral resources and evaluate the project's economic feasibility, then on
a daily basis in order to decide which material is routed to the plant and which is
waste, using updated information as it becomes available.
Relatively new applications in the field of soil science focus on mapping soil
nutrient levels (nitrogen, phosphorus, potassium, and so on) and other indicators
(such as electrical conductivity) in order to study their relationships to crop yield
and prescribe precise amounts of fertilizer for each location in the field.
In all of these examples, the general context is that there is some phenomenon of interest
occurring in the landscape (the level of contamination of soil, water, or air by a pollutant;
the content of gold or some other metal in a mine; and so forth). Exhaustive studies are
expensive and time consuming, so the phenomenon is usually characterized by taking
samples at different locations. Geostatistics is then used to produce predictions (and
related measures of uncertainty of the predictions) for the unsampled locations. A
generalized workflow for geostatistical studies is described in The geostatistical workflow.
In this topic, a generalized workflow for geostatistical studies is presented, and the main
steps are explained. As mentioned in What is geostatistics, geostatistics is a class of
statistics used to analyze and predict the values associated with spatial or spatiotemporal
phenomena. ArcGIS Geostatistical Analyst provides a set of tools that allow models that
use spatial (and temporal) coordinates to be constructed. These models can be applied to
a wide variety of scenarios and are typically used to generate predictions for unsampled
locations, as well as measures of uncertainty for those predictions.
The first step, as in almost any data-driven study, is to closely examine the data. This
typically starts by mapping the dataset, using a classification and color scheme that allow
clear visualization of important characteristics that the dataset might present, for example,
a strong increase in values from north to south (Trend—see Trend_analysis); a mix of high
and low values in no particular arrangement (possibly a sign that the data was taken at a
scale that does not show spatial correlation—see
Examining_spatial_structure_and_directional_variation); or zones that are more densely
sampled (preferential sampling) and may lead to the decision to use declustering weights
in the analysis of the data—see Implementing_declustering_to_adjust_for_preferential_sampling.
See Map the data for a more detailed discussion on mapping and classification
schemes.
The second stage is to build the geostatistical model. This process can entail several
steps, depending on the objectives of the study (that is, the type(s) of information the
model is supposed to provide) and the features of the dataset that have been deemed
important enough to incorporate. At this stage, information collected during a rigorous
exploration of the dataset and prior knowledge of the phenomenon determine how
complex the model is and how good the interpolated values and measures of uncertainty
will be. In the figure above, building the model can involve preprocessing the data to
remove spatial trends, which are modeled separately and added back in the final step of
the interpolation process (see Trend_analysis); transforming the data so that it follows a
Gaussian distribution more closely (required by some methods and model outputs—see
About_examining_the_distribution_of_the_data); and declustering the dataset to
compensate for preferential sampling. While a lot of information can be derived by
examining the dataset, it is important to incorporate any knowledge you might have of the
phenomenon. The modeler cannot rely solely on the dataset to show all the important
features; those that do not appear can still be incorporated into the model by adjusting
parameter values to reflect an expected outcome. It is important that the model be as
realistic as possible in order for the interpolated values and associated uncertainties to be
accurate representations of the real phenomenon.
In addition to preprocessing the data, it may be necessary to model the spatial structure
(spatial correlation) in the dataset. Some methods, like kriging, require this to be explicitly
modeled using semivariogram or covariance functions (see
Semivariograms_and_covariance_functions); whereas other methods, like Inverse
Distance Weighting, rely on an assumed degree of spatial structure, which the modeler
must provide based on prior knowledge of the phenomenon.
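As a rough illustration of what explicit modeling of spatial structure starts from, the sketch below computes an empirical semivariogram by averaging half the squared value differences of point pairs within distance bins. This is a minimal NumPy sketch, not the Geostatistical Analyst implementation; the lag width, number of lags, and toy dataset are assumptions made for the example.

```python
import numpy as np

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Bin 0.5 * (z_i - z_j)^2 by separation distance (illustrative sketch)."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise distances and semivariogram-cloud values
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    gamma = 0.5 * (values[:, None] - values[None, :]) ** 2
    # Keep each pair once (upper triangle, excluding the diagonal)
    iu = np.triu_indices(len(values), k=1)
    dist, gamma = dist[iu], gamma[iu]
    # Average the semivariance within each distance bin (lag)
    lags, semivariances = [], []
    for k in range(n_lags):
        in_bin = (dist >= k * lag_width) & (dist < (k + 1) * lag_width)
        if in_bin.any():
            lags.append(dist[in_bin].mean())
            semivariances.append(gamma[in_bin].mean())
    return np.array(lags), np.array(semivariances)

# Toy example: 50 random sample points with a weak north-south gradient
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(50, 2))
z = 0.05 * xy[:, 1] + rng.normal(0, 1, 50)
lags, gammas = empirical_semivariogram(xy, z, lag_width=10.0, n_lags=8)
print(np.column_stack([lags, gammas]))
```

A semivariogram or covariance model is then fit to these binned points before it is used in kriging.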
A final component of the model is the search strategy. This defines how many data points
are used to generate a value for an unsampled location. Their spatial configuration
(location with respect to one another and to the unsampled location) can also be defined.
Both factors affect the interpolated value and its associated uncertainty. For many
methods, a search ellipse is defined, along with the number of sectors the ellipse is split
into and how many points are taken from each sector to make a prediction (see
Search_neighborhood).
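The following is a minimal sketch of the sector idea behind a search strategy: a circular window split into four sectors, keeping only the closest points in each sector. It is illustrative only; a real search ellipse in Geostatistical Analyst also has major and minor axis lengths and an angle, which are not modeled here, and the radius, sector count, and per-sector maximum are arbitrary assumptions.

```python
import numpy as np

def sector_neighbors(coords, target, radius, n_sectors=4, max_per_sector=5):
    """Pick the nearest points per angular sector within a circular search radius."""
    offsets = np.asarray(coords, dtype=float) - np.asarray(target, dtype=float)
    dist = np.hypot(offsets[:, 0], offsets[:, 1])
    angle = np.arctan2(offsets[:, 1], offsets[:, 0]) % (2 * np.pi)
    sector = (angle / (2 * np.pi / n_sectors)).astype(int)
    chosen = []
    for s in range(n_sectors):
        idx = np.where((sector == s) & (dist <= radius) & (dist > 0))[0]
        # Keep only the closest points in this sector
        chosen.extend(idx[np.argsort(dist[idx])][:max_per_sector])
    return np.array(chosen, dtype=int)

rng = np.random.default_rng(1)
pts = rng.uniform(0, 100, size=(200, 2))
neighbors = sector_neighbors(pts, target=(50, 50), radius=25.0)
print(len(neighbors), "neighbors selected")
```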
Once the model has been completely defined, it can be used in conjunction with the
dataset to generate interpolated values for all unsampled locations within an area of
interest. The output is usually a map showing values of the variable being modeled. The
effect of outliers can be investigated at this stage, as they will probably change the model's
parameter values and thus the interpolated map. Depending on the interpolation method,
the same model can also be used to generate measures of uncertainty for the interpolated
values. Not all models have this capability, so it is important to define at the start if
measures of uncertainty are needed. This determines which of the models are suitable
(see Classification trees).
As with all modeling endeavors, the model's output should be checked, that is, make sure
that the interpolated values and associated measures of uncertainty are reasonable and
match your expectations.
Once the model has been satisfactorily built, adjusted, and its output checked, the results
can be used in risk analyses and decision making.
The ArcGIS Geostatistical Analyst extension provides the capability for surface modeling
using deterministic and geostatistical methods. The tools it provides are fully integrated
with the GIS modeling environments and allow GIS professionals to generate interpolation
models and assess their quality before using them in any further analysis. Surfaces (model
output) can subsequently be used in models (both in the ModelBuilder and Python
environments), visualized, and analyzed using other ArcGIS extensions, such as
the ArcGIS Spatial Analyst extension and the ArcGIS 3D Analyst extension.
The tools provided in the ArcGIS Geostatistical Analyst extension are grouped into three
categories:
A set of Exploratory Spatial Data Analysis (ESDA) tools, used to examine the data before an interpolation model is built.
The Geostatistical Wizard (also accessed through the toolbar), which leads analysts through the process of creating and evaluating an interpolation model.
A set of geoprocessing tools that are specially designed to work with the outputs of the models and extend the capabilities of the Geostatistical Wizard.
The following topics provide information on how to access and configure the ArcGIS Geostatistical Analyst extension and give a quick tour of its main components and features:
How to access and configure the extension: Enabling the ArcGIS Geostatistical Analyst extension and Adding the Geostatistical Analyst toolbar to ArcMap
A quick tour of the main components and features: A quick tour of Geostatistical
Analyst
The following terms and concepts arise repeatedly in geostatistics and within
Geostatistical Analyst.
Cross validation: A technique used to assess how accurate an interpolation model is. In Geostatistical Analyst, cross validation leaves one point out and uses the rest to predict a value at that location. The point is then added back into the dataset, and a different one is removed. This is done for all samples in the dataset and provides pairs of predicted and known values that can be compared to assess the model's performance. Results are usually summarized as mean and root mean squared errors.
Geostatistical methods: In Geostatistical Analyst, geostatistical methods are those that are based on statistical models that include autocorrelation (the statistical relationships among the measured points). These techniques have the capability of producing prediction surfaces as well as some measure of the uncertainty (error) associated with those predictions.
Interpolation: A process that uses measured values taken at known sample locations to predict (estimate) values for unsampled locations. Geostatistical Analyst offers several interpolation methods, which differ based on their underlying assumptions, data requirements, and capabilities to generate different types of output (for example, predicted values as well as the errors [uncertainties] associated with them).
Search neighborhood: Most of the interpolation methods use a local subset of the data to make predictions. Imagine a moving window—only data within the window is used to make a prediction at the center of the window. This is done because there is redundant information in samples that are far away from the location where the prediction is needed, and to reduce the computing time required to generate predicted values for the entire study area. The choice of neighborhood (the number of nearby samples and their spatial configuration within the window) will affect the prediction surface and should be made with care.
Transformation: A data transformation is done when a function (log, Box-Cox, arcsine, normal score) is applied to the data to change the shape of its distribution and/or stabilize the variance (that is, to reduce a relationship between the mean and variance, for example, one in which data variability increases as the mean value increases).
Validation: Similar to cross validation, but instead of using the same dataset to build and evaluate the model, two datasets are used—one to build the model and the other as an independent test of its performance. If only one dataset is available, the Subset Features tool can be used to randomly split it into training and test subsets.
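To make the cross-validation entry above concrete, the sketch below performs leave-one-out cross-validation around a simple inverse distance weighted predictor and reports the mean error and root mean squared error. It is a conceptual NumPy sketch, not the Cross Validation tool; the IDW power of 2 and the toy dataset are assumptions.

```python
import numpy as np

def idw_predict(coords, values, target, power=2.0):
    """Inverse distance weighted prediction at a single target location."""
    d = np.hypot(*(np.asarray(coords, float) - np.asarray(target, float)).T)
    if np.any(d == 0):                      # exact hit on a sample point
        return values[np.argmin(d)]
    w = 1.0 / d ** power
    return np.sum(w * values) / np.sum(w)

def leave_one_out(coords, values, power=2.0):
    """Predict each sample from all the others and summarize the errors."""
    coords, values = np.asarray(coords, float), np.asarray(values, float)
    errors = []
    for i in range(len(values)):
        mask = np.arange(len(values)) != i
        pred = idw_predict(coords[mask], values[mask], coords[i], power)
        errors.append(pred - values[i])
    errors = np.array(errors)
    return errors.mean(), np.sqrt((errors ** 2).mean())   # mean error, RMSE

rng = np.random.default_rng(2)
xy = rng.uniform(0, 10, size=(40, 2))
z = np.sin(xy[:, 0]) + 0.1 * rng.normal(size=40)
print("Mean error, RMSE:", leave_one_out(xy, z))
```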
Before using the interpolation techniques, you should explore your data using the
exploratory spatial data analysis tools. These tools allow you to gain insights into your data
and to select the most appropriate method and parameters for the interpolation model. For
example, when using ordinary kriging to produce a quantile map, you should examine the
distribution of the input data because this particular method assumes that the data is
normally distributed. If your data is not normally distributed, you should include a data
transformation as part of the interpolation model. A second example is that you might
detect a spatial trend in your data using the ESDA tools and want to include a step to
model it independently as part of the prediction process.
The ESDA tools are accessed through the Geostatistical Analyst toolbar (shown below)
and are composed of the following:
The following graphic illustrates the ESDA tools used for analyzing one dataset at a time:
The following graphic depicts the two tools that are designed to examine relationships
between two datasets:
Geostatistical Wizard
The Geostatistical Wizard is a dynamic set of pages that is designed to guide you through
the process of constructing and evaluating the performance of an interpolation model.
Choices made on one page determine which options will be available on the following
pages and how you interact with the data to develop a suitable model. The wizard guides
you from the point when you choose an interpolation method all the way to viewing
summary measures of the model's expected performance. A simple version of this
workflow (for inverse distance weighted interpolation) is represented graphically below:
During construction of an interpolation model, the wizard allows changes in parameter
values, suggests or provides optimized parameter values, and allows you to move forward
or backward in the process to assess the cross-validation results to see whether the
current model is satisfactory or some of the parameter values should be modified. This
flexibility, in addition to dynamic data and surface previews, makes the wizard a powerful
environment in which to build interpolation models.
Deterministic methods
Deterministic techniques have parameters that control either (1) the extent of similarity (for
example, inverse distance weighted) of the values or (2) the degree of smoothing (for
example, radial basis functions) in the surface. These techniques are not based on a
random spatial process model, and there is no explicit measurement or modeling of spatial
autocorrelation in the data. Deterministic methods include the following:
Geostatistical methods
Geostatistical techniques assume that at least some of the spatial variation observed in
natural phenomena can be modeled by random processes with spatial autocorrelation and
require that the spatial autocorrelation be explicitly modeled. Geostatistical techniques can
be used to describe and model spatial patterns (variography), predict values at
unmeasured locations (kriging), and assess the uncertainty associated with a predicted
value at the unmeasured locations (kriging).
The Geostatistical Wizard offers several types of kriging, which are suitable for different
types of data and have different underlying assumptions:
Ordinary
Simple
Universal
Indicator
Probability
Disjunctive
Areal interpolation
Empirical Bayesian
The Geostatistical Analyst toolbox includes tools for analyzing data, producing a variety of
output surfaces, examining and transforming geostatistical layers to other formats,
performing geostatistical simulation and sensitivity analysis, and aiding in designing
sampling networks. The tools have been grouped into five toolsets:
While cross-validation is provided for all methods available in the Geostatistical Wizard
and can also be run for any geostatistical layer using the Cross Validation geoprocessing
tool, a more rigorous way to assess the quality of an output surface is to compare
predicted values with measurements that were not used to construct the interpolation
model. As it is not always possible to go back to the study area to collect an independent
validation dataset, one solution is to divide the original dataset into two parts. One part can
be used to construct the model and produce a surface. The other part can be used to
compare and validate the output surface. The Subset Features tool enables you to split a
dataset into training and test datasets. The Subset Features tool is a geoprocessing tool
(housed in the Geostatistical Analyst toolbox shown in the section above). For
convenience, this tool is also available from the Geostatistical Analyst toolbar, as shown in
the following figure:
For further information on this tool and how to use it, see How Subset Features works in
Geostatistical Analyst and Using validation to assess models.
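The following sketch illustrates the validation idea described above: randomly split one dataset into training and test subsets, build a surface from the training points, and summarize the errors at the test points. It only mimics the spirit of the Subset Features tool; the 70/30 split and the nearest-neighbor stand-in predictor are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

rng = np.random.default_rng(4)
n = 200
xy = rng.uniform(0, 100, size=(n, 2))
z = 0.02 * xy[:, 0] + np.sin(xy[:, 1] / 10.0) + 0.2 * rng.normal(size=n)

# Randomly assign roughly 70 percent of the samples to training, the rest to testing
shuffled = rng.permutation(n)
train, test = shuffled[: int(0.7 * n)], shuffled[int(0.7 * n):]

# Any interpolator could stand in here; nearest neighbor keeps the sketch short
model = NearestNDInterpolator(xy[train], z[train])
errors = model(xy[test]) - z[test]
print("Validation mean error:", errors.mean())
print("Validation RMSE:", np.sqrt((errors ** 2).mean()))
```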
Geostatistical Analyst includes many tools for analyzing data and producing a variety of
output surfaces. While the reasons for your investigations might vary, you're encouraged
to adopt the approach described in The geostatistical workflow when analyzing and
mapping spatial processes:
Both the Geostatistical Wizard and Geostatistical Analyst toolbox offer many interpolation
methods. You should always have a clear understanding of the objectives of your study
and how the predicted values (and other associated information) will help you make more
informed decisions when choosing a method. To provide some guidance, see
Classification trees for a set of classification trees of the diverse methods.
(Data shown in the figures was provided courtesy of the Alaska Fisheries Science Center.)
It is important to remember, however, that these methods are a means that allows you to
construct models of reality (that is, of the phenomenon you are interested in). It is up to
you, the practitioner, to build models that suit your specific needs and provide the
information necessary to make informed and defensible decisions. A big part of building a
good model is your understanding of the phenomenon, how the sample data was obtained
and what it represents, and what you expect the model to provide. General steps in the
process of building a model are described in The geostatistical workflow.
Many interpolation methods exist. Some are quite flexible and can accommodate different
aspects of the sample data. Others are more restrictive and require that the data meet
specific conditions. Kriging methods, for example, are quite flexible, but within the kriging
family there are varying degrees of conditions that must be met for the output to be valid.
Geostatistical Analyst offers the following interpolation methods:
Global polynomial
Local polynomial
Diffusion kernel
Kernel smoothing
Ordinary kriging
Simple kriging
Universal kriging
Indicator kriging
Probability kriging
Disjunctive kriging
Areal interpolation
Empirical Bayesian kriging
Each of these methods has its own set of parameters, allowing it to be customized for a
particular dataset and requirements on the output that it generates. To provide some
guidance in selecting which to use, the methods have been classified according to several
different criteria, as shown in Classification trees of the interpolation methods offered in
Geostatistical Analyst. After you clearly define the goal of developing an interpolation
model and fully examine the sample data, these classification trees may be able to guide
you to an appropriate method.
One of the most important decisions you will have to make is to define what your
objective(s) is in developing an interpolation model. In other words, what information do
you need the model to provide so that you can make a decision? For example, in the
public health arena, interpolation models are used to predict levels of contaminants that
can be statistically associated with disease rates. Based on that information, further
sampling studies can be designed, public health policies can be developed, and so on.
Geostatistical Analyst offers many different interpolation methods. Each has unique
qualities and provides different information (in some cases, methods provide similar
information; in other cases, the information may be quite different). The following diagrams
show these methods classified according to different criteria. Choose a criterion that is
important for your particular situation and a branch in the corresponding tree that
represents the option that you are interested in. This will lead you to one or more
interpolation methods that may be appropriate for your situation. Most likely, you will have
several important criteria to meet and will use several of the classification trees. Compare
the interpolation methods suggested by each tree branch you follow and pick a few
methods to contrast before deciding on a final model.
The first tree suggests methods based on their ability to generate predictions or
predictions and associated errors.
Some methods require a model of spatial autocorrelation to generate predicted values, but
others do not. Modeling spatial autocorrelation requires defining extra parameter values
and interactively fitting a model to the data.
Different methods generate different types of output, which is why you must decide what
type of information you need to generate prior to building the interpolation model.
Interpolation methods vary in their levels of complexity, which can be measured by the
number of assumptions that must be met for the model to be valid.
Some interpolators are exact (at each input data location, the surface will have exactly the
same value as the input data value), while others are not. Exact replication of the input
data may be important in some situations.
Some methods produce surfaces that are smoother than others. Radial basis functions are
smooth by construction, for example. The use of a smooth search neighborhood will
produce smoother surfaces than a standard search neighborhood.
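As a hedged illustration of the smoothing idea, the sketch below fits two radial basis function surfaces with SciPy, one exact and one with a nonzero smoothing parameter. It is not the Geostatistical Analyst radial basis functions implementation; the thin plate spline kernel, the smoothing value, and the toy data are assumptions.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Illustrative only: 30 noisy samples of a smooth surface
rng = np.random.default_rng(3)
xy = rng.uniform(0, 1, size=(30, 2))
z = np.sin(2 * np.pi * xy[:, 0]) * np.cos(2 * np.pi * xy[:, 1]) + 0.1 * rng.normal(size=30)

# The smoothing parameter controls how closely the surface honors the samples:
# smoothing=0 interpolates exactly; larger values give a smoother surface.
exact = RBFInterpolator(xy, z, kernel="thin_plate_spline", smoothing=0.0)
smooth = RBFInterpolator(xy, z, kernel="thin_plate_spline", smoothing=1.0)

grid = np.mgrid[0:1:25j, 0:1:25j].reshape(2, -1).T   # 25 x 25 prediction grid
print(exact(grid).shape, smooth(grid).shape)
```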
For some decisions, it is important to consider not only the predicted value at a location
but also the uncertainty (variability) associated with that prediction. Some methods provide
measures of uncertainty, while others do not.
Finally, processing speed may be a factor in your analysis. In general, most of the
interpolation methods are relatively fast, except when barriers are used to control the
interpolation process.
The classification trees use the following abbreviations for the interpolation methods:
As mentioned in The geostatistical workflow, there are many stages involved in creating a
surface. The first is to fully explore the data and identify important features that will be
incorporated in the model. These features must be identified at the beginning of the
process because a number of choices have to be made and parameter values have to be
specified in each stage of building the model. Note that, in the Geostatistical Wizard, the
choices you make determine the options that are available in the following steps of the
process, so it is important to identify the main features of the model before starting to build
it. While the Geostatistical Wizard provides reliable default values (some of which are
calculated specifically for your data), it cannot interpret the context of your study or the
objectives you have in creating the model. It is critical that you create and refine the model
based on additional insights gained from prior knowledge of the phenomenon and data
exploration in order to generate a more accurate surface.
The following topics provide more detail regarding data exploration and information on how
to use the findings when building an interpolation model:
Map the data—Covers the first step in data exploration: mapping the data using a
classification scheme that shows the important features.
Looking for global and local outliers—Presents techniques for identifying global and
local outliers using the Histogram, Semivariogram/Covariance Cloud, and Voronoi
Map tools.
Trend analysis—Examines how to identify global trends in the data using the Trend
Plot tool.
Examining local variation—Indicates how to use the Voronoi Map tool to show
whether the local mean and local standard deviation are relatively constant over
the study area (a visualization of stationarity). The tool also provides other local
factors (including clustering) that can be useful in identifying outliers.
The first step for any analysis is to map and examine the data. This provides you with a
first look at the spatial components of the dataset and may give indications of outliers and
erroneous data values, global trends, and the dominant directions of spatial
autocorrelation, among other factors, all of which are important in the development of an
interpolation model that accurately reflects the phenomenon you are interested in.
ArcGIS offers many ways to visualize data: ArcMap provides access to many classification
schemes and color ramps, which can be used to highlight different aspects of the data,
whereas ArcScene allows the data to be rendered in 3D space, which is useful when
looking for local outliers and global trends. While there is no one correct way to display the
data, the following figures illustrate different renderings of the same data that allow
different aspects of interest to be seen. For more detailed information on classification
schemes available in ArcGIS, see Classifying numerical fields for graduated symbols.
The initial view of the data provided by ArcMap uses the same symbol for all the sample
points. This view provides information on the spatial extent of the samples, coverage of the
study area (if a boundary is available), and indicates whether there were areas that were
more heavily or intensely sampled than others (called preferential sampling). In some
interpolation models (specifically simple kriging models built as a basis for geostatistical
simulation and disjunctive kriging models), it is important to use a declustering technique
(see Implementing declustering to adjust for preferential sampling) to obtain a dataset that
is representative of the phenomenon and is not affected by oversampling in high- or low-
valued regions of the study area.
A second step in mapping the data is to use a classification scheme and color ramp that
show data values and their spatial relationship. By default, ArcMap will apply a natural
breaks classification to the data. This is shown in the following figure, which uses five
classes and a color scheme with blue for cold water temperatures and red for warmer
water temperatures.
Natural breaks looks for statistically large differences between adjacent pairs of data (the
data is sorted by value, not by location). In this case, warmer temperatures occur on the
westernmost samples, while those in the center of the study area are colder. Samples
closest to mainland Alaska show warmer temperatures. The map also shows that
temperatures are fairly constant along lines going from the northwest to the southeast.
These two findings can be interpreted as a trough of colder water in the center of the
sampled area, which runs from the northwest toward the southeast. This is a global trend
in the data and can be modeled as a second order polynomial using global polynomial
interpolation or local polynomial interpolation or as a trend in kriging.
Other methods that can be used to classify the data are equal interval (which uses classes
of equal width) and quantile (which breaks the data into classes that all have the same
number of data values). Both of these classifications are shown below and essentially
show the same spatial features as the natural breaks classification for this dataset.
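The equal interval and quantile classifications can be reproduced with a few lines of NumPy, as sketched below. The five-class default and the small toy dataset are assumptions, and natural breaks (which looks for large gaps between sorted values) is not shown.

```python
import numpy as np

def equal_interval_breaks(values, n_classes=5):
    """Classes of equal width spanning the data range."""
    return np.linspace(np.min(values), np.max(values), n_classes + 1)

def quantile_breaks(values, n_classes=5):
    """Classes that each contain (approximately) the same number of values."""
    return np.quantile(values, np.linspace(0, 1, n_classes + 1))

temps = np.array([1.2, 1.5, 2.0, 2.2, 2.4, 3.1, 3.3, 4.0, 4.8, 5.5, 6.1, 7.9])
print("Equal interval:", equal_interval_breaks(temps))
print("Quantile:      ", quantile_breaks(temps))
```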
A different view of the data is provided by a classification based on the statistical
distribution of the data values. This rendering can be helpful in identifying outliers and
erroneous data. The following figure uses the standard deviation classification and a color
ramp that shows positive deviations from the mean in red and negative deviations from the
mean in blue.
This classification refines the preliminary assessment: positive deviations from the mean
occur in the westernmost samples, while in the center of the sampled area, there is a zone
of colder temperatures (negative deviations from the mean) running from the northwest to
the southeast. Samples closest to the Alaskan mainland do not deviate much from the
mean (shown in yellow). The standard deviation classification can be adjusted manually to
represent a more common approach to finding outlying values: the class breaks are
adjusted to show values that deviate more than one standard deviation from the mean.
The central portion of the data (that is, values that fall between the mean minus one
standard deviation and the mean plus one standard deviation) will contain approximately
68 percent of the data values if the data is normally (Gaussian) distributed. This adjusted classification is
shown below and shows more clearly those values that deviate significantly from the
mean. In this case, the standard deviation classification confirms what was observed in
using the natural breaks, equal interval, and quantile classifications.
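A minimal sketch of the standard deviation classification follows: class breaks are placed at whole standard deviations from the mean, and the share of values within one standard deviation is checked against the roughly 68 percent expected for normally distributed data. The break positions and the toy data are assumptions.

```python
import numpy as np

def std_dev_breaks(values, n_devs=(-2, -1, 1, 2)):
    """Class breaks placed at whole standard deviations from the mean."""
    mean, sd = np.mean(values), np.std(values, ddof=1)
    return np.array([mean + k * sd for k in n_devs])

rng = np.random.default_rng(5)
temps = rng.normal(loc=4.0, scale=1.5, size=500)     # roughly Gaussian toy data
breaks = std_dev_breaks(temps)
within_one_sd = np.mean(np.abs(temps - temps.mean()) <= temps.std(ddof=1))
print("Breaks at mean +/- 1, 2 SD:", breaks)
print("Share within one SD (about 0.68 for normal data):", round(within_one_sd, 3))
```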
In visually exploring the data, it may be worthwhile to investigate how the number of
classes affects the rendering of the data. The number of classes should be sufficient to
show local detail in the data values but not so many that general features would be hidden.
For the data used in these examples, five classes were adequate. Nine classes did not
add much to the maps and made interpreting the main spatial features less
straightforward.
After mapping the data, a second stage of data exploration should be performed using the
Exploratory Spatial Data Analysis (ESDA) tools. These tools allow you to examine the data
in more quantitative ways than mapping it and let you gain a deeper understanding of the
phenomena you are investigating so that you can make more informed decisions on how
the interpolation model should be constructed. The most common tasks you should
perform to explore your data are the following:
Not all these steps are necessary in all cases. For example, if you decide to use an
interpolation method that does not require a measure of spatial autocorrelation (GPI, LPI,
or RBF), then it is not necessary to explore spatial autocorrelation in the data. It may,
however, be a good idea to explore it anyway, as a significant amount of spatial
autocorrelation can lead to using a different interpolation method (kriging, for example)
than the one you had originally intended to use.
To help you accomplish these tasks, the ESDA tools allow different views into the data.
These views can be manipulated and explored, and all are interconnected with one another and with the data displayed in ArcMap through brushing and linking.
Histogram
Trend Analysis
Voronoi Map
Semivariogram/Covariance Cloud
Crosscovariance Cloud
The views in ESDA are interconnected by selecting (brushing) and highlighting the
selected points on all maps and graphs (linking). Brushing is a graphic way to perform a
selection in either the ArcMap data view or in an ESDA tool. Any selection that occurs in an ESDA view or in the ArcMap data view is reflected in all the ESDA windows as well as in ArcMap; this is known as linking.
For the Histogram, Voronoi Map, QQ Plot, and Trend Analysis tools, the graph bars,
points, or polygons that are selected in the tool view are linked to points in the ArcMap
data view, which are also highlighted. For the Semivariogram/Covariance tools, points in
the plots represent pairs of locations, and when some points are selected in the tool, the
corresponding pairs of points are highlighted in the ArcMap data view, with a line
connecting each pair. When pairs of points in the ArcMap data view are selected, the
corresponding points are highlighted in the Semivariogram/Covariance plot.
Most of the interpolation methods provided by ArcGIS Geostatistical Analyst do not require
the data to be normally distributed, although the prediction map may not be optimal if it is
not. That is, data transformations that change the shape (distribution) of the data are not
not required as part of the interpolation model. However, certain kriging methods require
the data to be approximately normally distributed (close to a bell-shaped curve). In
particular, quantile and probability maps created using ordinary, simple, or universal
kriging assume that the data comes from a multivariate normal distribution. In addition,
simple kriging models, which are used as a basis for geostatistical simulation (that is,
models used as input to the Gaussian Geostatistical Simulation tool—refer to
Geostatistical simulation concepts and How Gaussian geostatistical simulations work),
should use data that is normally distributed or include a normal score transformation as
part of the model to ensure this.
Normally distributed data has a probability density function that looks like the one shown in
the following diagram:
The Histogram and Normal QQ plot are designed to help you explore the distribution of
your data, and they include different data transformations (Box-Cox, logarithmic, and
arcsine) so that you can assess the effects they have on the data. To learn more about the
transformations that are available in these tools, see Box-Cox, Arcsine, and Log
transformations.
All kriging methods rely on the assumption of stationarity. This assumption requires, in
part, that all data values come from distributions that have the same variability. Data
transformations can also be used to satisfy this assumption of equal variability. For more
information on stationarity, see Random processes with dependence.
Examining the distribution of your data using histograms and normal QQ plots
The ESDA tools (refer to Exploratory Spatial Data Analysis) help you examine the
distribution of your data.
When checking whether your data is normally distributed (close to a bell-shaped curve),
the Histogram and Normal QQ Plots will help you. In the summary statistics provided by
the Histogram, the mean and median will be similar, the skewness should be near zero,
and the kurtosis should be near 3 if the data is normally distributed. If the data is highly
skewed, you may choose to transform it to see if you can make it more normally
distributed. Note that the back-transformation process generates approximately unbiased
predictions with approximate kriging standard errors when you use universal or ordinary
kriging.
The Normal QQ plot provides a visual comparison of your dataset to a standard normal
distribution, and you can investigate points that cause departures from a normal
distribution by selecting them in the plot and examining their locations on a map. For an
example, refer to Normal QQ and general QQ plots. Data transformations can also be
used in the Normal QQ Plot.
Steps:
Click the point feature layer in the ArcMap table of contents that you want to
examine.
Click the Geostatistical Analyst toolbar, point to Explore Data, then click either
Histogram or Normal QQ Plot.
Tip: In the Histogram, make sure that the Statistics box is checked to see
summary statistics for the data.
Tip: In the Normal QQ Plot, the points will fall close to the 45-degree reference line
if the data is normally distributed.
Histograms
The Histogram tool provides a univariate (one-variable) description of your data. The tool
dialog box displays the frequency distribution for the dataset of interest and calculates
summary statistics.
Frequency distribution
The frequency distribution is a bar graph that displays how often observed values fall
within certain intervals or classes. You can specify the number of classes of equal width
that are used in the histogram. The relative proportion of data that falls in each class is
represented by the height of each bar. For example, the histogram below shows the
frequency distribution (10 classes) for a dataset.
Summary statistics
The important features of a distribution can be summarized by statistics that describe its
location, spread, and shape.
Measures of location
Measures of location provide you with an idea of where the center and other parts of the
distribution lie.
The mean is the arithmetic average of the data. The mean provides a measure of
the center of the distribution.
The median value corresponds to a cumulative proportion of 0.5. If the data was
arranged in increasing order, 50 percent of the values would lie below the median,
and 50 percent of the values would lie above the median. The median provides
another measure of the center of the distribution.
The first and third quartiles correspond to the cumulative proportion of 0.25 and
0.75, respectively. If the data was arranged in increasing order, 25 percent of the
values would lie below the first quartile, and 25 percent of the values would lie
above the third quartile. The first and third quartiles are special cases of quantiles.
The quantiles correspond to cumulative proportions, which are calculated as follows for the ith ordered value out of N values:
cumulative proportion = (i - 0.5) / N
Measures of spread
The spread of points around the mean value is another characteristic of the displayed
frequency distribution.
The variance of the data is the average squared deviation of all values from the
mean. Because it involves squared differences, the calculated variance is sensitive
to unusually high or low values. The variance is estimated by summing the squared
deviations from the mean and dividing the sum by (N-1).
The standard deviation is the square root of the variance, and it describes the
spread of the data about the mean. The smaller the variance and standard
deviation, the tighter the cluster of measurements about the mean value.
The diagram below shows two distributions with different standard deviations. The
frequency distribution represented by the black line is more variable (wider spread) than
the frequency distribution represented by the red line. The variance and standard deviation
for the black frequency distribution are greater than those for the red frequency
distribution.
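The location and spread measures described above, together with the skewness and kurtosis statistics mentioned earlier (near 0 and 3, respectively, for normally distributed data), can be computed as in the following sketch. It uses simple moment-based estimates rather than the exact formulas of the Histogram tool, which is an assumption of the example.

```python
import numpy as np

def summary_statistics(values):
    """Location, spread, and shape measures, in the spirit of the Histogram tool."""
    z = np.sort(np.asarray(values, dtype=float))
    n = len(z)
    mean, median = z.mean(), np.median(z)
    q1, q3 = np.quantile(z, [0.25, 0.75])          # first and third quartiles
    variance = ((z - mean) ** 2).sum() / (n - 1)    # squared deviations / (N - 1)
    std = np.sqrt(variance)
    skewness = ((z - mean) ** 3).mean() / z.std() ** 3   # near 0 for normal data
    kurtosis = ((z - mean) ** 4).mean() / z.std() ** 4   # near 3 for normal data
    return dict(n=n, mean=mean, median=median, q1=q1, q3=q3,
                std=std, skewness=skewness, kurtosis=kurtosis)

rng = np.random.default_rng(6)
print(summary_statistics(rng.normal(10, 2, size=1000)))
```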
Measures of shape
Examples
With the Histogram tool, you can examine the shape of the distribution by direct
observation. By reviewing the mean and median statistics, you can determine the center
location of the distribution. Notice that in the figure below the distribution is bell-shaped,
and since the mean and median values are very close, this distribution is close to normal.
You can also highlight the extreme values in the tail of the histogram and see how they are
spatially located in the displayed map.
If your data is highly skewed, you can test the effects of a transformation on your data.
This figure shows a skewed distribution before a transformation is applied.
A log transformation is applied to the skewed data, and in this case, the transformation
makes the distribution close to normal.
Normal QQ plot and general QQ plot
Quantile-quantile (QQ) plots are graphs on which quantiles from two distributions are
plotted relative to each other.
First, the data values are ordered and cumulative distribution values are calculated as (i–
0.5)/n for the ith ordered value out of n total values (this gives the proportion of the data
that falls below a certain value). A cumulative distribution graph is produced by plotting the
ordered data versus the cumulative distribution values (graph on the top left in the figure
below). The same process is done for a standard normal distribution (a Gaussian
distribution with a mean of 0 and a standard deviation of 1, shown in the graph on the top
right of the figure below). Once these two cumulative distribution graphs have been
generated, data values corresponding to specific quantiles are paired and plotted in a QQ
plot (bottom graph in the figure below).
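The construction just described can be sketched in a few lines: order the data, compute cumulative proportions (i - 0.5)/n, and pair each ordered value with the standard normal quantile at the same proportion. The lognormal toy data and the use of a correlation coefficient as a rough straightness check are assumptions of the example.

```python
import numpy as np
from scipy.stats import norm

def normal_qq_pairs(values):
    """Pair each ordered data value with the standard normal quantile
    at the same cumulative proportion (i - 0.5) / n."""
    z = np.sort(np.asarray(values, dtype=float))
    n = len(z)
    cum_prop = (np.arange(1, n + 1) - 0.5) / n        # proportion below each value
    theoretical = norm.ppf(cum_prop)                  # standard normal quantiles
    return theoretical, z

rng = np.random.default_rng(7)
ozone = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # positively skewed toy data
x, y = normal_qq_pairs(ozone)
# For normally distributed data these pairs hug a straight reference line;
# a log transformation pulls skewed data much closer to it.
x_log, y_log = normal_qq_pairs(np.log(ozone))
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_log, y_log)[0, 1])
```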
General QQ plots are used to assess the similarity of the distributions of two datasets.
These plots are created following a similar procedure as described for the Normal QQ plot,
but instead of using a standard normal distribution as the second dataset, any dataset can
be used. If the two datasets have identical distributions, points in the general QQ plot will
fall on a straight (45-degree) line.
Examining data distributions using QQ plots
Points on the Normal QQ plot provide an indication of univariate normality of the dataset. If
the data is normally distributed, the points will fall on the 45-degree reference line. If the
data is not normally distributed, the points will deviate from the reference line.
In the diagram below, the quantile values of the standard normal distribution are plotted on
the x-axis in the Normal QQ plot, and the corresponding quantile values of the dataset are
plotted on the y-axis. You can see that the points fall close to the 45-degree reference line.
The main departure from this line occurs at high values of ozone concentration.
The Normal QQ Plot tool allows you to select the points that do not fall close to the
reference line. The locations of the selected points are then highlighted in the ArcMap data
view. As seen below, they are concentrated around the San Francisco Bay area (points
shaded in pink on the map below).
An example of using data transformations
As can be seen in the figure below, however, when a log transformation is applied to the
dataset, the points lie closer to the 45-degree reference line.
Box-Cox and arcsine transformations can also be applied to the data within the Normal
QQ Plot tool to assess their effect on the normality of the distribution.
Some methods in Geostatistical Analyst require that the data be normally distributed.
When the data is skewed (the distribution is lopsided), you might want to transform the
data to make it normal. The Histogram and Normal QQ Plot allow you to explore the
effects of different transformations on the distribution of the dataset. If the interpolation
model you build uses one of the kriging methods, and you choose to transform the data as
one of the steps, the predictions will be transformed back to the original scale in the
interpolated surface.
Geostatistical Analyst allows the use of several transformations including Box-Cox (also
known as power transformations), arcsine, and logarithmic. Suppose you observe data
Z(s), and apply some transformation Y(s) = t(Z(s)). Usually, you want to find the
transformation so that Y(s) is normally distributed. What often happens is that the
transformation also yields data that has constant variance through the study area.
Box-Cox transformation
The Box-Cox transformation is as follows:
Y(s) = (Z(s)^λ - 1) / λ, for λ ≠ 0.
For example, suppose that your data is composed of counts of some phenomenon. For
these types of data, the variance is often related to the mean. That is, if you have small
counts in part of your study area, the variability in that local region will be smaller than the
variability in another region where the counts are larger. In this case, the square-root
transformation may help to make the variances more constant throughout the study area
and often makes the data appear normally distributed as well. The square-root
transformation is a special case of the Box-Cox transformation when λ = ½.
Log transformation
The log transformation is actually a special case of the Box-Cox transformation when λ =
0; the transformation is as follows:
Y(s) = ln(Z(s)), for Z(s) > 0.
The log transformation is often used where the data has a positively skewed distribution
(shown below) and there are a few very large values. If these large values are located in
your study area, the log transformation will help make the variances more constant and
normalize your data. Concerning terminology, when a log transformation is implemented
with kriging, the prediction method is known as lognormal kriging, whereas for all other
values of λ, the associated kriging method is known as trans-Gaussian kriging.
Arcsine transformation
The arcsine transformation is as follows:
Y(s) = sin⁻¹(Z(s)), for Z(s) between 0 and 1.
The arcsine transformation can be used for data that represents proportions or
percentages. Often, when the data is proportions, the variance is smallest near 0 and 1
and largest near 0.5. The arcsine transformation will help make the variances more
constant throughout your study area and often makes the data appear normally distributed
as well.
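A minimal sketch of the three transformations follows, using the formulas given above. The example count and proportion values are made up, and the data is assumed to be positive (for the Box-Cox and log cases) or between 0 and 1 (for the arcsine case).

```python
import numpy as np

def box_cox(z, lam):
    """Box-Cox: (z**lam - 1) / lam for lam != 0, ln(z) for lam == 0 (z > 0)."""
    z = np.asarray(z, dtype=float)
    return np.log(z) if lam == 0 else (z ** lam - 1.0) / lam

def arcsine(p):
    """Arcsine transformation for proportions between 0 and 1."""
    return np.arcsin(np.asarray(p, dtype=float))

counts = np.array([1.0, 2.0, 4.0, 9.0, 16.0, 36.0, 100.0])
proportions = np.array([0.05, 0.20, 0.45, 0.50, 0.75, 0.95])

print(box_cox(counts, lam=0.5))   # square-root-style transform for count data
print(box_cox(counts, lam=0))     # log transform (the lambda = 0 special case)
print(arcsine(proportions))
```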
Data transformations can be used to make the variances constant throughout your study
area and make the data distribution closer to normal. Understanding transformations and
trends provides a more detailed discussion of transformations in Geostatistical Analyst.
Use the Histogram, Normal QQ Plot, and Voronoi Map tools in Exploratory Spatial Data Analysis to try different transformations and assess their effects on the distribution of the data. Box-Cox, arcsine, and log transformations discusses each of these transformations in more detail.
Keep in mind that some geostatistical methods assume and require normally distributed data: quantile and probability maps generated using ordinary, simple, and universal kriging, as well as any map generated using disjunctive kriging or the Gaussian Geostatistical Simulations geoprocessing tool.
Steps:
Choose the desired transformation from the Transformation type drop-down menu
for each dataset to which you would like to apply a transformation.
Click Next.
To decide which type of transformation might be the most appropriate for your data, go to
Explore Data on the Geostatistical Analyst toolbar and choose the Histogram option. Use the
Transformation section of the interface to display the histogram after different
transformations have been applied to the data. Choose the transformation that makes your
data have a distribution that is closest to a normal distribution.
Some interpolation and simulation methods require the input data to be normally
distributed (refer to Examine the distribution of your data for a list of these methods). The
normal score transformation (NST) is designed to transform your dataset so that it closely
resembles a standard normal distribution. It does this by ranking the values in your dataset
from lowest to highest and matching these ranks to equivalent ranks generated from a
normal distribution. Steps in the transformation are as follows: your dataset is sorted and
ranked, an equivalent rank from a standard normal distribution is found for each rank from
your dataset, and the normal distribution values associated with those ranks make up the
transformed dataset. The ranking process can be done using the frequency distribution or
the cumulative distribution of the datasets.
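The ranking-and-matching idea can be sketched as follows using the empirical cumulative distribution. This is a rough stand-in for the normal score transformation, not the Geostatistical Analyst implementation (which offers several approximation methods, described below), and it makes no special provision for tied values.

```python
import numpy as np
from scipy.stats import norm

def normal_score_transform(values):
    """Rank the data and map each rank to the matching standard normal quantile."""
    z = np.asarray(values, dtype=float)
    order = np.argsort(z)                          # sort and rank the data
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(z) + 1)
    cum_prop = (ranks - 0.5) / len(z)              # empirical cumulative proportions
    return norm.ppf(cum_prop)                      # equivalent normal quantiles

rng = np.random.default_rng(8)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=500)
nst = normal_score_transform(skewed)
print(round(nst.mean(), 3), round(nst.std(), 3))   # close to 0 and 1
```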
Examples showing histograms and cumulative distributions before and after a normal
score transformation was applied are shown below:
Histograms before and after a normal score transformation
Approximation methods
In Geostatistical Analyst, there are four approximation methods: direct, linear, Gaussian
kernels, and multiplicative skewing. The direct method uses the observed cumulative
distribution, the linear method fits lines between each step of the cumulative distribution,
and the Gaussian kernels method approximates the cumulative distribution by fitting a
linear combination of component cumulative normal distributions. Multiplicative skewing
approximates the cumulative distribution by fitting a base distribution (Student's t,
lognormal, gamma, empirical, and log empirical) that is then skewed by a fitted linear
combination of beta distributions (the skewing is done with the inverse probability integral
transformation). Lognormal, gamma, and log empirical base distributions can only be used
for positive data, and the predictions are guaranteed to be positive. Akaike's Information
Criterion (AIC) is provided to judge the quality of the fitted model.
Normal score transformation (NST) changes your data so that it follows a univariate
standard normal distribution. This is an important step when you create quantile or
probability maps using simple, probability, or disjunctive kriging. Also, simple kriging
models that will be used as input to the Gaussian Geostatistical Simulations tool should be
based on normally distributed data. The steps below describe how to apply a normal score
transformation to the data when using simple kriging.
Steps:
Click Next.
Click Next.
Choose the number of bars you want to display in the Density chart by setting the
slider in the upper right part of the interface. Click Cumulative to switch the display
to a cumulative distribution of the data, or click Normal QQPlot to display the
normal quantile-quantile plot of the data after the transformation.
If you are using cokriging and need to transform the other dataset(s), switch
datasets by clicking the Dataset Selection arrow. Define a normal score
transformation for the other datasets following steps 5 and 6.
Click Next and continue with the other steps in Geostatistical Wizard to create a
surface using a normal score transformation on your data.
The Normal score transformation (NST) is different from the Box-Cox, arcsine, and log
transformations (BAL) in several ways:
The NST function adapts to each particular dataset, whereas BAL transformations
do not (for example, the log transformation function always takes the natural
logarithm of the data).
The goal of the NST is to make the random errors of the whole population (not only
the sample) normally distributed. Due to this, it is important that the cumulative
distribution of the sample accurately reflects the true cumulative distribution of the
whole population (this requires correct sampling of the population and possibly
declustering to account for preferential sampling in some locations of the study
area). BAL, on the other hand, affects the sample data and can have goals of
stabilizing the variance, correcting skewness, or making the distribution closer to
normally distributed.
The NST must occur after detrending the data so that covariance and
semivariograms are calculated on residuals after trend correction. In contrast, BAL
transformations are used to attempt to remove any relationship between the
variance and the trend. Because of this, after the BAL transformation has been
applied to the data, you can optionally remove the trend and model spatial
autocorrelation. A consequence of this process is that you often get residuals that
are approximately normally distributed, but this is not a specific goal of BAL
transformations like it is for the NST transformation.
A global outlier is a measured sample point that has a very high or a very low value
relative to all the values in a dataset. For example, if 99 out of 100 points have values
between 300 and 400, but the 100th point has a value of 750, the 100th point may be a
global outlier.
A local outlier is a measured sample point that has a value within the normal range for the
entire dataset, but if you look at the surrounding points, it is unusually high or low. For
example, the diagram below is a cross section of a valley in a landscape. However, there
is one point in the center of the valley that has an unusually high value relative to its
surroundings, but it is not unusual compared to the entire dataset.
Local outliers
It is important to identify outliers for two reasons: they may be real abnormalities in the
phenomenon, or the value might have been measured or recorded incorrectly.
If an outlier is an actual abnormality in the phenomenon, this may be the most significant
point of the study and for understanding the phenomenon. For instance, a sample on the
vein of a mineral ore might be an outlier and the location that is most important to a mining
company.
If outliers are caused by errors during data entry that are clearly incorrect, they should
either be corrected or removed before creating a surface. Outliers can have several
detrimental effects on your prediction surface because of effects on semivariogram
modeling and the influence of neighboring values.
The Histogram tool enables you to select points on the tail of the distribution. The selected
points are displayed in the ArcMap data view. If the extreme values are isolated locations
(for instance, surrounded by very different values), they may require further investigation
and be removed if necessary.
Histogram and QQ Plot Map
In the example above, the high ozone values are not outliers and should not be removed
from the dataset.
If you have a global outlier with an unusually high value in your dataset, all pairings of
points with that outlier will have high values in the Semivariogram cloud, no matter what
the distance is. This can be seen in the semivariogram cloud and in the histogram shown
below. Notice that there are two main strata of points in the semivariogram. If you brush
points in the upper stratum, as demonstrated in the image, you can see in the ArcMap view
that all these high values come from pairings with a single location— a global outlier. Thus,
the upper stratum of points has been created by all the locations pairing with the single
outlier, and the lower stratum is composed of the pairings among the rest of the locations.
When you look at the histogram, you can see one high value on the right tail of the
histogram, again identifying the global outlier. This value was probably entered incorrectly
and should be removed or corrected.
Global outlier
When there is a local outlier, the value will not be out of the range of the entire distribution
but will be unusual relative to the surrounding values. In the local outlier example shown
below, you can see that pairs of locations that are close together have high semivariogram
values (these points are on the far left on the x-axis, indicating that they are close together,
and have high values on the y-axis, indicating that the semivariogram values are high).
When these points are brushed, you can see that all these points are pairings to a single
location. When you look at the histogram, you can see that there is no single value that is
unusual. The location in question is highlighted in the lower tail of the histogram and is
paired with higher surrounding values (see the highlighted points in the histogram). This
location may be a local outlier. Further investigation should be made before deciding if the
value at that point is erroneous or in fact reflects a true characteristic of the phenomenon
and should be included as part of the model.
Local outlier
The cluster method identifies those cells that are dissimilar to their surrounding neighbors.
You would expect the value recorded in a particular cell to be similar to at least one of its
neighbors. Therefore, this tool may be used to identify possible outliers.
Steps:
Click the point or polygon feature layer in the ArcMap table of contents that you
want to explore.
Click the Geostatistical Analyst arrow on the Geostatistical Analyst toolbar, click
Explore Data, then click Histogram.
If you see an isolated bar at the far left (the extreme minimum) or far right (the extreme maximum) of the histogram, it may indicate that the point the bar represents is an outlier. The more isolated such a bar is from the main group of bars in the histogram, the more likely it is that the point is indeed an outlier.
The Semivariogram/Covariance Cloud tool is useful for detecting local outliers. They
appear as points that are close together (low values on the x-axis) but are high on the y-
axis, indicating that the two points making up that pair have very different values. This is
contrary to what you would expect—namely, that points that are close together have
similar values.
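The sketch below applies that idea numerically: it builds the semivariogram cloud for all point pairs and flags pairs that are close in distance but have unusually high semivariance. The distance and semivariance cutoffs, and the injected outlier in the toy data, are assumptions; in practice you would brush such pairs interactively in the tool.

```python
import numpy as np

def flag_cloud_outliers(coords, values, close_dist, gamma_cutoff):
    """Flag pairs that are close together yet have large semivariogram-cloud values."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)
    dist = np.hypot(*(coords[i] - coords[j]).T)
    gamma = 0.5 * (values[i] - values[j]) ** 2
    suspect = (dist < close_dist) & (gamma > gamma_cutoff)
    return np.column_stack([i[suspect], j[suspect]])

rng = np.random.default_rng(9)
xy = rng.uniform(0, 100, size=(80, 2))
z = 0.05 * xy[:, 0] + rng.normal(0, 0.5, 80)
z[10] += 8.0                                     # inject a local outlier
pairs = flag_cloud_outliers(xy, z, close_dist=15.0, gamma_cutoff=10.0)
print("Suspect pairs (point indices):")
print(pairs)
```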
Steps:
Click the point or polygon feature layer in the ArcMap table of contents that you
want to explore.
Steps:
Click the point or polygon feature layer in the ArcMap table of contents that you
want to explore.
When viewing the Voronoi map, check whether any neighborhood contains polygons whose colors symbolize very different classes of values. This is data dependent, but in general, a polygon whose class is separated by two or more classes from the classes of its surrounding polygons should catch your attention as a potential outlier.
Since all types of Voronoi map except the Simple type are based on neighborhood calculations, the Simple Voronoi map should be examined when searching for outliers.
You may be interested in mapping a trend, or you might want to remove a trend from the
dataset when using kriging. The Trend Analysis tool can help identify trends in the input
dataset.
The Trend Analysis tool provides a three-dimensional perspective of the data. The
locations of sample points are plotted on the x,y plane. Above each sample point, the
value is given by the height of a stick in the z-dimension. A unique feature of the Trend
Analysis tool is that the values are then projected onto the x,z plane and the y,z plane as
scatterplots. This can be thought of as sideways views through the three-dimensional data.
Polynomials are then fit through the scatterplots on the projected planes. An additional
feature is that you can rotate the data to isolate directional trends. The tool also includes
other features that allow you to rotate and vary the perspective of the whole image,
change size and color of points and lines, remove planes and points, and select the order
of the polynomial that is to fit the scatterplots. By default, the tool will select second-order
polynomials to show trends in the data, but you may want to investigate polynomials of
order one and three to assess how well they fit the data.
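Conceptually, the tool projects the (x, y, z) points onto the x,z and y,z planes and fits a polynomial to each projection. A rough NumPy sketch of that idea, using made-up sample data and not the tool's own implementation, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 200)
y = rng.uniform(0, 100, 200)
# Hypothetical attribute with an upside-down-U (second-order) trend plus noise
z = 50 - 0.02 * (x - 50) ** 2 - 0.015 * (y - 50) ** 2 + rng.normal(0, 2, 200)

# Project onto the x,z and y,z planes and fit second-order polynomials,
# mimicking the default behavior described above.
coeffs_xz = np.polyfit(x, z, deg=2)
coeffs_yz = np.polyfit(y, z, deg=2)
print("x,z projection:", np.poly1d(coeffs_xz))
print("y,z projection:", np.poly1d(coeffs_yz))

# Comparing fits of order 1, 2, and 3 helps judge which trend order is supported.
for deg in (1, 2, 3):
    resid = z - np.polyval(np.polyfit(x, z, deg), x)
    print(f"order {deg}: residual RMS along x = {np.sqrt(np.mean(resid ** 2)):.2f}")
```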
Trend Analysis tool
The Trend Analysis tool raises the points above a plot of the study site to the height of the
values of the attribute of interest in a three-dimensional plot of the study area. The points
are then projected in two directions (by default, north and west) onto planes that are
perpendicular to the map plane. A polynomial curve is fit to each projection. The entire
map surface can be rotated in any direction, which also changes the direction represented
by the projected planes. If the curve through the projected points is flat, no trend exists, as
shown by the blue line in the illustration below.
Trend analysis flat
If there is a definite pattern to the polynomial, such as an upward curve as shown by the
green line in the illustration above, this suggests that there is a trend in the data.
In the example below, the trend is accentuated, and it demonstrates a strong upside-down
U shape. This suggests that a second-order polynomial can be fit to the data. Through the
refinement allowed in the Trend Analysis tool, the true direction of the trend can be
identified. In this case, its strongest influence is from the center of the region toward all the
borders (that is, the highest values occur in the center of the region, and lower values
occur near the edges).
Trend analysis upside-down U shape
To identify a global trend in your data, look for a curve that is not flat on the projected
plane.
If you have a global trend in your data, you may want to create a surface using one of the
deterministic interpolation methods (for example, global or local polynomial), or you may
wish to remove the trend when using kriging.
Steps:
Click the point or polygon feature layer in the ArcMap table of contents that you
want to explore.
Click the Geostatistical Analyst drop-down menu on the Geostatistical Analyst
toolbar, click Explore Data, then click Trend Analysis.
On the Trend Analysis interface, click the Trend and Projections choice under the
Graph Options.
Explore the bold lines on the vertical walls of the graph. These lines indicate trends.
One trend line runs along the x-axis (typically showing the longitudinal trend), while the
other shows the trend along the y-axis (typically the latitudinal trend). It is very useful to
change the Order of Polynomial while examining the trends.
Tip: It can be very helpful to check for trends in directions other than the standard N–S
and E–W. To enable such a view, rotate the trend axes by scrolling the upper wheel on the
right-hand side of the tool, just under the main display window.
A surface may be made up of two main components: a fixed global trend and random
short-range variation. The global trend is sometimes referred to as the fixed mean
structure. Random short-range variation (sometimes referred to as random error) can be
modeled in two parts: spatial autocorrelation and the nugget effect.
If you decide a global trend exists in your data, you must decide how to model it. Whether
you use a deterministic method or a geostatistical method to create a surface usually
depends on your objective. If you want to model just the global trend and create a smooth
surface, you may use a global or local polynomial interpolation method to create a final
surface. However, you may want to incorporate the trend in a geostatistical method (for
instance, remove the trend and model the remaining component as random short-range
variation). The main reason to remove a trend in geostatistics is to satisfy stationarity
assumptions. Trends should only be removed if there is justification for doing so.
If you remove the global trend in a geostatistical method, you will be modeling the random
short-range variation in the residuals. However, before making an actual prediction, the
trend will be automatically added back so that you obtain reasonable results.
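As a rough illustration of removing a trend, modeling the residuals, and adding the trend back, the following hedged sketch fits a first-order trend surface by least squares and restores it at prediction time. It is a conceptual outline only; the residual interpolator is a placeholder, not the method Geostatistical Analyst uses internally.

```python
import numpy as np

def fit_first_order_trend(x, y, z):
    """Least-squares plane z = b0 + b1*x + b2*y (the fixed global trend)."""
    A = np.column_stack([np.ones_like(x), x, y])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

rng = np.random.default_rng(2)
x, y = rng.uniform(0, 10, 100), rng.uniform(0, 10, 100)
z = 5 + 0.8 * x - 0.3 * y + rng.normal(0, 0.5, 100)   # trend + short-range variation

b = fit_first_order_trend(x, y, z)
trend = b[0] + b[1] * x + b[2] * y
residuals = z - trend                     # modeled as random short-range variation

# ... interpolate the residuals at a new location (x0, y0) by some method ...
x0, y0 = 5.0, 5.0
residual_prediction = residuals.mean()    # placeholder for a real interpolator

# The trend is added back before reporting the prediction.
prediction = (b[0] + b[1] * x0 + b[2] * y0) + residual_prediction
print(prediction)
```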
If you deconstruct your data into a global trend plus short-range variation, you are
assuming that the trend is fixed and the short-range variation is random. Here, random
does not mean unpredictable, but rather that it is governed by rules of probability that
include dependence on neighboring values called autocorrelation. The final surface is the
sum of the fixed and random surfaces. That is, think of adding two layers, one that never
changes and another that changes randomly. For example, suppose you are studying
biomass. If you were to go back in time 1,000 years and start over to the present day, the
global trend of the biomass surface would be unchanged. However, the short-range
variation of the biomass surface would change. The unchanging global trend could be due
to fixed effects such as topography. Short-range variation could be caused by less
permanent features that could not be observed through time, such as precipitation, so it is
assumed it is random and likely to be autocorrelated.
If you can identify and quantify the trend, you will gain a deeper understanding of your data
and make better decisions. If you remove the trend, you will be able to model the random
short-range variation more accurately, because the global trend will no longer interfere with
the stationarity assumption that kriging makes about the data.
Voronoi maps are constructed from a series of polygons formed around the location of a
sample point.
Voronoi polygons are created so that every location within a polygon is closer to the
sample point in that polygon than any other sample point. After the polygons are created,
neighbors of a sample point are defined as any other sample point whose polygon shares
a border with the chosen sample point. For example, in the following figure, the bright
green sample point is enclosed by a polygon, which has been highlighted in red. Every
location within the red polygon is closer to the bright green sample point than to any other
sample point (given as small dark blue dots). The blue polygons all share a border with the
red polygon, so the sample points within the blue polygons are neighbors of the bright
green sample point.
Using this definition of neighbors, a variety of local statistics can be computed. For
example, a local mean is computed by taking the average of the sample points in the red
and blue polygons. This average is then assigned to the red polygon. This process is
repeated for all polygons and their neighbors, and the results are shown using a color
ramp to help visualize regions of high and low local values.
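To make the neighbor definition concrete, the sketch below uses SciPy's Delaunay triangulation (not the Geostatistical Analyst implementation) to find Voronoi polygons that share a border: for points in general position, two Voronoi cells share an edge exactly when the points are Delaunay neighbors. The local mean is then the average over each point and its neighbors.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(3)
points = rng.uniform(0, 100, size=(50, 2))   # sample locations
values = rng.normal(20, 4, size=50)          # measured attribute

# Voronoi cells share a border exactly when the points are Delaunay neighbors
# (for points in general position), so the triangulation gives the neighbor lists.
tri = Delaunay(points)
indptr, indices = tri.vertex_neighbor_vertices

local_mean = np.empty(len(points))
for i in range(len(points)):
    neighbors = indices[indptr[i]:indptr[i + 1]]
    local_mean[i] = values[np.append(neighbors, i)].mean()  # polygon + its neighbors

print(local_mean[:5])
```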
The Voronoi Map tool provides the following methods to assign or calculate values for the
polygons.
Simple: The value assigned to a polygon is the value recorded at the sample point
within that polygon.
Mean: The value assigned to a polygon is the mean value that is calculated from
the polygon and its neighbors.
Mode: All polygons are categorized using five class intervals. The value assigned
to a polygon is the mode (most frequently occurring class) of the polygon and its
neighbors.
Cluster: All polygons are categorized using five class intervals. If the class interval
of a polygon is different from each of its neighbors, the polygon is colored gray and
put into a sixth class to distinguish it from its neighbors.
Entropy: All polygons are categorized using five classes based on a natural
grouping of data values (smart quantiles). The value assigned to a polygon is the
entropy calculated from the polygon and its neighbors—that is,
E = -Σ (pi * log2 pi)
where pi is the proportion of polygons that are assigned to each class (a worked
calculation appears in the sketch after this list). For example, consider a polygon
surrounded by four neighbors (a total of five polygons). The values are placed into
the corresponding classes:
Class   Frequency   pi
1       3           3/5
2       0           0
3       1           1/5
4       0           0
5       1           1/5
Entropy class/frequency
Minimum entropy occurs when the polygon values are all located in the same class. Then
Emin = -[1 * log2(1)] = 0
Maximum entropy occurs when each polygon value is located in a different class interval.
Then
Emax = -[0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2) + 0.2 * log2(0.2)] = 2.322
Median: The value assigned to a polygon is the median value calculated from the
frequency distribution of the polygon and its neighbors.
Interquartile Range: The first and third quartiles are calculated from the frequency
distribution of a polygon and its neighbors. The value assigned to the polygon is the
interquartile range calculated by subtracting the value of the first quartile from the
value of the third quartile.
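As a worked check of the entropy figures above (an illustrative sketch, not ArcGIS code), the entropy of a set of class labels is -Σ (pi * log2 pi); it is 0 when all five polygons fall in one class and about 2.322 when each falls in a different class, matching the values given earlier.

```python
import numpy as np

def entropy(class_labels):
    """Entropy of the class labels of a polygon and its neighbors: -sum(p_i * log2 p_i)."""
    _, counts = np.unique(class_labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy([1, 1, 1, 3, 5]))   # the table above: 3/5, 1/5, 1/5 -> about 1.371
print(entropy([2, 2, 2, 2, 2]))   # minimum entropy: all in one class -> 0.0
print(entropy([1, 2, 3, 4, 5]))   # maximum entropy: all different    -> about 2.322
```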
The Voronoi statistics can be used for different purposes, ranging from local smoothing
(Mean, Mode, Median) and local variation (Entropy, Interquartile Range) to local outlier
identification (Cluster), while the Simple map shows the raw recorded values.
By exploring your data, you'll gain a better understanding of the spatial autocorrelation
among the measured values. This understanding can be used to make better decisions
when choosing models for spatial prediction.
Spatial autocorrelation
You can explore the spatial autocorrelation in your data by examining the different pairs of
sample locations. By measuring the distance between two locations and plotting the
difference squared between the values at the locations, a semivariogram cloud is created.
On the x-axis is the distance between the locations, and on the y-axis is the difference of
their values squared. Each dot in the semivariogram represents a pair of locations, not the
individual locations on the map.
If spatial correlation exists, pairs of points that are close together (on the far left of the x-
axis) should have less difference (be low on the y-axis). As points become farther away
from each other (moving right on the x-axis), in general, the difference squared should be
greater (moving up on the y-axis). Often there is a certain distance beyond which the
squared difference levels out. Pairs of locations beyond this distance are considered to be
uncorrelated.
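A semivariogram cloud of this kind can be sketched in a few lines of NumPy/SciPy (purely illustrative; the tool itself is not implemented this way): every pair of locations contributes one point, with separation distance on the x-axis and squared value difference on the y-axis. Flagging pairs that are close together yet very different in value is one way to shortlist candidate local outliers.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
coords = rng.uniform(0, 1000, size=(100, 2))   # sample locations (map units)
values = rng.normal(0, 1, size=100)            # measured attribute

# One point per pair of locations: x = separation distance,
# y = squared difference of the two values (some texts plot half this quantity).
distances = pdist(coords)                                    # all pairwise distances
sq_diffs = pdist(values.reshape(-1, 1), metric="sqeuclidean")

# Pairs that are close together but very different in value are candidate local outliers.
close = distances < np.percentile(distances, 10)
suspicious = close & (sq_diffs > np.percentile(sq_diffs, 95))
print(f"{suspicious.sum()} close-but-dissimilar pairs out of {len(distances)}")
```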
A fundamental assumption for geostatistical methods is that any two locations that have a
similar distance and direction from each other should have a similar difference squared.
This relationship is called stationarity.
Spatial autocorrelation may depend only on the distance between two locations, which is
called isotropy. However, it is possible that the same autocorrelation value may occur at
different distances when considering different directions. Another way to think of this is that
things are more alike for longer distances in some directions than in other directions. This
directional influence is seen in semivariograms and covariances and is called anisotropy.
It is important to look for anisotropy so that if you detect directional differences in the
autocorrelation, you can account for them in the semivariogram or covariance models.
This in turn has an effect on the geostatistical prediction.
In the previous example, you used the Semivariogram/Covariance Cloud tool to look at the
general autocorrelation of the data. However, looking at the semivariogram surface, it
appears that there might be directional differences in the semivariogram values. When you
click Show search direction and set the angles and bandwidths as in the following figure,
you can see that the linked locations have very similar values, as indicated by the relatively
low semivariogram values.
Directional variation
If you change the direction of the links as in the following figure, you can see that some
linked locations have values that are quite different, which result in the higher
semivariogram values. This indicates that locations separated by a distance of about
125,000 meters in the northeast direction are, on average, more different than locations in
the northwest direction. Recall that when variation changes more rapidly in one direction
than another, it is termed anisotropy. When interpolating a surface using the Geostatistical
Analyst wizard, you can use semivariogram models that account for anisotropy.
Modified directional variation
In the illustration above, each red dot shows the empirical semivariogram value (the
squared difference between the values of two data points making up a pair) plotted against
the distance separating the two points. You can brush the dots and see the linked pairs of
points in ArcMap.
A Semivariogram Surface with Search Direction capabilities is shown below. The values in
the semivariogram cloud are put into bins based on the direction and distance between a
pair of locations. These binned values are then averaged and smoothed to produce the
semivariogram surface. In the figure below, the legend shows the values between color
transitions. In this tool, you can input a lag size to control the size of the bins, and the
number of bins is determined by the number of lags you specify. The extent of the
semivariogram surface is controlled by lag size and number of lags.
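The binning just described can be paraphrased as follows (an illustrative NumPy sketch, with invented parameter values): squared value differences for all pairs are grouped by lag and direction and then averaged per bin, which is the raw material of the semivariogram surface before smoothing.

```python
import numpy as np

def directional_bins(coords, values, lag_size, n_lags, n_directions=12):
    """Average squared value differences into (lag, direction) bins."""
    n = len(coords)
    i, j = np.triu_indices(n, k=1)
    dx, dy = coords[j, 0] - coords[i, 0], coords[j, 1] - coords[i, 1]
    dist = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % np.pi        # direction is axial (0..180 degrees)
    sq_diff = (values[i] - values[j]) ** 2

    lag_bin = (dist / lag_size).astype(int)   # pairs beyond n_lags fall outside the surface
    dir_bin = (angle / (np.pi / n_directions)).astype(int) % n_directions

    surface = np.full((n_lags, n_directions), np.nan)
    for l in range(n_lags):
        for d in range(n_directions):
            mask = (lag_bin == l) & (dir_bin == d)
            if mask.any():
                surface[l, d] = sq_diff[mask].mean()
    return surface

rng = np.random.default_rng(5)
coords = rng.uniform(0, 1000, size=(150, 2))
values = rng.normal(0, 1, size=150)
print(directional_bins(coords, values, lag_size=100.0, n_lags=10).shape)
```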
You can view subsets of values in the semivariogram cloud by checking the Show search
direction box and clicking the direction controller to resize it or change its orientation.
You select the dataset and attribute to analyze using the Data Source controls of the tool.
Pairs of points that are close together (to the left on the x-axis in the semivariogram)
should be more alike (low on the y-axis) than those that are farther apart (moving to the
right on the x-axis).
If the pairs of points in the semivariogram produce a horizontal straight line, there may be
no spatial correlation in the data and it would be meaningless to interpolate the data.
Steps:
Click the point or polygon feature layer in the ArcMap table of contents that you
want to explore.
Click the Geostatistical Analyst arrow on the Geostatistical Analyst toolbar, click
Explore Data, then click Semivariogram/Covariance Cloud.
The Crosscovariance cloud shows the empirical crosscovariance for all pairs of locations
between two datasets and plots them as a function of the distance between the two
locations, as in the example shown below:
Crosscovariance cloud illustration
The Crosscovariance cloud can be used to examine the local characteristics of spatial
correlation between two datasets, and it can be used to look for spatial shifts in correlation
between two datasets. A crosscovariance cloud looks something like this:
Crosscovariance cloud example
In the illustration above, each red dot shows the empirical crosscovariance between a pair
of locations, with the attribute of one point taken from the first dataset and the attribute of
the second point taken from the second dataset. You can brush dots and see the linked
pairs in ArcMap. (To differentiate which of the pairs came from which dataset, set a
different selection color in the Properties dialog box of each dataset.)
A covariance surface with search direction capabilities is also provided in the tool. The
values in the crosscovariance cloud are put into bins based on the direction and distance
separating a pair of locations. These binned values are then averaged and smoothed to
produce a crosscovariance surface. The legend shows the colors and values separating
classes of covariance value. The extent of the crosscovariance surface is controlled by lag
size and number of lags that you specify.
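The quantity plotted in the cloud can be written down directly. In the hedged sketch below (not the tool's code), the contribution of a pair, one location from each dataset, is the product of the two values' deviations from their respective dataset means, plotted against the separation distance:

```python
import numpy as np

def crosscovariance_cloud(coords1, z1, coords2, z2):
    """For every pair (one point from each dataset): separation distance and
    (z1 - mean(z1)) * (z2 - mean(z2)), the empirical cross-covariance contribution."""
    d1 = z1 - z1.mean()
    d2 = z2 - z2.mean()
    # distance between every location in dataset 1 and every location in dataset 2
    diff = coords1[:, None, :] - coords2[None, :, :]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    cross = d1[:, None] * d2[None, :]
    return dist.ravel(), cross.ravel()

rng = np.random.default_rng(6)
coords_o3 = rng.uniform(0, 500, size=(60, 2))    # e.g., ozone stations (illustrative)
coords_no2 = rng.uniform(0, 500, size=(40, 2))   # e.g., NO2 stations (illustrative)
o3 = rng.normal(0.05, 0.01, 60)
no2 = rng.normal(30.0, 5.0, 40)

dist, cross = crosscovariance_cloud(coords_o3, o3, coords_no2, no2)
print(dist.shape, cross.shape)                   # 60 * 40 = 2400 pairs each
```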
Crosscovariance surface
Directional controller
You can view subsets of values in the Crosscovariance cloud by checking the Show
search direction option and dragging the directional controller to resize it or change its
orientation.
In this case, OZONE is the attribute field storing the ozone concentrations, and NO2AAM
is the attribute field storing the level of nitrogen dioxide.
You can click the arrow button Hide beside Data Sources to temporarily hide this part of
the tool.
The Crosscovariance Cloud tool can be used to investigate cross-correlation between two
datasets. Consider ozone (dataset 1) and NO2 (dataset 2). Notice that the cross-
correlation between NO2 and ozone seems to be asymmetric. The red area shows that the
highest correlation between both datasets occurs when taking NO2 values that are shifted
to the west of the ozone values. The Search Direction tool will help identify the reasons for
this. When it is pointed toward the west, this is shown:
Search Direction tool (west)
It is clear that there are higher covariance values when the Search Direction tool is pointed
toward the west. You can use the Crosscovariance Cloud and Histogram tools to examine
which pairs contribute the highest cross-covariance values. If you use the Search Direction
tool pointed in the west direction and brush some of the high cross-covariance points in
the cloud, you can see that most of the corresponding data points are located in central
California. You can also see that the NO2 values are shifted to the west of the ozone
values. The histograms show that the high covariance values occur because both NO2
(blue bars in the NO2 histogram) and ozone (orange bars in the ozone histogram) values
for the selected points are above the mean NO2 and ozone values, respectively. From this
analysis, you have learned that much of the asymmetry in the cross-covariance is due to a
shift occurring because high NO2 values occur to the west of the high ozone values.
You could also obtain high cross-covariance values whenever the pairs selected from both
datasets have values that are below their respective means. In fact, you would expect to
see high cross-covariance values from pairs of locations that are both above and below
their respective means, and these would occur in several regions within the study area. By
exploring the data, you can identify that the cross-covariance in central California seems to
be different from that in the rest of the state. Based on this information, you might decide
that the results from Crosscovariance Cloud are due to a nonconstant mean in the data
and try to remove trends from both NO2 and ozone.
The Crosscovariance Cloud tool shows the covariance between two datasets. Use the
Search Direction tool to examine the cloud. Check if the covariance surface is symmetric
and the cross-covariance values are similar in all directions.
If you see that there is a spatial shift in the values of two datasets or unusually high cross-
covariance values, you can investigate where these occur. If you note that unusual cross-
covariance values occur for isolated locations or within restricted areas of your study site,
you may want to take some action, such as investigating those data values, detrending
data, or splitting the data into different strata before interpolating it.
Steps:
Right-click the point feature layer identifying the first layer in the cross-covariance
analysis in the ArcMap table of contents and click Properties.
Click the symbol button (located under the "with this symbol" option).
Repeat steps 1–5 for the second layer to be used in the cross-covariance analysis,
but choose different selection sizes and colors.
Highlight the layers in the ArcMap table of contents by holding down CTRL while
clicking the two layers.
Choose the appropriate attribute for each layer in the Attribute list.
Click the center line of the Search Direction tool and drag it until it points to the
angle where you believe there is a shift.
Brush some points in the covariance cloud by clicking and dragging over some of
the red points. Examine where the pairs of points are on the ArcMap map.
3. Creating surfaces
The Geostatistical Analyst addresses a wide range of different application areas. The
following is a small sampling of applications in which Geostatistical Analyst was used.
Using measured sample points from a study area, Geostatistical Analyst was used to
create accurate predictions for other unmeasured locations within the same area.
Exploratory spatial data analysis tools included with Geostatistical Analyst were used to
assess the statistical properties of data such as spatial data variability, spatial data
dependence, and global trends.
A number of exploratory spatial data analysis tools were used in the example below to
investigate the properties of ozone measurements taken at monitoring stations in the
Carpathian Mountains.
Kriging
A number of kriging methods are available for surface creation in Geostatistical Analyst,
including ordinary, simple, universal, indicator, probability, and disjunctive kriging.
The following illustrates the two phases of geostatistical analysis of data. First, the
Semivariogram/Covariance wizard was used to fit a model to winter temperature data for
the United States. This model was then used to create the temperature distribution map.
Geostatistical Analyst application for winter temperature
Various types of map layers can be produced using Geostatistical Analyst, including
prediction maps, quantile maps, probability maps, and prediction standard error maps.
Probability maps can be generated to predict where values exceed a critical threshold.
In the example below, locations shown in dark orange and red indicate a probability
greater than 62.5 percent that radiocesium contamination exceeds the upper permissible
level (critical threshold) in forest berries.
Input data can be split into two subsets. The first subset of the available data can be used
to develop a model for prediction. The predicted values are then compared with the known
values at the remaining locations using the Validation tool.
The following shows the Validation wizard used to assess a model developed to predict
organic matter for a farm in Illinois.
In the following example, exploratory spatial data analysis tools are used to explore spatial
correlation between ozone (primary variable) and nitrogen dioxide (secondary variable) in
California. Because the variables are spatially correlated, cokriging can use the nitrogen
dioxide data to improve predictions when mapping ozone.
Geostatistical Analyst application for ozone in California
Generally speaking, things that are closer together tend to be more alike than things that
are farther apart. This is a fundamental geographic principle. Suppose you are a town
planner and need to build a scenic park in your town. You have several candidate sites,
and you may want to model the viewsheds at each location. This will require a more
detailed elevation surface dataset for your study area. Suppose you have preexisting
elevation data for 1,000 locations throughout the town. You can use this to build a new
elevation surface.
When trying to build the elevation surface, you can assume that the sample values closest
to the prediction location will be similar. But how many sample locations should you
consider? Should all of the sample values be considered equally? As you move farther
away from the prediction location, the influence of the points will decrease. Considering a
point too far away may actually be detrimental because the point may be located in an
area that is dramatically different from the prediction location.
One solution is to consider enough points to give a good prediction, but few enough points
to be practical. The number will vary with the amount and distribution of the sample points
and the character of the surface. If the elevation samples are relatively evenly distributed
and the surface characteristics do not change significantly across your landscape, you can
predict surface values from nearby points with reasonable accuracy. To account for the
distance relationship, the values of closer points are usually weighted more heavily than
those farther away. This principle is common to all the interpolation methods offered in
Geostatistical Analyst (except for global polynomial interpolation, which assigns equal
weights to all points).
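The closer-points-get-more-weight principle is easiest to see in inverse distance weighting, one of the interpolation methods offered. The following is a minimal sketch under that assumption (illustrative parameter values; not the product's implementation):

```python
import numpy as np

def idw_predict(coords, values, x0, power=2.0, k=12):
    """Inverse distance weighted prediction at x0 from the k nearest samples."""
    d = np.hypot(coords[:, 0] - x0[0], coords[:, 1] - x0[1])
    nearest = np.argsort(d)[:k]                 # a simple circular search neighborhood
    d_near = np.maximum(d[nearest], 1e-12)      # avoid division by zero at a sample point
    w = 1.0 / d_near ** power                   # closer points get larger weights
    w /= w.sum()
    return float(np.dot(w, values[nearest]))

rng = np.random.default_rng(7)
coords = rng.uniform(0, 1000, size=(1000, 2))   # e.g., the 1,000 elevation samples
elev = 200 + 0.05 * coords[:, 0] + rng.normal(0, 2, 1000)
print(idw_predict(coords, elev, x0=(500.0, 500.0)))
```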
You can assume that as locations get farther from the prediction location, the measured
values have less spatial autocorrelation with the prediction location. As these points have
little or no effect on the predicted value, they can be eliminated from the calculation of that
particular prediction point by defining a search neighborhood. It is also possible that distant
locations may have a detrimental influence on the predicted value if they are located in an
area that has different characteristics than those of the prediction location. A third reason
to use search neighborhoods is computational speed. If you have 2,000 data locations, for
example, using all of them in every prediction would require solving a very large matrix
system, which is slow and can be impractical. The smaller the search neighborhood, the
faster the predicted values can be generated. As a result, it is common practice to limit the
number of points used in a prediction by specifying a search neighborhood.
The specified shape of the neighborhood restricts how far and where to look for the
measured values to be used in the prediction. Additional parameters restrict the locations
that are used within the search neighborhood. The search neighborhood can be altered by
changing its size and shape or by changing the number of neighbors it includes.
The shape of the neighborhood is influenced by the input data and the surface that you are
trying to create. If there are no directional influences in the spatial autocorrelation of your
data (see Accounting for directional influences for more information), you will want to use
points equally in all directions, and the shape of the search neighborhood is a circle.
However, if there is directional autocorrelation or a trend in the data, you may want the
shape of your neighborhood to be an ellipse oriented with the major axis parallel to the
direction of long-range autocorrelation (the direction in which the data values are most
similar).
The search neighborhood can be specified in the Geostatistical Wizard, as shown in the
following example:
Maximum neighbors = 4
Minimum neighbors = 2
Sector type (search strategy): Circle with four quadrants with 45° offset; radius =
182955.6
The Weights section lists the weights that are used to estimate the value at the location
marked by the crosshair on the preview surface. The data points with the largest weights
are highlighted in red.
Once a neighborhood shape is specified, you can restrict which locations within the shape
should be used. You can define the maximum and minimum number of neighbors to
include and divide the neighborhood into sectors to ensure that you include values from all
directions. If you divide the neighborhood into sectors, the specified maximum and
minimum number of neighbors is applied to each sector.
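The neighbor selection just described, an ellipse divided into sectors with a cap on the number of neighbors kept per sector, can be sketched as follows. The parameter names echo the wizard, but the code is only an illustration: the wizard measures the ellipse angle from north, this sketch uses a generic rotation, and the handling of the minimum-per-sector rule (extending the search) is omitted.

```python
import numpy as np

def sector_neighbors(coords, x0, major, minor, angle_deg,
                     n_sectors=4, max_per_sector=5, sector_offset_deg=0.0):
    """Indices of points inside an ellipse around x0, keeping at most
    max_per_sector of the closest points in each angular sector."""
    theta = np.deg2rad(angle_deg)
    dx, dy = coords[:, 0] - x0[0], coords[:, 1] - x0[1]
    # rotate into the ellipse's local frame (major axis along local x)
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    inside = (u / major) ** 2 + (v / minor) ** 2 <= 1.0

    ang = (np.degrees(np.arctan2(v, u)) - sector_offset_deg) % 360.0
    sector = (ang / (360.0 / n_sectors)).astype(int)
    dist = np.hypot(u, v)

    chosen = []
    for s in range(n_sectors):
        idx = np.where(inside & (sector == s))[0]
        chosen.extend(idx[np.argsort(dist[idx])][:max_per_sector])
    return np.array(chosen, dtype=int)

rng = np.random.default_rng(8)
pts = rng.uniform(0, 1000, size=(300, 2))
print(sector_neighbors(pts, x0=(500.0, 500.0), major=200.0, minor=120.0,
                       angle_deg=35.0, n_sectors=4, max_per_sector=5))
```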
Eight sectors
Using the data configuration specified by the search neighborhood in conjunction with the
fitted semivariogram model, kriging determines weights for the measured locations. Using
the weights and the measured values, a prediction can be made for the prediction location.
This process is performed for each location within the study area to create a continuous
surface. Other interpolation methods follow the same process, but the weights are
determined using techniques that do not involve a semivariogram model.
The Smooth Interpolation option creates three ellipses. The central ellipse uses the Major
semiaxis and Minor semiaxis values. The inner ellipse uses these semiaxis values
multiplied by 1 minus the value for Smoothing factor, whereas the outer ellipse uses the
semiaxis values multiplied by 1 plus the smoothing factor. All the points within these three
ellipses are used in the interpolation. Points inside the smallest ellipse have weights
assigned to them in the same way as for standard interpolation (for example, if the
method being used is inverse distance weighted interpolation, the points within the
smallest ellipse are weighted based on their distance from the prediction location). The
points that fall between the smallest ellipse and the largest ellipse get weights as
described for the points falling inside the smallest ellipse, but then the weights are
multiplied by a sigmoidal value that decreases from 1 (for points located just outside the
smallest ellipse) to 0 (for points located just outside the largest ellipse). Data points outside
the largest ellipse have zero weight in the interpolation. An example of this is shown
below:
Geostatistical wizard showing weights for data points
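The weighting scheme of the Smooth Interpolation option can be paraphrased as follows. Because the exact sigmoidal function is not given here, a smoothstep-style curve stands in for it in this hedged sketch:

```python
import numpy as np

def smoothing_multiplier(ellipse_dist, smoothing_factor):
    """Multiplier applied to a point's base weight under Smooth Interpolation.

    ellipse_dist is the point's normalized ellipse distance (1.0 = on the ellipse
    defined by the Major/Minor semiaxes). The inner and outer ellipses sit at
    (1 - s) and (1 + s) times the semiaxes. The exact taper used by the software
    is not documented here, so a smoothstep curve stands in for "sigmoidal".
    """
    s = smoothing_factor
    inner, outer = 1.0 - s, 1.0 + s
    t = np.clip((ellipse_dist - inner) / (outer - inner), 0.0, 1.0)
    return 1.0 - (3 * t ** 2 - 2 * t ** 3)   # 1 inside the inner ellipse, 0 outside the outer

# Points well inside keep full weight; points beyond the outer ellipse get zero weight.
print(smoothing_multiplier(np.array([0.5, 1.0, 1.2, 1.6]), smoothing_factor=0.5))
```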
In Geostatistical Analyst, the weights for all nonkriging models are defined by a priori
analytic functions based on the distance from the prediction location. Most kriging models
predict a value using the weighted sum of the values of the nearby locations. Kriging uses
the semivariogram to define the weights that determine the contribution of each data point
to the prediction of new values at unsampled locations. Because of this, the default search
neighborhood used in kriging is constructed using the major and minor ranges of the
semivariogram model.
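To show how a fitted semivariogram model turns into weights, here is a compact ordinary kriging sketch using a spherical model. The nugget, sill, and range values are invented for illustration, and the matrix solve follows the standard textbook formulation rather than the product's internal code:

```python
import numpy as np

def spherical(h, nugget=0.1, sill=1.0, range_=300.0):
    """Spherical semivariogram model gamma(h)."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / range_ - 0.5 * (h / range_) ** 3)
    return np.where(h >= range_, sill, np.where(h == 0, 0.0, g))

def ordinary_kriging_weights(coords, x0):
    """Solve the ordinary kriging system for the weights at location x0."""
    n = len(coords)
    d = np.hypot(coords[:, None, 0] - coords[None, :, 0],
                 coords[:, None, 1] - coords[None, :, 1])
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical(d)
    A[n, n] = 0.0                                   # Lagrange multiplier row/column
    b = np.ones(n + 1)
    b[:n] = spherical(np.hypot(coords[:, 0] - x0[0], coords[:, 1] - x0[1]))
    w = np.linalg.solve(A, b)
    return w[:n]                                    # weights sum to 1 by construction

rng = np.random.default_rng(9)
pts = rng.uniform(0, 500, size=(8, 2))              # a small search neighborhood
vals = rng.normal(10, 2, size=8)
w = ordinary_kriging_weights(pts, x0=(250.0, 250.0))
print(w.sum(), float(np.dot(w, vals)))              # weights sum to ~1.0; the prediction
```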
A continuous surface is expected to be created from continuous data, such as
temperature observations. However, all interpolators with a local searching
neighborhood generate predictions (and prediction standard errors) that can be
substantially different for nearby locations if the local neighborhoods are different. To see a
graphical representation of why this occurs, see Smooth interpolation.
The neighborhood search size defines the neighborhood shape and the constraints of the
points within the neighborhood that will be used in the prediction of an unmeasured
location.
You set the neighborhood parameters by looking for the locations of the points in the data
view window and using prior knowledge gained in ESDA and semivariogram/covariance
modeling.
The following are tips for altering the search neighborhood by changing the number of
neighbors:
The search in each sector is extended outward if the minimum number of points is
not found inside the sector.
If there are no points within the searching neighborhood, then for most of the
interpolation methods, it will mean that a prediction cannot be made at that
location.
Although some interpolators, such as simple and disjunctive kriging, predict values
in areas without data points using the mean value of the dataset, a common
practice is to change the searching neighborhood so that some points are located
in the searching neighborhood.
Use the step below as a guide to changing the search neighborhood for any of the
interpolation methods offered in Geostatistical Wizard (except for global polynomial
interpolation). The step applies once you have defined the interpolation method and data
you want to use and have advanced through the wizard until you have reached the
Searching Neighborhood window.
Steps:
Limit the number of neighbors to use for the prediction by changing the Maximum
neighbors and Minimum neighbors parameters.
These parameters control the number of neighbors included in each sector of the search
neighborhood. The number and orientation of the sectors can be changed by altering the
Sector type parameter.
Tip: The impact of the search neighborhood can be assessed using the cross-
validation and comparison tools that are available in Geostatistical Analyst. If necessary,
the search neighborhood can be redefined and another surface created.
3.2.4. Altering the search neighborhood by changing its size and shape
The neighborhood search size defines the neighborhood shape and the constraints of the
points within the neighborhood that will be used in the prediction of an unmeasured
location.
You set the neighborhood parameters by looking for the locations of the points in the data
view window and using prior knowledge gained in ESDA and semivariogram/covariance
modeling.
Use the steps below as a guide to changing the search neighborhood for any of the
interpolation methods offered in Geostatistical Wizard (except for global polynomial
interpolation).
Steps:
Sector type is used to alter the type of searching neighborhood by choosing from a
predefined list.
Type an Angle value between 0 and 360 degrees or choose from the drop-down
list. This is the orientation of the major semiaxis, measured in degrees from north.
The Major semiaxis and Minor semiaxis parameters are used to alter the shape of
the ellipse. The desired shape appears in the display window once these values
are entered.
The preview map is provided for all interpolation methods in the wizard, and it can be
manipulated using the controls in the toolbar above it.
Steps:
Click the Zoom In button above the map view, then drag a box around the area
of the map on which the zoom will occur.
Click the Zoom Out button above the map view, then drag a box around the
area of the map on which the zoom will occur.
Click the Pan button and move the pointer into the map display, click and hold,
then move the pointer to pan around the map display. The map moves in
coordination with the pointer.
Click the Full Extent button to display the map using the full extent.
Click the Show Layers button and choose which features to display.
Click the Back arrow or the Forward arrow to display the previous or next
extent.
Click the Change points size button to change the symbology of the input
points.
Click the Show Legend button to toggle between showing and hiding the map
legend.
Click the Identify value button and click a location on the map to make a
prediction and to highlight which points are used to make this prediction.
4.6.3. Histograms
4.6.4. Normal QQ plots and general QQ plots