Logistic Regression
• Often, the spatial phenomenon under investigation can
only be described by a categorical variable.
– Wildfires are typically depicted with polygons showing burned vs.
not burned areas
– Bird distributions indicate presence or absence of birds
• The previous regression technique is not suitable because the
dependent variable is neither interval nor ratio
• Logistic regression treats the distribution in a
probabilistic manner, that is, the occurrence of the study
phenomenon is evaluated in terms of probability
Logistic Regression
• If the probability of presence of a phenomenon is Pa, then Pb
represents the probability of its absence, and
$$P_a + P_b = 1$$
$$U_a = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
$$P_a = \frac{\mathrm{EXP}(U_a)}{1 + \mathrm{EXP}(U_a)}$$
Ua is the utility function of event a, expressed as a linear
combination of the explanatory variables X1, X2, ..., Xn, and
βn is the estimated parameter of variable Xn
Logistic Regression
• A greater value of Ua implies a greater
probability for the event to take place. When
Ua approaches infinity, Pa approaches 1,
indicating a high likelihood for the event to
occur. When Ua approaches negative infinity,
Pa approaches 0.
• When Ua equals zero, the probability is .50,
implying a 50/50 chance for the event to occur.
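The limiting behavior described above can be checked numerically. The following is a minimal sketch, not code from the original study; the function names and the sample utility values are illustrative assumptions.

```python
import math

def utility(coefficients, values):
    """Ua = b0 + b1*X1 + ... + bn*Xn; coefficients[0] is the intercept b0."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], values))

def event_probability(u):
    """Pa = EXP(Ua) / (1 + EXP(Ua)), arranged to avoid overflow for large |Ua|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

# Hypothetical utilities: strongly negative, zero, strongly positive
for u in (-10.0, 0.0, 10.0):
    print(u, round(event_probability(u), 4))
```

A utility of zero gives a probability of exactly .50, and the probability moves toward 0 or 1 as the utility becomes strongly negative or positive.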
Logistic Regression Example
• Example from Chou
• Fires in San Jacinto Ranger District of the San
Bernardino National Forest were examined to
map the distribution of fire occurrence
probability. The basic model consisted of eight
independent variables
– Area, perimeter, vegetation, proximity to buildings,
proximity to campgrounds, proximity to roads,
maximum temperature in July, and annual precipitation
Variables in Fire Distribution
Study
• X1 Area: area of geographic unit
X2 Perimeter: perimeter of geographic unit
X3 Vegetation: vegetation computed by rotation period
X4 Building: proximity to structures
X5 Campground: proximity to campgrounds
X6 Road: proximity to roads
X7 Temperature: maximum temperature in July
X8 Precipitation: annual precipitation
• The dependent variable is a code indicating whether or not a geographic unit is
burned. Area and perimeter provide general geometric characteristics. Vegetation,
precipitation, and temperature represent environmental factors, while building,
campground, and road represent human-related factors
Results of Logistic Regression
• The model indicates that perimeter, vegetation, campground,
road, and temperature are variables to be included in the model.
Other variables are not included as they are not statistically
different from 0.

Variable   Coefficient   Chi-square   P-Value
X0         -6.3246       31.13        0
X1         0             1.42         0.234
X2         -0.0002       8.13         0.0043
X3         1.5577        43.65        0
X4         -1.1451       1.93         0.1648
X5         -294.58       4.61         0.0318
X6         -0.5244       4.46         0.0348
X7         0.179         28.19        0
X8         0.0023        0.21         0.6493

Log Likelihood: -1366
PCE: 60
Critical chi-square: 3.84 for alpha = .05
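The variable selection above amounts to comparing each chi-square statistic with the critical value for one degree of freedom. A minimal sketch of that comparison follows, using the chi-square values from the results table. The helper name `p_value_df1` is an illustrative assumption, and the p-value formula is the standard upper-tail probability for a one-degree-of-freedom chi-square, not something stated in the source.

```python
import math

# Chi-square statistics copied from the basic-model results table
chi_square = {
    "X1 Area": 1.42, "X2 Perimeter": 8.13, "X3 Vegetation": 43.65,
    "X4 Building": 1.93, "X5 Campground": 4.61, "X6 Road": 4.46,
    "X7 Temperature": 28.19, "X8 Precipitation": 0.21,
}

CRITICAL = 3.84  # chi-square critical value, 1 degree of freedom, alpha = .05

def p_value_df1(stat):
    """Upper-tail probability of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(stat / 2.0))

# Keep only variables whose statistic exceeds the critical value
significant = [name for name, stat in chi_square.items() if stat > CRITICAL]
print(significant)
```

The variables that pass the test are perimeter, vegetation, campground, road, and temperature, in agreement with the bullet above.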
Results of Logistic Regression
• Percentage-correctly-estimated (PCE)
index shows the maximum level of
estimation accuracy of a model.
• In this example, PCE is 60%, not much
better than a random 50/50 chance.
• Therefore, another parameter was
evaluated…
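One way to compute a percentage-correctly-estimated index is sketched below. The exact definition used in the study is not given on the slide, so this assumes a unit is estimated as burned when its probability is at least 0.5; the function name and the five sample units are hypothetical.

```python
def percent_correctly_estimated(observed, probabilities, cutoff=0.5):
    """Share of units (in percent) whose estimated class matches the observed one."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    hits = sum(1 for o, y in zip(observed, predicted) if o == y)
    return 100.0 * hits / len(observed)

# Hypothetical units: 1 = burned, 0 = not burned
observed = [1, 0, 1, 1, 0]
probabilities = [0.8, 0.3, 0.4, 0.6, 0.7]
pce = percent_correctly_estimated(observed, probabilities)
print(pce)
```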
Alternative Model
• Included an additional variable to determine whether
it makes any significant difference in model
performance
– New variable represents neighborhood effects, or conditions
of the surrounding geographic units
– Assumes that fire occurrence probability is not only affected
by the environmental and human-related variables listed in
the basic model, but by the distribution of fire occurrence
probability of adjacent units
– The new spatial term X9 is defined by the percentage of
neighboring units that were burned during the study period
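The spatial term X9 can be derived from an adjacency list and a burned/not-burned code per unit. This is a minimal sketch under assumptions: the unit names, the adjacency structure, and the choice to express X9 on a 0 to 100 scale are all hypothetical, since the slide does not say how the study stored or scaled the value.

```python
def neighborhood_term(unit, neighbors, burned):
    """X9: percentage of the units adjacent to `unit` that were burned."""
    adjacent = neighbors[unit]
    if not adjacent:
        return 0.0  # isolated unit: no neighbors to count (assumed convention)
    return 100.0 * sum(burned[n] for n in adjacent) / len(adjacent)

# Hypothetical geographic units and adjacency
neighbors = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
burned = {"A": 0, "B": 1, "C": 1, "D": 0}

x9 = {unit: neighborhood_term(unit, neighbors, burned) for unit in neighbors}
print(x9)
```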
New Results
• Results from the new study are quite different
• Only two variables are statistically significant: vegetation and
neighborhood effects
• Vegetation appears to be the determining environmental
variable in the distribution of wildfires in the study area
• Finally, wildfires are influenced by neighborhood conditions

Variable   Coefficient   Chi-square   P-Value
X1         0             1.03         0.3106
X2         -0.0003       0.97         0.3249
X3         -1.6738       6.88         0.0087
X4         -0.8416       0.19         0.6669
X5         -42.28        0            0.9701
X6         1.0241        3            0.0831
X7         -0.1121       1            0.3168
X8         -0.0127       0.55         0.4597
X9         17.951        2359.3       0

Log Likelihood: -164.788
PCE: 97
Critical chi-square: 3.84 for alpha = .05
Testing Statistical Significance
• Did the neighborhood effects significantly change the model? Need to
use the chi-square test of the likelihood ratio
$$\lambda = \frac{L_0}{L_1}$$
• Where L0 denotes the likelihood of the basic model and L1 denotes the
likelihood of the study model
$$\log \lambda = \log L_0 - \log L_1 = -1366.197 - (-167.914) = -1198.283$$
$$-2 \log \lambda = 2396.566$$
• Statistical testing suggests that the neighborhood variable significantly
improved the performance of the model
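The likelihood-ratio test on this slide can be reproduced in a few lines. A minimal sketch, assuming the log-likelihoods are -1366.197 for the basic model and -167.914 for the model with the neighborhood term, as read from the figures on this slide, and one degree of freedom because one variable was added; the variable names are illustrative.

```python
log_l0 = -1366.197  # log-likelihood of the basic model
log_l1 = -167.914   # log-likelihood of the model with the neighborhood term

# Test statistic: -2 * (Log L0 - Log L1)
lr_statistic = -2.0 * (log_l0 - log_l1)

CRITICAL = 3.84  # chi-square critical value, 1 degree of freedom, alpha = .05
improved = lr_statistic > CRITICAL

print(round(lr_statistic, 3), improved)
```

The statistic is far above the critical value, which is why the added variable is judged to improve the model significantly.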
Procedure for Regression
Analysis (Barber, p. 448)
• Specify the variables in the model and the exact form of
the relationship between them
• Collect data
• Estimate the parameters of the model
• Statistically test the utility of the developed model, and
check whether the assumptions of the simple linear
regression model are satisfied
• Use the model for prediction
Example of Data
Manipulation and
Programming in ArcView
Manipulating Yield Data with
DataManipulation.ave
Spatial Prediction of
Landslide Hazard Using
Logistic Regression and GIS
Art Lembo
620 Presentation
Based on paper by Gorsevski,
Gessler, and Foltz
Introduction
• Landslides are natural geologic processes
that cause many types of damage,
costing billions of dollars and
thousands of lives each year
• 95% of landslides occur in developing
countries
Causes of Landslides
• Human activities, such as deforestation and
urban expansion, accelerate the process of
landslides
• Roads and harvest activities in timberlands
increase the occurrence of landslides
• In undisturbed forest, soil erosion is generally
negligible
Clearwater National Forest
• 1995-1996
– Major landslides occurred during the winter following
heavy rains, snowmelt, and high river flow
– Over 900 landslides were recorded on the unstable
slopes of the forest
– Landslide occurrence was widely distributed and
included artificial slopes such as road cuts and fills, or
natural slopes in clearcut areas
Landslide Data
• Within the large remote area, a DEM was
used to generate quantitative topographic
attributes
– Slope, elevation, aspect, profile curvature,
tangent curvature, plan curvature, flow path,
and contributing area
• Photo interpretation and field inventory
identified landslide areas
Considerations in Creating
Hazard Models
• Datasets combined and stored in a GIS database
• Hazard Model assumptions
– Strength of a model depends on the quality of the data
collected
– Data-driven models are not appropriate for extrapolating to
neighboring areas
– Climatic conditions may change so that the past is not an
indicator of the future
• Uncertainty exists when a hazard map is derived from a
statistically based model
Models Used in Study
• Logistic regression was used, which
correlated the environmental attributes
and landslide distribution
• Because of the existence of uncertainty, a
Receiver Operating Characteristic (ROC) curve was
used; it plots the proportion of false positives against
the true positives at each level of the criterion
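The construction of an ROC curve can be sketched as follows: each distinct probability is tried as a cut-off, and the false positive and true positive proportions are recorded. This is an illustrative sketch, not the study's procedure; the function name, labels, and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical cells: 1 = landslide present, 0 = absent, with modeled probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(round(fpr, 2), round(tpr, 2))
```

A decision maker can then pick the cut-off whose trade-off between false alarms and missed landslides is acceptable.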
Assessing Landslide Hazard
• Field inspection using a check list to identify sites susceptible
to landsliding
• Projection of future patterns of instability from analysis of
landslide inventories
• Multivariate analysis of factors characterizing observed sites
of slope instability
• Stability ranking based on criteria such as slope, land forms,
or geologic structure
• Failure probability analysis based on slope stability models
with stochastic hydrologic simulation
Preparing the Data
• Primary and secondary attributes are derived from a
30 m DEM, reducing the high cost of collecting the data
• Landslides assessed through aerial reconnaissance
• Landslide hazard areas are then identified based on
spatial correlation between the attributes derived from the
DEM
• ROC curves used for decision making
Data Sampling
• 15% of non-landslide cells were randomly sampled for an
absence of landslides
– Multivariate subset was derived from the coverages where
landslides were absent
• The landslide coverage was a point data set that sampled the grid
cells where landslides were present
• Both samples were joined together where the dependent
variable had a binary response (present or absent)
• Final output stored in ASCII and used in SAS
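The sampling scheme above can be sketched in a few lines. This is an illustration only: the grid, the cell coordinates, the function name, and the fixed random seed are hypothetical, and the study itself prepared its data in a GIS and analyzed it in SAS.

```python
import random

def build_sample(landslide_cells, non_landslide_cells, fraction=0.15, seed=0):
    """Join all landslide cells (1) with a random fraction of absence cells (0)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = round(fraction * len(non_landslide_cells))
    absent = rng.sample(non_landslide_cells, k)
    rows = [(cell, 1) for cell in landslide_cells]
    rows += [(cell, 0) for cell in absent]
    return rows

# Hypothetical 10 x 10 grid with two landslide cells
landslide_cells = [(2, 3), (5, 1)]
non_landslide_cells = [(r, c) for r in range(10) for c in range(10)
                       if (r, c) not in landslide_cells]
sample = build_sample(landslide_cells, non_landslide_cells)
print(len(sample))
```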
Statistical Analysis
• Normal plot of data to determine if the data followed a normal
distribution
– Plot showed that data points do not fall along a straight line. The data is
not multivariate normal
• Logistic regression is used
when the predictor variables
are not normally distributed,
and some predictor variables
are categorical
• Factor analysis was
applied to determine the
number of underlying variables
– Only significantly loaded variables were considered
Statistical Analysis
• The form of the logistic regression model is defined
as:
• Where x is the data vector for a randomly selected
experimental unit and y is the value of the binary
outcome variable. Maximum likelihood was used to
estimate B for the predictive equation
• Variables not significant at the .1 level were eliminated
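Maximum likelihood estimation of B can be illustrated with a small gradient-ascent routine. This sketch assumes the standard logistic form P(y = 1 | x) = exp(B'x) / (1 + exp(B'x)); the study used SAS, so the routine, its step size, and the toy data below are illustrative assumptions rather than the authors' method.

```python
import math

def sigmoid(z):
    """exp(z) / (1 + exp(z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, outcomes, steps=5000, rate=0.1):
    """Estimate B by gradient ascent on the mean log-likelihood."""
    b = [0.0] * (len(rows[0]) + 1)  # b[0] is the intercept
    for _ in range(steps):
        gradient = [0.0] * len(b)
        for x, y in zip(rows, outcomes):
            p = sigmoid(b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)))
            error = y - p  # derivative of the log-likelihood w.r.t. the logit
            gradient[0] += error
            for j, xj in enumerate(x, start=1):
                gradient[j] += error * xj
        b = [bj + rate * g / len(rows) for bj, g in zip(b, gradient)]
    return b

# Hypothetical single-predictor data with overlapping classes
rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
outcomes = [0, 0, 1, 0, 1, 1]
b = fit_logistic(rows, outcomes)
print([round(v, 3) for v in b])
```

Statistical packages normally use Newton-type iterations and also report standard errors, which is what makes the significance screening at the .1 level possible.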
Logit Results
• Logit showed that the most important variables
contributing to the slope instability were Flow
Path and mean slope of upland area
• log (p/(1-p)) = (-2.2642 + FACTOR8 * 0.4969 + FLPATH * 0.6039)
or
p = exp (-2.2642 + FACTOR8 * 0.4969 + FLPATH * 0.6039)/(1 + exp(-2.2642
+ FACTOR8 * 0.4969 + FLPATH * 0.6039))
______________________________________________________________
p – probability of landslide hazard
FACTOR8 – factor with underlying characteristics of aspect
FLPATH – Maximum distance of water to the point in the catchment
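The fitted equation above can be applied cell by cell to obtain a probability surface. A minimal sketch follows; the coefficients come from the slide, while the function name and the input values are hypothetical, since the scaling of FACTOR8 and FLPATH is not given here.

```python
import math

def landslide_probability(factor8, flpath):
    """p = exp(logit) / (1 + exp(logit)) with the coefficients from the slide."""
    logit = -2.2642 + 0.4969 * factor8 + 0.6039 * flpath
    return math.exp(logit) / (1.0 + math.exp(logit))

# Hypothetical predictor values for two grid cells
baseline = landslide_probability(0.0, 0.0)
higher = landslide_probability(1.0, 1.0)
print(round(baseline, 4), round(higher, 4))
```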
Logit Results
• Both predictor coefficients in the Logit model are positive.
Therefore, higher scores increase
the probability of landslide hazard.
• Logit model assumes a nonlinear relationship between
the probability and the explanatory variables
• Hazard map based on ROC curve technique groups the
hazard into two classes: Low Hazard and High Hazard; the
probability map shows five classes of probabilities of landslide hazard
Final Results
• 59.1% of the landslides and 69.8% of the
non-landslides were correctly determined
• Model can be applied to large geographic areas
• ROC curves are incorporated as a sophisticated
tool for decision makers for the spatial prediction
of landslide hazard
Figure: (a) Cut-off based on ROC curve technique; (b) Probability of
landslide hazard