geoimp

The goal of {geoimp} is to provide an R framework to geoimpute or pseudocode spatial point data. The reason you might want to do this is if your results from geocoding do not (fully) represent true locations. Geocoding results can be incomplete, inaccurate or aggregated to the closest areal polygon. Geoimputation is a method to make reasonable estimates of the true locations of faulty geocoded data by considering statistical assumptions or auxiliary data. The main functions include:

wt_strata: Weighted two-stage stratified random spatial sampling
wt_sample: Weighted spatial random sampling
wt_centroid: Weighted centroids

Installation

You can install the development version of geoimp like so:

# install.packages("pak")
pak::pkg_install("JsLth/geoimp")

Basic example

Imagine you have georeferenced survey data containing items on income and rurality.

library(geoimp)
library(sf)
library(terra)

dummy <- dummy_data()
dummy$survey
#> Simple feature collection with 50 features and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 1.166667 ymin: 1.166667 xmax: 2.833333 ymax: 2.833333
#> Projected CRS: +proj=tmerc +lat_0=0 +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
#> First 10 features:
#>    place_type    income                  geometry
#> 1     Village 3001-4000 POINT (2.833333 1.166667)
#> 2     Village 3001-4000 POINT (2.833333 1.166667)
#> 3     Village 1001-2000 POINT (2.833333 1.166667)
#> 4        City    <=1000 POINT (2.833333 1.166667)
#> 5     Village 2001-3000 POINT (2.833333 1.166667)
#> 6     Village    <=1000 POINT (2.833333 1.166667)
#> 7        City    <=1000 POINT (2.833333 1.166667)
#> 8        Town 3001-4000 POINT (2.833333 1.166667)
#> 9     Village     >4000 POINT (2.833333 1.166667)
#> 10       City    <=1000 POINT (2.833333 1.166667)

As you can see, the coordinates are all the same, x: 2 / y: 2. The reason for this is that in the original survey dataset, each respondent was only assigned an administrative area identifier. Each respondent was georeferenced by linking these administrative area by their identifiers. This is a common step in survey research that ensures anonymity of survey respondents. The downside is that it results in something like this:

library(ggplot2)

ggplot() +
  geom_sf(data = dummy$areas, fill = NA) +
  geom_sf(data = dummy$survey, aes(color = income), size = 2) +
  scale_color_discrete("Income") +
  theme_void()

Each coordinate is the centroid (the center point) of their assigned administrative area. This means you cannot make out any spatial patterns, you cannot map the results and, most importantly, you cannot perform any spatial analysis or spatial modeling. Also, looking at where people actually live, we see:

library(tidyterra)

ggplot() +
  geom_spatraster(data = dummy$pop, aes(fill = pop)) +
  geom_sf(data = dummy$survey, aes(color = income), size = 2) +
  scale_fill_viridis_c("Population") +
  scale_color_discrete("Income") +
  guides(fill = guide_legend(order = 1), color = guide_legend(order = 2)) +
  theme_void()

Neither of the centroids represent where the people in the survey live nor where people generally live. A traditional way to handle this is to compute a population-weighted centroid.

pop_centroid <- wt_centroid(dummy$areas, dummy$pop)

ggplot() +
  geom_spatraster(data = dummy$pop, aes(fill = pop)) +
  geom_sf(data = pop_centroid, size = 2) +
  scale_fill_viridis_c("Population") +
  theme_void()

This solves one problem: centroids having no relation to their underlying context. But the points are still stacked on top of each other. Another way could be to do spatial sampling using population weights.

set.seed(123)
pop_sample <- wt_strata(dummy$survey, dummy$pop, psu = dummy$areas, method = "random")

ggplot() +
  geom_spatraster(data = dummy$pop, aes(fill = pop)) +
  geom_sf(data = pop_sample, aes(color = place_type), size = 2) +
  scale_fill_viridis_c("Population") +
  scale_color_discrete("Place type") +
  theme_void()

Now each respondent is spread out and the randomness is sort of controlled by specifying population weights. This allows us to do spatial analysis. But is it really meaningful? After all, we can see that some respondents who said they live in a village are sampled near the city center. Clearly, knowing where people generally live is not always enough.

The wt_strata() function allows you to link external weights data with information from survey respondents. The goal is to sample people who say they live in low-density areas only in low-density areas. Or people who say they live in low-income areas only in low-income areas. Or other combinations. To do this, you need a “lookup table” that links contextual information to surveyed information. In other words, which population density corresponds to which value in the survey item?

lookup <- create_lookup(dummy$pop, dummy$survey$place_type)
lookup
#> Village = [50, 280)
#> Town    = [280, 500)
#> Suburb  = [500, 1700)
#> City    = [1700, 2500]

The default in create_lookup() is to use quantile breaks but it probably makes sense to consult existing typologies or base this off empirical results. When using this in wt_strata(), points are sampled in grid cells of the population raster ranging between 50 and 280 people if the survey respondent said they live in a village.

pop_geoimp <- wt_strata(
  dummy$survey,
  dummy$pop,
  psu = dummy$areas,
  ssu = dummy$survey$place_type,
  lookup = lookup,
  method = "random",
  replace = TRUE
)

ggplot() +
  geom_spatraster(data = dummy$pop, aes(fill = pop)) +
  geom_sf(data = pop_geoimp, aes(color = place_type), size = 2) +
  scale_fill_viridis_c("Population") +
  scale_color_discrete("Place type") +
  theme_void()

As you can see, the sampled points are not clustered around the population center anymore. Pink points (city dwellers) are clustered around the population center and orange points (rural dwellers) are clustered in the top-right corner where population density is low.

Of course, neither the raster nor the lookup table have to reference population data. The same could be done with any survey attribute that can be reasonably matched to any context data. For example, given a raster on income levels per cell:

as_lookup(list(
  "<=1000" = c(0, 1000),
  "1001-2000" = c(1000, 2000),
  "2001-3000" = c(2000, 3000),
  "3001-4000" = c(3000, 4000),
  ">4000" = c(4000, Inf)
), closure = "right")
#> <=1000    = [0, 1000]
#> 1001-2000 = (1000, 2000]
#> 2001-3000 = (2000, 3000]
#> 3001-4000 = (3000, 4000]
#> >4000     = (4000, Inf]

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
R		R
data		data
inst		inst
man		man
src		src
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
geoimp.Rproj		geoimp.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

geoimp

Installation

Basic example

About

Uh oh!

Releases

Packages

Languages

License

JsLth/geoimp

Folders and files

Latest commit

History

Repository files navigation

geoimp

Installation

Basic example

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages