The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
A covariate often varies within each areal unit —
elevation, distance to a road, pollution from point sources. The usual
shortcut is to average it over each polygon and put the average into a
Poisson model. Under the nonlinear log link this is
biased. This tutorial shows why, and how
SDALGCP2 does it correctly. It is self-contained.
A region’s expected count under the log-Gaussian Cox process is \[
\mathbb{E}[Y_i\mid S] \;=\; m_i\int_{A_i} w_i(x)\,\exp\{z(x)^\top\beta +
S(x)\}\,dx
\;\approx\; m_i\sum_{k} w_{ik}\,\exp\{z(x_{ik})^\top\beta + S(x_{ik})\},
\] a sum over candidate points \(x_{ik}\) inside \(A_i\) with weights \(w_{ik}\). The covariate enters
inside the exponential. Factor it out: \[
\mathbb{E}[Y_i\mid S] \;\approx\; m_i\,e^{\,b_i(\beta)}\sum_k
c_{ik}\,e^{S(x_{ik})},
\qquad
\boxed{\,b_i(\beta)=\log\sum_k w_{ik}\,\exp\{z(x_{ik})^\top\beta\}\,}
\] The region’s covariate contribution is the
log-sum-exp \(b_i(\beta)\), not \((\bar z_i)^\top\beta\) with \(\bar z_i\) the areal mean. By Jensen’s
inequality the two differ whenever \(z\) varies within the region, and a
regression on \(\bar z_i\) is biased
for \(\beta\) — badly so when sharp
features (e.g. point sources) sit inside regions. Full derivation: math/raster-covariates-derivation.pdf.
A covariate z with sharp peaks (think pollution
sources), supplied as a terra::SpatRaster, plus outcome
counts over an 8×8 lattice. The peaks sit inside regions, so
the region mean loses them:
library(SDALGCP2)
library(sf)
library(terra)
set.seed(3)
# a raster covariate with 14 sharp Gaussian peaks
r <- rast(xmin = 0, xmax = 20, ymin = 0, ymax = 20, resolution = 0.08)
xy <- xyFromCell(r, 1:ncell(r))
src <- matrix(runif(2 * 14, 1, 19), ncol = 2)
values(r) <- rowSums(sapply(seq_len(nrow(src)), function(s)
3.2 * exp(-((xy[, 1] - src[s, 1])^2 + (xy[, 2] - src[s, 2])^2) / 0.5)))
names(r) <- "z"
sh <- st_sf(geometry = st_make_grid(
st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 20, ymax = 20))), n = c(8, 8)))
N <- nrow(sh)
# simulate counts from the point-level intensity model (true beta_z = 1)
pts <- sda_points(sh, delta = 0.5, method = 3); w <- lapply(pts, function(p) p$weight)
Z <- lapply(pts, function(p) cbind(1, terra::extract(r, as.matrix(p$xy))[, "z"]))
b_true <- sapply(seq_len(N), function(i) log(sum(w[[i]] * exp(as.numeric(Z[[i]] %*% c(-6, 1))))))
sh$pop <- round(runif(N, 2000, 6000))
sh$cases <- rpois(N, sh$pop * exp(b_true))
sh$zbar <- sapply(seq_len(N), function(i) sum(w[[i]] * Z[[i]][, 2])) # areal mean| The raster covariate (varies within regions) | What areal averaging keeps |
|---|---|
The right panel — the region mean — has washed the peaks out.
The naive analysis regresses counts on the areal mean:
naive <- glm(cases ~ zbar + offset(log(pop)), poisson, st_drop_geometry(sh))
coef(naive)["zbar"]
#> 1.67 <- 67% too largeSDALGCP2 instead reads z from the raster at
the candidate points and uses the log-sum-exp offset. Pass the raster;
sh does not even need a z column:
fit <- sdalgcp(cases ~ z + offset(log(pop)), data = sh, rasters = r)
coef(naive)["zbar"]; fit$beta_opt["z"]#> Estimate of beta_z:
#> truth 1.00
#> naive areal average 1.67 (+67% bias)
#> SDALGCP2 (intensity-scale) 1.09
Averaging the predictor over polygons overstates the effect by two-thirds; aggregating on the intensity scale recovers it.
Because \(b_i(\beta)\approx\bar z_i^\top\beta+\tfrac12\beta^\top\widehat{\mathrm{Var}}_i(z)\beta\), the bias is driven by the within-region variance of \(z\). It is large when that variance is correlated with the region mean — sharp, localised features such as point sources. For smooth covariates the two approaches nearly agree, and either is fine.
As in Tutorial 1, the fit yields relative risk and covariate-adjusted relative risk with standard errors and exceedance, both discrete and continuous:
plot(fit, "adjusted_rr") # covariate-adjusted relative risk
plot(fit, "exceedance", threshold = 1.5)sdalgcp_control(tilt_spatial = TRUE) additionally tilts
the spatial correlation by the covariate intensity (the fully tilted
model of the PDF); the default keeps the correlation covariate-free and
is faster. ```
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.