Instead of estimating the density or intensity of points across an area, we could instead try and model where points will be using some statistical model combined with some spatial covariates.
The idea is that we model the intensity of points \(\lambda(\mathbf{x})\) at a location \(\mathbf{x}\) on a log scale with a linear predictor of spatial covariates \(v_i(\mathbf{x})\)
\[ \log{\lambda(\mathbf{x})} = \beta_0 + \beta_1 v_1(\mathbf{x}) + \cdots + \beta_k v_k(\mathbf{x}), \] where \(\beta_i\) are parameters to be estimated.
This is essentially a parametric model of an inhomogeneous Poisson process whose intensity is conditional on a set of spatial covariates.
Note that Poisson processes assume that there is no interaction between locations: These models are not appropriate for modelling an infectious disease where locations will cluster around each other.
We can model this sort of point pattern, and many more, using the
ppm
function in spatstat
(Baddeley, Rubak, and Turner 2015).
We’ll start by trying to model the case distribution of primary
biliary cirrhosis in north-eastern England collected between 1987 and
1994 from the pbc
dataset in sparr
:
library(tidyverse)
library(sf)
library(sparr)
data(pbc)
cases <- split(pbc)$case
cases.df <- as.data.frame(cases)
cases.win <- as.data.frame(pbc$window)
ggplot(data=cases.df) +
geom_point(aes(x=x, y=y), alpha=0.5) +
geom_polygon(data=cases.win, aes(x=x, y=y), fill='transparent', col='black') +
coord_fixed() +
theme_void()
We can see that this point pattern is clearly not homogeneous, but nonetheless, let’s try modelling it as such.
library(spatstat)
mod.const <- ppm(cases ~ 1)
mod.const
## Stationary Poisson process
## Fitted to point pattern dataset 'cases'
## Intensity: 0.09472343
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## log(lambda) -2.356794 0.03624997 -2.427843 -2.285745 *** -65.01506
We can see that it fits a stationery point process with intensity \(\lambda=0.0947\). We can get this from the summary table where we see \(\log\lambda = -2.357\), so that \(\lambda = \exp(-2.357) = 0.0947\).
This makes sense, as it is just the number of points divided by the area:
npoints = cases$n
area = st_as_sfc(pbc$window) |> st_area()
intensity = npoints / area
intensity
## [1] 0.09472343
A more useful model might try and use the spatial coordinates \(x\) and \(y\) to model the process. We might try a linear trend in \(x\) and \(y\) first:
mod.linear <- ppm(cases ~ x + y)
mod.linear
## Nonstationary Poisson process
## Fitted to point pattern dataset 'cases'
##
## Log intensity: ~x + y
##
## Fitted trend coefficients:
## (Intercept) x y
## -36.636173988 0.080226792 0.001614879
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) -36.636173988 1.720894422 -4.000907e+01 -33.263282901 *** -21.289031
## x 0.080226792 0.003009293 7.432869e-02 0.086124899 *** 26.659679
## y 0.001614879 0.001282186 -8.981586e-04 0.004127918 1.259474
We can see here that \(x\) seems to be important (low P-value), while \(y\) doesn’t. We could produce a plot of the resulting intensity to see how it looks:
pred.df <- mod.linear |> predict() |> as.data.frame()
ggplot(pred.df) +
geom_raster(aes(x=x, y=y, fill=value)) +
geom_point(data=cases.df, aes(x=x, y=y), shape="circle filled", alpha=0.5) +
coord_fixed() +
scale_fill_viridis_c() +
theme_void()
As expected, the fit isn’t very good. We might try a second order polynomial in \(x\) and \(y\) perhaps:
mod.quad <- ppm(cases ~ polynom(x, y, 2))
mod.quad |> predict() |> as.data.frame() |>
ggplot() +
geom_raster(aes(x=x, y=y, fill=value)) +
geom_point(data=cases.df, aes(x=x, y=y), shape="circle filled", alpha=0.5) +
coord_fixed() +
scale_fill_viridis_c() +
theme_void()
This is getting closer!
A much smarter thing to do is to use covariates that might actually be driving the process, rather than just modelling the intensity in terms of the coordinate locations \(x\) and \(y\).
In this case, we have the control data that we could use. We’ll need to generate a pixel image of the density (or intensity) of controls to use as a predictor:
controls <- split(pbc)$control
controls.dens <- bivariate.density(controls, adapt=TRUE, h0=3.5, verbose=FALSE)
## Registered S3 methods overwritten by 'spatstat.univar':
## method from
## mean.ecdf spatstat.geom
## mean.ewcdf spatstat.geom
## print.ewcdf spatstat.geom
## quantile.density
## quantile.ewcdf spatstat.geom
controls.im <- controls.dens$z
mod.controls <- ppm(cases ~ log(controls.im))
mod.controls
## Nonstationary Poisson process
## Fitted to point pattern dataset 'cases'
##
## Log intensity: ~log(controls.im)
##
## Fitted trend coefficients:
## (Intercept) log(controls.im)
## 9.165907 1.328802
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) 9.165907 0.25493426 8.666245 9.665569 *** 35.95400
## log(controls.im) 1.328802 0.03380425 1.262547 1.395058 *** 39.30874
As expected, there is a highly significant positive relationship between the log density (intensity) of controls and that of cases. This highlights the importance of taking into account the distribution of people when we look at the density of features that people might have (e.g. disease).