Modelling point patterns

Instead of estimating the density or intensity of points across an area, we could instead try and model where points will be using some statistical model combined with some spatial covariates.

The idea is that we model the intensity of points \(\lambda(\mathbf{x})\) at a location \(\mathbf{x}\) on a log scale with a linear predictor of spatial covariates \(v_i(\mathbf{x})\)

\[ \log{\lambda(\mathbf{x})} = \beta_0 + \beta_1 v_1(\mathbf{x}) + \cdots + \beta_k v_k(\mathbf{x}), \] where \(\beta_i\) are parameters to be estimated.

This is essentially a parametric model of an inhomogeneous Poisson process whose intensity is conditional on a set of spatial covariates.

Note that Poisson processes assume that there is no interaction between locations: These models are not appropriate for modelling an infectious disease where locations will cluster around each other.

We can model this sort of point pattern, and many more, using the ppm function in spatstat (Baddeley, Rubak, and Turner 2015).

Data

We’ll start by trying to model the case distribution of primary biliary cirrhosis in north-eastern England collected between 1987 and 1994 from the pbc dataset in sparr:

library(tidyverse)
library(sf)
library(sparr)
data(pbc)
cases <- split(pbc)$case

cases.df <- as.data.frame(cases)
cases.win <- as.data.frame(pbc$window)

ggplot(data=cases.df) +
  geom_point(aes(x=x, y=y), alpha=0.5) +
  geom_polygon(data=cases.win, aes(x=x, y=y), fill='transparent', col='black') +
  coord_fixed() +
  theme_void()

Model with a homogeneous Poisson process

We can see that this point pattern is clearly not homogeneous, but nonetheless, let’s try modelling it as such.

library(spatstat)

mod.const <- ppm(cases ~ 1)
mod.const

## Stationary Poisson process
## Fitted to point pattern dataset 'cases'
## Intensity: 0.09472343
##              Estimate       S.E.   CI95.lo   CI95.hi Ztest      Zval
## log(lambda) -2.356794 0.03624997 -2.427843 -2.285745   *** -65.01506

We can see that it fits a stationery point process with intensity \(\lambda=0.0947\). We can get this from the summary table where we see \(\log\lambda = -2.357\), so that \(\lambda = \exp(-2.357) = 0.0947\).

This makes sense, as it is just the number of points divided by the area:

npoints = cases$n
area = st_as_sfc(pbc$window) |> st_area()

intensity = npoints / area
intensity

## [1] 0.09472343

Models that depend on location

A more useful model might try and use the spatial coordinates \(x\) and \(y\) to model the process. We might try a linear trend in \(x\) and \(y\) first:

mod.linear <- ppm(cases ~ x + y)
mod.linear

## Nonstationary Poisson process
## Fitted to point pattern dataset 'cases'
## 
## Log intensity:  ~x + y
## 
## Fitted trend coefficients:
##   (Intercept)             x             y 
## -36.636173988   0.080226792   0.001614879 
## 
##                  Estimate        S.E.       CI95.lo       CI95.hi Ztest       Zval
## (Intercept) -36.636173988 1.720894422 -4.000907e+01 -33.263282901   *** -21.289031
## x             0.080226792 0.003009293  7.432869e-02   0.086124899   ***  26.659679
## y             0.001614879 0.001282186 -8.981586e-04   0.004127918         1.259474

We can see here that \(x\) seems to be important (low P-value), while \(y\) doesn’t. We could produce a plot of the resulting intensity to see how it looks:

pred.df <- mod.linear |> predict() |> as.data.frame()

ggplot(pred.df) +
  geom_raster(aes(x=x, y=y, fill=value)) +
  geom_point(data=cases.df, aes(x=x, y=y), shape="circle filled", alpha=0.5) +
  coord_fixed() +
  scale_fill_viridis_c() +
  theme_void()

As expected, the fit isn’t very good. We might try a second order polynomial in \(x\) and \(y\) perhaps:

mod.quad <- ppm(cases ~ polynom(x, y, 2))

mod.quad |> predict() |> as.data.frame() |>
  ggplot() +
  geom_raster(aes(x=x, y=y, fill=value)) +
  geom_point(data=cases.df, aes(x=x, y=y), shape="circle filled", alpha=0.5) +
  coord_fixed() +
  scale_fill_viridis_c() +
  theme_void()

This is getting closer!

Modelling with covariates

A much smarter thing to do is to use covariates that might actually be driving the process, rather than just modelling the intensity in terms of the coordinate locations \(x\) and \(y\).

In this case, we have the control data that we could use. We’ll need to generate a pixel image of the density (or intensity) of controls to use as a predictor:

controls <- split(pbc)$control
controls.dens <- bivariate.density(controls, adapt=TRUE, h0=3.5, verbose=FALSE)

## Registered S3 methods overwritten by 'spatstat.univar':
##   method           from         
##   mean.ecdf        spatstat.geom
##   mean.ewcdf       spatstat.geom
##   print.ewcdf      spatstat.geom
##   quantile.density              
##   quantile.ewcdf   spatstat.geom

controls.im <- controls.dens$z

mod.controls <- ppm(cases ~ log(controls.im))
mod.controls

## Nonstationary Poisson process
## Fitted to point pattern dataset 'cases'
## 
## Log intensity:  ~log(controls.im)
## 
## Fitted trend coefficients:
##      (Intercept) log(controls.im) 
##         9.165907         1.328802 
## 
##                  Estimate       S.E.  CI95.lo  CI95.hi Ztest     Zval
## (Intercept)      9.165907 0.25493426 8.666245 9.665569   *** 35.95400
## log(controls.im) 1.328802 0.03380425 1.262547 1.395058   *** 39.30874

As expected, there is a highly significant positive relationship between the log density (intensity) of controls and that of cases. This highlights the importance of taking into account the distribution of people when we look at the density of features that people might have (e.g. disease).

References