Peter Ralph
8 March 2021 – Advanced Biological Statistics
::: {.columns} ::::::: {.column width=60%}
Suppose we have a bunch of points
\[ (x_1, y_1), \ldots, (x_n, y_n) \]
with density \(f(x,y)\), i.e.,
\[ f(x,y) \approx \frac{\#(\text{of points within $r$ of $(x,y)$})}{\pi r^2} \]
Using
\[ \rho_\sigma(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}} \]
we can estimate \(f(x,y)\) by
\[ \hat f(x,y) = \sum_{i=1}^n \rho_\sigma(x_i-x, y_i-y) . \]
The kernel, \(\rho()\).
(we used a Gaussian)
The bandwidth, \(\sigma\).
Simulate 200 spatial locations with a density having two “bumps”. Plot these points. (rnorm(n, mean=-3), rnorm(n, mean=+3)
)
Make a \(20 \times 20\) grid of “reference” locations. (expand.grid( )
)
Compute the kernel density estimate for each location, with \(\sigma\) chosen appropriately, and plot it. (dnorm( )
)
bkde2D
library(KernSmooth)
grid_n <- 301
start_smooth <- bkde2D(bt_start@coords, bandwidth=0.004, gridsize=c(grid_n, grid_n))
str(start_smooth)
## List of 3
## $ x1 : num [1:301] -123 -123 -123 -123 -123 ...
## $ x2 : num [1:301] 45.5 45.5 45.5 45.5 45.5 ...
## $ fhat: num [1:301, 1:301] 0.00 0.00 1.08e-14 4.68e-16 1.78e-14 ...
image( )
and contour( )
There are various automatic methods.
But… crossvalidation!
For each bandwidth:
… but wait, what’s this mean, here?
For each bandwidth:
Why log?:
If \(f\) and \(g\) are probability distributions, and \(x_1, \ldots, x_n\) are drawn from \(f\), then \[ L(g) = \sum_{i=1}^n \log g(x_i) \lessapprox L(f) , \] i.e., is maximized for \(g=f\). (see: cross entropy)
n <- 400
xy <- cbind(rnorm(n, mean=rep(c(-3,3), times=c(n/2,n/2))), rnorm(n))
xygrid <- expand.grid(x=seq(-5,5,length.out=40),
y=seq(-5,5,length.out=40))
xy_fhat <- rep(NA, nrow(xygrid))
sigma <- 0.25
for (k in 1:nrow(xygrid)) {
xy_fhat[k] <- sum( exp(-( (xy[,1] - xygrid[k,1])^2 + (xy[,2] - xygrid[k,2])^2 )/(2*sigma^2) ) / (2 * pi * sigma^2 ) )
}
plot(xy, asp=1, pch=20)
points(xygrid[,1], xygrid[,2], cex=xy_fhat/6)
library(fields)
the_xval <- function (bw, points, ntest, grid_n=301) {
use_these <- sample(rep(c(TRUE, FALSE), times=c(nrow(xy)-ntest, ntest)))
smooth <- bkde2D(xy[use_these,], bandwidth=bw, gridsize=c(grid_n, grid_n))
names(smooth) <- c("x", "y", "z")
pred <- interp.surface(smooth, loc=xy[!use_these,])
return(mean(log(pred), na.rm=TRUE))
}
density_xval <- function (bw, points, ntest, grid_n=301) {
xy <- points@coords
use_these <- sample(rep(c(TRUE, FALSE), times=c(nrow(xy)-ntest, ntest)))
smooth <- bkde2D(xy[use_these,], bandwidth=bw, gridsize=c(grid_n, grid_n))
names(smooth) <- c("x", "y", "z")
pred <- interp.surface(smooth, loc=xy[!use_these,])
return(mean(log(pred), na.rm=TRUE))
}
lo_dur <- loess(Duration ~ Start_Latitude * Start_Longitude, data=bt, span=0.1)
pred_lats <- seq(min(bt$Start_Latitude), max(bt$Start_Latitude), length.out=101)
pred_lons <- seq(min(bt$Start_Longitude), max(bt$Start_Longitude), length.out=101)
pred_locs <- expand.grid(Start_Latitude=pred_lats,
Start_Longitude=pred_lons)
pred_duration <- predict(lo_dur, newdata=pred_locs)
Why?
Suggest another method.