Peter Ralph
12 January 2021 – Advanced Biological Statistics
Sometimes the data isn’t all there.
What to do?
Options:
Restrict to non-missing observations.
Impute the missing data and fit a model.
Fit a model that also imputes the missing data.
Notes:
beware of informative missingness
“impute” is a fancy word for “guess”
The system itself does \[\begin{aligned} x_{t+1} - x_t &= \alpha y_t + \Normal(0, \sigma_{xy}) \\ y_{t+1} - y_t &= - \beta x_t + \Normal(0, \sigma_{xy}) \end{aligned}\] but we only get to observe \[\begin{aligned} X_t &= x_t + \Normal(0, \sigma_\epsilon) \\ Y_t &= y_t + \Normal(0, \sigma_\epsilon) . \end{aligned}\]
Now what if we actually don’t observe most of the \(Y\) values?
Here’s what this looks like.
N <- 500
true_osc <- list(alpha=.1,
beta=.05,
sigma_xy=.05,
sigma_eps=.5)
xy <- matrix(nrow=N, ncol=2)
xy[1,] <- c(3,0)
for (k in 1:(N-1)) {
xy[k+1,] <- (xy[k,]
+ c(true_osc$alpha * xy[k,2],
(-1) * true_osc$beta * xy[k,1])
+ rnorm(2, 0, true_osc$sigma_xy))
}
XY <- xy + rnorm(N*2, 0, true_osc$sigma_eps)
obs_y <- sample.int(N, size=10)
XY[setdiff(1:N, obs_y), 2] <- NA
osc_block_missing <- "
data {
int N;
vector[N] X;
int k; // number of observed Y
int obs_y[k]; // which Y values are observed
vector[k] Y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma_xy;
real<lower=0> sigma_eps;
vector[N] x;
vector[N] y;
}
model {
x[1] ~ normal(0, 5);
y[1] ~ normal(0, 5);
x[2:N] ~ normal(x[1:(N-1)] + alpha * y[1:(N-1)], sigma_xy);
y[2:N] ~ normal(y[1:(N-1)] - beta * x[1:(N-1)], sigma_xy);
X ~ normal(x, sigma_eps);
Y ~ normal(y[obs_y], sigma_eps);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma_xy ~ normal(0, 1);
sigma_eps ~ normal(0, 1);
}
"
osc_model_missing <- stan_model(model_code=osc_block_missing)
osc_fit <- sampling(osc_model_missing,
data=list(N=N,
X=XY[,1],
k=length(obs_y),
obs_y=obs_y,
Y=XY[obs_y,2]),
iter=1000, chains=3,
control=list(max_treedepth=12))
## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.23, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
cbind(truth=c(true_osc$alpha, true_osc$beta, true_osc$sigma_xy, true_osc$sigma_eps),
rstan::summary(osc_fit, pars=c("alpha", "beta", "sigma_xy", "sigma_eps"))$summary)
## truth mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 0.10 0.10006174 0.0009677287 0.006814326 0.08784278 0.09542431 0.09971968 0.10417984 0.11400593 49.58367 1.026429
## beta 0.05 0.04942227 0.0004721134 0.003324704 0.04285872 0.04721927 0.04940189 0.05159404 0.05592812 49.59219 1.029573
## sigma_xy 0.05 0.03829884 0.0024136398 0.008021550 0.02546288 0.03155093 0.03779207 0.04384452 0.05586678 11.04515 1.235094
## sigma_eps 0.50 0.50924896 0.0004712811 0.017190302 0.47754554 0.49682759 0.50867381 0.52081650 0.54386906 1330.47545 1.000318
Here is a density plot of 100 estimated trajectories (of x
and y
) from the Stan fit.