Imputation of missing data

Peter Ralph

12 January 2021 – Advanced Biological Statistics

Missing data

Sometimes the data isn’t all there.

What to do?

Options:

Restrict to non-missing observations.
Impute the missing data and fit a model.
Fit a model that also imputes the missing data.

Notes:

beware of informative missingness
“impute” is a fancy word for “guess”

A partially observed oscillator

The system itself does \[\begin{aligned} x_{t+1} - x_t &= \alpha y_t + \Normal(0, \sigma_{xy}) \\ y_{t+1} - y_t &= - \beta x_t + \Normal(0, \sigma_{xy}) \end{aligned}\] but we only get to observe \[\begin{aligned} X_t &= x_t + \Normal(0, \sigma_\epsilon) \\ Y_t &= y_t + \Normal(0, \sigma_\epsilon) . \end{aligned}\]

Even more realism?

Now what if we actually don’t observe most of the \(Y\) values?

Here’s what this looks like.

N <- 500
true_osc <- list(alpha=.1,
                 beta=.05,
                 sigma_xy=.05,
                 sigma_eps=.5)
xy <- matrix(nrow=N, ncol=2)
xy[1,] <- c(3,0)
for (k in 1:(N-1)) {
    xy[k+1,] <- (xy[k,] 
                + c(true_osc$alpha * xy[k,2],
                    (-1) * true_osc$beta * xy[k,1])
                + rnorm(2, 0, true_osc$sigma_xy))
}
XY <- xy + rnorm(N*2, 0, true_osc$sigma_eps)
obs_y <- sample.int(N, size=10)
XY[setdiff(1:N, obs_y), 2] <- NA

plot of chunk r plot_osc

A new Stan block

osc_block_missing <- "
data {
    int N;
    vector[N] X;
    int k; // number of observed Y
    int obs_y[k]; // which Y values are observed
    vector[k] Y;
}
parameters {
    real alpha;
    real beta;
    real<lower=0> sigma_xy;
    real<lower=0> sigma_eps;
    vector[N] x;
    vector[N] y;
}
model {
    x[1] ~ normal(0, 5);
    y[1] ~ normal(0, 5);
    x[2:N] ~ normal(x[1:(N-1)] + alpha * y[1:(N-1)], sigma_xy);
    y[2:N] ~ normal(y[1:(N-1)] - beta * x[1:(N-1)], sigma_xy);
    X ~ normal(x, sigma_eps);
    Y ~ normal(y[obs_y], sigma_eps);
    alpha ~ normal(0, 1);
    beta ~ normal(0, 1);
    sigma_xy ~ normal(0, 1);
    sigma_eps ~ normal(0, 1);
}
"
osc_model_missing <- stan_model(model_code=osc_block_missing)

osc_fit <- sampling(osc_model_missing,
                    data=list(N=N,
                              X=XY[,1],
                              k=length(obs_y),
                              obs_y=obs_y,
                              Y=XY[obs_y,2]),
                    iter=1000, chains=3,
                    control=list(max_treedepth=12))

## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: The largest R-hat is 1.23, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

How’d we do?

cbind(truth=c(true_osc$alpha, true_osc$beta, true_osc$sigma_xy, true_osc$sigma_eps),
      rstan::summary(osc_fit, pars=c("alpha", "beta", "sigma_xy", "sigma_eps"))$summary)

##           truth       mean      se_mean          sd       2.5%        25%        50%        75%      97.5%      n_eff     Rhat
## alpha      0.10 0.10006174 0.0009677287 0.006814326 0.08784278 0.09542431 0.09971968 0.10417984 0.11400593   49.58367 1.026429
## beta       0.05 0.04942227 0.0004721134 0.003324704 0.04285872 0.04721927 0.04940189 0.05159404 0.05592812   49.59219 1.029573
## sigma_xy   0.05 0.03829884 0.0024136398 0.008021550 0.02546288 0.03155093 0.03779207 0.04384452 0.05586678   11.04515 1.235094
## sigma_eps  0.50 0.50924896 0.0004712811 0.017190302 0.47754554 0.49682759 0.50867381 0.52081650 0.54386906 1330.47545 1.000318

Here is a density plot of 100 estimated trajectories (of x and y) from the Stan fit.

plot of chunk r show_osc_fit