Peter Ralph
23 February 2020 – Advanced Biological Statistics
Let’s build a conceptual model for descriptive analysis of “mixture” expression data.
Data: expression data from tissue samples that consist of various mixtures of different cell types.
Goal: identify shared coexpression patterns corresponding to cell type.
Similar situations: identify different developmental stages from whole-organism expression; common community structures from metagenomic data.
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
We are decomposing \(Z\) into the product of two lower-dimensional, nonnegative factors:
\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]
PCA finds \(w\) and \(z\) to minimize \[\begin{aligned} \sum_{ij} \| Z_{ij} - \sum_k w_{ik} x_{kj} \|^2 . \end{aligned}\]
In other words, it is the maximum-likelihood solution to \[\begin{aligned} Z_{ij} &\sim \Normal(\sum_k w_{ik} x_{kj}, \sigma^2) . \end{aligned}\] (The eigenvectors are the columns of \(x\), and the eigenvectors are related to the size of \(w\) and \(x\).)
stan_pca <- stan_model(model_code="
data {
int N; // samples
int L; // variables
int K; // factors
real Z[L,N];
}
parameters {
matrix[L,K] x;
matrix[K,N] w;
real<lower=0> sigma;
}
model {
for (j in 1:L) {
Z[j] ~ normal(x[j] * w, sigma);
}
}
")
(note: needs some priors to work well; see here.)
A random set of \(k\) proportions \(0 \le P_i \le 1\) has a \(\Dirichlet(\alpha_1, \ldots, \alpha_k)\) if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]
This is useful as a prior on proportions.
The mean is \[ \left( \frac{\alpha_1}{\sum_j \alpha_j}, \frac{\alpha_2}{\sum_j \alpha_j}, \cdots, \frac{\alpha_k}{\sum_j \alpha_j} \right) . \]
This generalizes the Beta: if \(X \sim \Beta(a, b)\) then \((X, 1-X) \sim \Dirichlet(a, b)\).
Marginal distributions are Beta distributed: \(P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)\).
If \(X_i \sim \Gam(\text{shape}=\alpha_i)\), and \[\begin{aligned} P_i = X_i / \sum_{j=1}^k X_j \end{aligned}\] then \(P \sim \Dirichlet(\alpha)\).
“The \(k\)-simplex” is the set of proportions, i.e., nonnegative numbers \(p\) satisfying \[\begin{aligned} p_1 + \cdots p_k = 1 . \end{aligned}\]
parameters {
simplex[K] p;
}
model {
p ~ dirichlet(alpha);
}
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
Mean expression by cell type.
Cell type proportions by sample.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
\(y_j\), \(\eta_j\) : mean and SD of expression of gene \(j\) across all cell types; shrink \(x_{kj}\) towards \(y_j\).
(omit this)
data {
int N; // # samples
int L; // # genes
int K; // # cell types
int Z[N,L];
}
parameters {
matrix<lower=0>[K,L] x;
vector<lower=0>[N] y;
vector<lower=0>[L] x_bar;
simplex[K] w[N];
real<lower=0> eta;
vector<lower=0>[K] alpha;
}
model {
matrix[N,L] mean_Z;
for (i in 1:N) {
for (j in 1:L) {
mean_Z[i,j] = 0.0;
for (k in 1:K) {
mean_Z[i,j] += y[i] * w[i][k] * x[k,j];
}
}
Z[i,] ~ poisson(mean_Z[i,]);
w[i] ~ dirichlet(alpha);
}
for (k in 1:K)
{ x[k,] ~ normal(x_bar, eta * x_bar); }
y ~ normal(0, 10);
alpha ~ exponential(1);
eta ~ normal(0, 4);
x_bar ~ gamma(0.5, 0.5e-4);
}
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(y_{i}\) : Total sequencing depth of sample \(i\).
\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]
sampling(nmf_model,
data=list(N=10,
L=5,
K=2,
Z=matrix(rpois(50, 100), ncol=5)),
chains=1, iter=100)
##
## SAMPLING FOR MODEL '64f2c322cbd474154082cded7e231b18' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 4.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.49 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 7
## Chain 1: adapt_window = 38
## Chain 1: term_buffer = 5
## Chain 1:
## Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
## Chain 1: Iteration: 100 / 100 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.414321 seconds (Warm-up)
## Chain 1: 0.677711 seconds (Sampling)
## Chain 1: 1.09203 seconds (Total)
## Chain 1:
## Warning: There were 9 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.13, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## Inference for Stan model: 64f2c322cbd474154082cded7e231b18.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## x[1,1] 12.78 0.49 3.07 8.74 10.58 12.21 14.19 18.69 39 0.99
## x[1,2] 13.02 0.59 3.22 9.59 10.72 12.72 14.03 20.46 30 1.00
## x[1,3] 12.71 0.54 3.08 9.35 10.30 12.15 13.93 19.03 32 0.99
## x[1,4] 12.28 0.58 3.23 8.17 10.30 11.55 13.63 19.67 31 1.00
## x[1,5] 13.18 0.57 3.16 9.22 10.98 12.63 14.60 20.18 31 1.00
## x[2,1] 14.02 1.44 5.26 9.54 11.16 12.48 14.96 26.44 13 1.07
## x[2,2] 12.51 0.60 2.76 8.36 10.78 12.02 14.42 18.40 21 1.02
## x[2,3] 12.97 0.93 4.19 8.67 10.17 12.00 13.92 25.12 20 1.05
## x[2,4] 12.60 0.81 3.86 8.04 10.21 11.84 14.12 21.85 23 1.02
## x[2,5] 12.90 0.70 3.35 8.14 10.33 12.62 14.82 21.22 23 1.02
## y[1] 8.60 0.30 1.92 5.15 7.24 8.45 10.04 11.72 41 0.99
## y[2] 8.91 0.30 1.76 5.68 7.86 8.95 10.24 12.14 35 0.99
## y[3] 8.62 0.31 1.80 5.15 7.62 8.73 9.94 12.09 35 1.00
## y[4] 7.70 0.29 1.58 4.68 6.86 7.51 8.97 10.54 30 1.00
## y[5] 8.14 0.32 1.72 4.89 7.21 8.19 9.24 11.32 30 1.00
## y[6] 8.19 0.27 1.62 4.95 7.19 8.22 9.27 11.28 36 1.01
## y[7] 8.50 0.28 1.70 5.07 7.51 8.61 9.65 11.50 38 0.98
## y[8] 8.15 0.29 1.68 4.96 7.09 8.00 9.40 11.10 32 1.01
## y[9] 7.81 0.32 1.68 4.62 6.62 7.88 9.05 10.89 28 1.00
## y[10] 8.51 0.34 1.84 5.12 7.46 8.53 9.99 11.63 29 0.99
## x_bar[1] 13.60 1.03 4.35 9.25 11.21 12.69 14.84 21.62 18 1.04
## x_bar[2] 12.83 0.46 2.88 8.41 10.79 12.39 14.27 18.60 39 0.99
## x_bar[3] 12.97 0.71 3.45 9.19 10.53 12.31 14.08 22.14 23 1.01
## x_bar[4] 12.41 0.59 3.58 7.65 10.19 11.99 13.78 22.98 36 1.01
## x_bar[5] 12.98 0.50 3.09 8.91 10.57 12.95 14.52 19.88 38 1.00
## w[1,1] 0.63 0.10 0.33 0.08 0.35 0.65 0.98 1.00 12 0.98
## w[1,2] 0.37 0.10 0.33 0.00 0.02 0.35 0.65 0.92 12 0.98
## w[2,1] 0.61 0.12 0.36 0.00 0.29 0.70 0.95 1.00 10 0.99
## w[2,2] 0.39 0.12 0.36 0.00 0.05 0.30 0.71 1.00 10 0.99
## w[3,1] 0.62 0.09 0.31 0.02 0.42 0.69 0.89 1.00 11 1.07
## w[3,2] 0.38 0.09 0.31 0.00 0.11 0.31 0.58 0.98 11 1.07
## w[4,1] 0.57 0.10 0.35 0.04 0.25 0.55 0.95 1.00 12 1.00
## w[4,2] 0.43 0.10 0.35 0.00 0.05 0.45 0.75 0.96 12 1.00
## w[5,1] 0.63 0.10 0.33 0.07 0.34 0.70 0.96 1.00 11 1.01
## w[5,2] 0.37 0.10 0.33 0.00 0.04 0.30 0.66 0.93 11 1.01
## w[6,1] 0.57 0.12 0.39 0.01 0.12 0.72 0.91 1.00 11 0.98
## w[6,2] 0.43 0.12 0.39 0.00 0.09 0.28 0.88 0.99 11 0.98
## w[7,1] 0.69 0.09 0.32 0.04 0.39 0.85 0.98 1.00 12 1.01
## w[7,2] 0.31 0.09 0.32 0.00 0.02 0.15 0.61 0.96 12 1.01
## w[8,1] 0.59 0.13 0.39 0.04 0.13 0.75 0.98 1.00 9 1.07
## w[8,2] 0.41 0.13 0.39 0.00 0.02 0.25 0.87 0.96 9 1.07
## w[9,1] 0.55 0.10 0.33 0.06 0.24 0.58 0.87 1.00 12 1.03
## w[9,2] 0.45 0.10 0.33 0.00 0.13 0.42 0.76 0.94 12 1.03
## w[10,1] 0.64 0.13 0.37 0.01 0.31 0.82 0.97 1.00 8 0.99
## w[10,2] 0.36 0.13 0.37 0.00 0.03 0.18 0.69 0.99 8 0.99
## eta 0.09 0.02 0.07 0.02 0.04 0.06 0.11 0.28 12 0.99
## alpha[1] 1.47 0.24 1.17 0.28 0.80 1.18 1.69 4.11 23 0.98
## alpha[2] 1.04 0.39 1.15 0.14 0.36 0.58 1.28 4.17 9 1.01
## lp__ 18315.88 2.80 8.28 18303.49 18310.11 18315.44 18320.62 18333.12 9 0.98
##
## Samples were drawn using NUTS(diag_e) at Wed Feb 24 22:02:30 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
How many cell types?
How many genes?
How many samples?
How much noise in expression?
How many genes distinguish cell types, and by how much relative to expression?
How much variation in mixture proportions?
How much variation in total coverage?
How many cell types? 5
How many genes? 500
How many samples? 100
How much noise in expression? 1–25,000 reads per gene, SD of 5%
How many genes distinguish cell types, and by how much relative to expression? 400 genes that differ by 2x
How much variation in mixture proportions? even +/- 25%
How much variation in total coverage? \(4 \times 10^6\) – \(60\times10^6\)
How many cell types? 23
How many genes? 10000
How many samples? 100
How much noise in expression? 1–25,000 reads per gene, SD of 25%
How many genes distinguish cell types, and by how much relative to expression? 400 genes that differ by 10%–1000%
How much variation in mixture proportions? skewed +/- 5%
How much variation in total coverage? \(0.5 \times 10^6\) – \(5\times10^6\)
Set up some parameters:
num_genes <- 500
num_cell_types <- 5
num_samples <- 100
num_diff_genes <- 400
mean_total_expression <- 50e6
sd_total_expression <- 5e6
gene_names <- paste0("gene_", apply(matrix(sample(letters, 5*num_genes, replace=TRUE), nrow=num_genes), 1, paste, collapse=''))
stopifnot(length(unique(gene_names)) == num_genes)
Mean expression profiles by cell type:
num_diff_genes / num_cell_type
genes specific to each cell type# x[k,j] is mean expression of gene j in cell type k.
x <- matrix(0, nrow=num_cell_types, ncol=num_genes)
colnames(x) <- gene_names
# vector of typical expression levels across *all* cell types
x_mean <- rgamma(num_genes, shape=0.5, scale=20000)
# which genes are differentially expressed in which cell type
diff_gene <- sample(0:num_cell_types, num_genes, replace=TRUE)
for (k in 1:num_cell_types) {
diffex <- which(diff_gene == k)
x[k,] <- x_mean
x[k,diffex] <- x[k,diffex] * runif(length(diffex), min=0, max=3)
}
Use gammas with shapes a vector of low integers; normalize to get Dirichlet.
# Z[i,j] is read counts of sample i for gene j
# this will have mean_Z[i,j] = (sum_k w[i,k] x[k,j])
mean_Z <- (w %*% x)
# but we want total sequencing to be y[i] so...
mean_Z <- y * mean_Z / rowSums(mean_Z)
stopifnot(all(abs(y - rowSums(mean_Z)) < 1e-8))
Z <- rpois(length(mean_Z), lambda=mean_Z)
dim(Z) <- dim(mean_Z)
Do we have signal?
We do a very good job of estimating mixture proportions, \(w\):
opt_w <- fit_sim1$par[grepl("^w", names(fit_sim1$par))]
dim(opt_w) <- c(num_samples, num_cell_types)
cor(w, opt_w)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.04398499 -0.23072233 -0.02532198 -0.2307791 0.4644048
## [2,] -0.33333225 -0.65294813 0.61497455 0.7567144 -0.0962400
## [3,] -0.30185250 0.82656532 -0.58861296 0.4169296 -0.6288291
## [4,] 0.99283533 -0.09510431 -0.52816584 -0.4149401 -0.3930452
## [5,] -0.35484884 0.18070186 0.54327762 -0.5177444 0.6602966
We are estimating overall mean expression less well:
And, similarly, we estimate cell-type-specific expression less well:
opt_x <- fit_sim1$par[grepl("^x\\[", names(fit_sim1$par))]
dim(opt_x) <- c(num_cell_types, num_genes)
cor(t(x), t(opt_x))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.3983108 0.3010583 0.4495956 0.3085280 0.4835641
## [2,] 0.3434644 0.2691097 0.4813357 0.4447535 0.3832638
## [3,] 0.3243645 0.4231296 0.3911760 0.4909628 0.2937164
## [4,] 0.7844388 0.2425806 0.3137577 0.2074952 0.2878479
## [5,] 0.3420303 0.3312768 0.4304676 0.2183988 0.5614074