Peter Ralph
25 February 2020 – Advanced Biological Statistics
There are many dimension reduction methods, e.g.:
PCA uses the covariance matrix, which measures similarity.
t-SNE begins with the matrix of distances, measuring dissimilarity.
Are distances interpretable?
metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.
non-metric: In t-SNE, distances within different clusters are not comparable.
From ordination.okstate.edu, about ordination in ecology:
Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.
Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brain likes to categorize things. This doesn’t mean those clusters are well-separated in reality.
In data/passages.txt we have a number of short passages from a few different books.
Can we identify the authors of each passage?
The true sources of the passages are in data/passage_sources.tsv.
passages <- readLines("data/passages.txt")
sources <- read.table("data/passage_sources.tsv", header=TRUE)
# all distinct words appearing anywhere in the passages
words <- sort(unique(strsplit(paste(passages, collapse=" "), " +")[[1]]))
# count how many times each word in w occurs in the string x
tabwords <- function (x, w) { tabulate(match(strsplit(x, " ")[[1]], w), nbins=length(w)) }
# word-by-passage count matrix: rows are words, columns are passages
wordmat <- sapply(passages, tabwords, words)
dimnames(wordmat) <- list(words, NULL)
# sanity check: every word appears at least once somewhere
stopifnot( min(rowSums(wordmat)) > 0 )
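For example, tabwords() counts how many times each word of w appears in the string x (a quick check with a made-up sentence):
tabwords("the cat sat on the mat", c("the", "cat", "dog"))
## [1] 2 1 0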
wordmat[1:20, 1:20]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## a 14 12 9 11 11 15 10 3 8 17 7 17 13 5 6 5 12 10 23 18
## abaft 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandoned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandonedly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abased 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abasement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abashed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abated 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abatement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abating 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbeyland 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abednego 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhor 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhorred 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhorrence 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
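The four tables below list words ordered by their scores on PC2 (most negative, then most positive) and on PC3 (most negative, then most positive). These scores could be computed, for instance, by running PCA on the word-count matrix; a minimal sketch (the exact code used to produce these tables is not shown):
word_pca <- prcomp(wordmat)   # rows = words, columns = passages
scores <- word_pca$x[, 1:3]   # word scores on the first three PCs
# e.g., the words with the most negative PC2 scores:
head(scores[order(scores[, "PC2"]), ], 50)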
## PC1 PC2 PC3
## her -556.24697 -376.66196 283.396687
## to -1270.20317 -336.47252 4.169367
## she -372.74644 -269.55284 163.877453
## i -514.91246 -245.13026 -394.853974
## was -513.83419 -152.39714 176.601851
## not -351.47853 -121.32016 -44.912289
## be -341.68255 -119.99988 -59.975481
## you -300.09087 -118.90884 -230.908876
## had -296.31303 -100.38637 75.452498
## my -171.84572 -80.64843 -127.486206
## have -222.31283 -79.25235 -70.780515
## could -136.49462 -72.98578 42.671723
## your -104.28700 -59.78119 -83.419656
## for -372.71061 -59.60076 -33.283563
## elinor -72.15892 -57.66297 10.588410
## very -125.58949 -57.24809 -14.892789
## he -361.28662 -56.14820 9.890203
## as -389.82491 -52.10055 -9.719115
## been -132.36123 -50.83112 -1.425306
## it -478.64137 -50.35323 -91.722138
## would -139.21705 -49.48644 -6.323483
## him -215.14216 -47.74894 -9.871702
## me -123.09605 -43.78291 -101.675112
## am -60.50262 -43.60555 -49.158100
## said -93.08164 -43.15431 -37.542601
## herself -53.32406 -38.80712 34.261952
## marianne -52.25503 -38.26098 19.176196
## elizabeth -54.97617 -36.57203 14.782603
## no -138.64966 -35.48866 -14.964627
## miss -54.50430 -34.79724 5.829458
## sister -41.60392 -34.61997 5.844703
## them -137.06636 -34.17118 23.712412
## much -80.45020 -34.10671 4.931646
## mrs -45.20696 -33.69341 16.753606
## than -92.32421 -31.29817 4.833951
## such -102.18795 -30.28968 -11.610290
## which -172.77462 -30.17763 25.637991
## will -93.89414 -30.10624 -52.565814
## what -108.30480 -30.07389 -20.049484
## know -48.93996 -29.61414 -24.294954
## think -50.18130 -29.21047 -23.983101
## mr -46.19352 -28.27125 8.627980
## should -64.09034 -28.15082 -19.489807
## every -79.87694 -27.05275 14.036726
## did -68.23860 -25.24442 -4.521364
## mother -34.73709 -24.32705 11.336758
## so -197.55549 -23.69154 -13.565798
## must -75.70688 -23.18022 -16.982349
## more -113.30870 -22.23955 2.774334
## soon -48.73482 -22.10919 8.360629
## PC1 PC2 PC3
## the -1916.652323 644.17430 49.1763152
## whale -52.891925 71.73432 -13.0230081
## a -755.404311 62.27914 -65.2045656
## in -722.824139 56.81296 -2.9339225
## this -188.221698 55.26514 -44.0778875
## sea -24.183753 33.77098 -3.9215106
## whales -22.694393 33.18955 -3.4244265
## ahab -25.206923 32.79111 -4.3134355
## like -50.714502 31.59465 -10.2154201
## project -22.742369 31.25730 -13.2635817
## of -1269.014052 30.64074 54.2754589
## is -242.853145 30.61007 -128.4552553
## or -144.225179 30.16771 -21.7889034
## upon -54.271357 27.94857 -13.2215291
## one -126.850098 27.80594 -23.2263139
## his -419.112175 26.42307 44.6188432
## into -59.746708 25.35032 -5.3742671
## ship -19.381626 24.95806 -1.7182029
## old -29.523296 24.38154 -8.6070208
## boat -14.219788 22.29072 0.2084533
## these -39.427626 21.79437 -5.5106702
## all -236.300641 21.65010 -10.3527562
## captain -17.720521 20.63213 -3.8314659
## gutenbergtm -14.116430 20.63043 -9.4765720
## up -54.232661 19.93845 -4.4427531
## work -18.565428 19.14096 -10.6252198
## ye -14.954290 18.97349 -6.4293591
## from -191.072800 18.74055 20.0424965
## white -13.387233 18.53059 -0.8166223
## its -55.796354 18.35525 0.4599616
## then -71.691915 17.94492 -14.1305972
## out -68.072694 17.84982 -10.4037919
## round -18.302379 17.75297 -0.5098285
## thou -13.983667 17.19486 -8.7470726
## sperm -10.761954 17.15970 -2.9207274
## boats -10.385095 16.91503 -0.1120691
## head -21.715054 16.03585 -6.0853671
## down -38.166398 15.66476 -3.2617045
## yet -34.040677 15.41227 -4.3243774
## now -85.720599 14.95566 -8.1757894
## those -27.166364 14.64213 -1.9675383
## over -42.775127 14.46286 -1.2238253
## water -9.885997 13.82870 -1.8877527
## stubb -10.768649 13.46884 -3.6896725
## three -27.952696 13.39254 1.9711720
## some -85.864399 13.27762 -5.4547689
## through -21.897151 12.49487 -2.3030856
## ships -7.887251 12.22273 -1.3926845
## men -17.173346 12.15763 -0.2554109
## deck -6.975571 12.07884 -0.1660215
## PC1 PC2 PC3
## i -514.91246 -245.1302600 -394.85397
## you -300.09087 -118.9088378 -230.90888
## is -242.85314 30.6100710 -128.45526
## my -171.84572 -80.6484270 -127.48621
## me -123.09605 -43.7829134 -101.67511
## it -478.64137 -50.3532256 -91.72214
## your -104.28700 -59.7811913 -83.41966
## that -531.14942 -16.8334810 -73.80576
## have -222.31283 -79.2523538 -70.78052
## a -755.40431 62.2791376 -65.20457
## be -341.68255 -119.9998843 -59.97548
## will -93.89414 -30.1062364 -52.56581
## are -94.25073 10.5288597 -50.90685
## am -60.50262 -43.6055513 -49.15810
## not -351.47853 -121.3201614 -44.91229
## but -299.62259 -14.9274654 -44.37056
## this -188.22170 55.2651388 -44.07789
## if -101.45680 -6.3350358 -41.68779
## can -50.48943 -11.1903031 -38.03132
## do -75.56163 -20.5992068 -37.93423
## said -93.08164 -43.1543107 -37.54260
## we -73.71796 -2.3519195 -37.43557
## may -55.30176 -6.5217965 -35.40211
## has -61.01115 -14.1493917 -33.52528
## for -372.71061 -59.6007600 -33.28356
## know -48.93996 -29.6141445 -24.29495
## think -50.18130 -29.2104663 -23.98310
## one -126.85010 27.8059425 -23.22631
## or -144.22518 30.1677123 -21.78890
## shall -34.48396 -16.1281710 -21.67439
## what -108.30480 -30.0738928 -20.04948
## should -64.09034 -28.1508156 -19.48981
## us -36.17303 -3.1244047 -19.28163
## myself -23.74875 -15.0035251 -18.15643
## say -47.40917 -16.6219980 -17.94563
## must -75.70688 -23.1802189 -16.98235
## there -108.07781 8.9534402 -16.41797
## no -138.64966 -35.4886628 -14.96463
## very -125.58949 -57.2480930 -14.89279
## sure -27.11296 -20.5349678 -14.27811
## then -71.69192 17.9449190 -14.13060
## cannot -22.18292 -8.1427644 -13.96285
## here -29.06951 10.3644003 -13.95399
## so -197.55549 -23.6915411 -13.56580
## project -22.74237 31.2572971 -13.26358
## upon -54.27136 27.9485689 -13.22153
## whale -52.89193 71.7343208 -13.02301
## replied -19.88805 -18.1660195 -11.84924
## our -27.26869 -0.3535046 -11.65976
## such -102.18795 -30.2896785 -11.61029
## PC1 PC2 PC3
## her -556.246972 -376.661961 283.396687
## was -513.834189 -152.397142 176.601851
## she -372.746435 -269.552842 163.877453
## had -296.313029 -100.386371 75.452498
## of -1269.014052 30.640737 54.275459
## and -1236.908556 -13.864700 51.947817
## their -152.508489 -11.880643 51.480039
## the -1916.652323 644.174305 49.176315
## his -419.112175 26.423066 44.618843
## could -136.494621 -72.985780 42.671723
## were -156.375686 -15.088511 36.137252
## herself -53.324062 -38.807121 34.261952
## they -162.783297 -21.665918 25.891373
## which -172.774622 -30.177627 25.637991
## them -137.066360 -34.171177 23.712412
## by -238.708656 -2.633784 20.905141
## from -191.072800 18.740545 20.042497
## marianne -52.255035 -38.260976 19.176196
## mrs -45.206955 -33.693408 16.753606
## elizabeth -54.976167 -36.572034 14.782603
## every -79.876944 -27.052755 14.036726
## though -75.096366 -5.387347 12.358646
## on -225.699345 -17.010502 12.186196
## with -351.596172 -6.711234 12.014149
## mother -34.737090 -24.327053 11.336758
## elinor -72.158924 -57.662975 10.588410
## himself -40.103069 -3.197128 9.897340
## he -361.286622 -56.148199 9.890203
## dashwood -23.080458 -16.948564 9.585248
## lady -36.988480 -21.984896 9.570461
## house -30.031621 -10.141911 8.998615
## who -83.496441 -14.021101 8.672551
## mr -46.193515 -28.271252 8.627980
## still -37.592911 10.127417 8.567423
## soon -48.734824 -22.109185 8.360629
## an -113.603979 -6.283913 8.155292
## might -56.944947 -18.040714 7.811206
## jennings -20.539493 -17.547511 7.571813
## daughter -11.259362 -8.972879 7.396481
## home -15.680722 -5.854447 7.172435
## at -274.445973 -16.376403 6.589121
## before -63.920522 -6.681053 6.582418
## john -16.418885 -9.354251 6.528224
## sisters -21.029068 -16.415105 6.396043
## attention -15.247172 -11.031729 6.386811
## jane -23.591496 -18.286776 6.224460
## without -53.519726 -13.183584 6.125022
## mariannes -6.613075 -6.442404 6.093399
## felt -22.366896 -14.025896 6.046527
## object -11.237776 -3.892136 6.012015
Let’s build a conceptual model for descriptive analysis of “mixture” expression data.
Data: expression data from tissue samples that consist of various mixtures of different cell types.
Goal: identify shared coexpression patterns corresponding to cell type.
Similar situations: identify different developmental stages from whole-organism expression; common community structures from metagenomic data.
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
We are decomposing \(Z\) into the product of two lower-dimensional, nonnegative factors:
\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]
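For instance, with \(n\) samples, \(p\) genes, and \(K\) cell types, the factorization has these shapes in R (a dimensions-only sketch with made-up sizes):
n <- 100; p <- 500; K <- 5
w <- matrix(rgamma(n * K, 1), nrow=n)
w <- w / rowSums(w)                   # n x K: mixture proportions; rows sum to 1
x <- matrix(rgamma(K * p, 2), nrow=K) # K x p: mean expression by cell type
Z_mean <- w %*% x                     # n x p: the approximation to the data
dim(Z_mean)
## [1] 100 500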
PCA finds \(w\) and \(x\) to minimize \[\begin{aligned} \sum_{ij} \| Z_{ij} - \sum_k w_{ik} x_{kj} \|^2 . \end{aligned}\]
In other words, it is the maximum-likelihood solution to \[\begin{aligned} Z_{ij} &\sim \Normal(\sum_k w_{ik} x_{kj}, \sigma^2) . \end{aligned}\] (The eigenvectors are the columns of \(x\), and the eigenvalues are related to the sizes of \(w\) and \(x\).)
stan_pca <- stan_model(model_code="
data {
  int N; // samples
  int L; // variables
  int K; // factors
  real Z[L,N];
}
parameters {
  matrix[L,K] x;       // columns of x are the factors (loadings on the L variables)
  matrix[K,N] w;       // w[,i]: factor scores of sample i
  real<lower=0> sigma; // residual SD
}
model {
  for (j in 1:L) {
    Z[j] ~ normal(x[j] * w, sigma);
  }
}
")
(note: needs some priors to work well; see here.)
A random set of \(k\) proportions \(0 \le P_i \le 1\) has a \(\Dirichlet(\alpha_1, \ldots, \alpha_k)\) distribution if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]
This is useful as a prior on proportions.
The mean is \[ \left( \frac{\alpha_1}{\sum_j \alpha_j}, \frac{\alpha_2}{\sum_j \alpha_j}, \cdots, \frac{\alpha_k}{\sum_j \alpha_j} \right) . \]
This generalizes the Beta: if \(X \sim \Beta(a, b)\) then \((X, 1-X) \sim \Dirichlet(a, b)\).
Marginal distributions are Beta distributed: \(P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)\).
If \(X_i \sim \Gam(\text{shape}=\alpha_i)\), and \[\begin{aligned} P_i = X_i / \sum_{j=1}^k X_j \end{aligned}\] then \(P \sim \Dirichlet(\alpha)\).
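For example, this Gamma construction gives a quick way to simulate Dirichlet draws in R (a sketch; packages such as gtools also provide an rdirichlet()):
# simulate n draws from Dirichlet(alpha) by normalizing independent Gammas
rdirich <- function (n, alpha) {
    x <- matrix(rgamma(n * length(alpha), shape=alpha),
                ncol=length(alpha), byrow=TRUE)
    x / rowSums(x)
}
p <- rdirich(1000, alpha=c(1, 2, 3))
colMeans(p)  # should be close to alpha / sum(alpha) = (1/6, 2/6, 3/6)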
“The \(k\)-simplex” is the set of proportions, i.e., nonnegative numbers \(p\) satisfying \[\begin{aligned} p_1 + \cdots + p_k = 1 . \end{aligned}\]
parameters {
  simplex[K] p;
}
model {
  p ~ dirichlet(alpha);
}
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
Mean expression by cell type.
Cell type proportions by sample.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
\(y_j\), \(\eta_j\) : mean and SD of expression of gene \(j\) across all cell types; shrink \(x_{kj}\) towards \(y_j\).
data {
  int N; // # samples
  int L; // # genes
  int K; // # cell types
  int Z[N,L]; // Z[i,j]: read counts of gene j in sample i
}
parameters {
  matrix<lower=0>[L,K] x;   // x[j,k]: (scaled) expression of gene j in cell type k
  vector[L] y;              // y[j]: overall mean expression of gene j
  simplex[K] w[N];          // w[i]: cell type proportions of sample i
  vector<lower=0>[L] eta;   // eta[j]: per-gene scale (SD of expression)
  vector<lower=0>[K] alpha; // mean cell type proportions across samples
  real<lower=0> d_alpha;    // concentration of proportions about alpha
}
model {
  for (i in 1:N) {
    Z[i] ~ poisson(eta .* (x * w[i]));
    w[i] ~ dirichlet(d_alpha * alpha);
  }
  for (j in 1:K) {
    x[,j] ~ normal(y ./ eta, 1);  // shrink expression in each cell type towards y
  }
  y ~ normal(0, 20);
  alpha ~ normal(0, 1);
  d_alpha ~ exponential(0.2);
  eta ~ cauchy(0, 10);
}
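The sampling() call below uses a compiled model object, nmf_model, which isn't created in the code shown; presumably something like the following, where nmf_code is a hypothetical name for a string holding the Stan program above:
library(rstan)
nmf_model <- stan_model(model_code=nmf_code)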
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]
sampling(nmf_model,
data=list(N=10,
L=5,
K=2,
Z=matrix(rpois(50, 100), ncol=5)),
chains=1, iter=100)
##
## SAMPLING FOR MODEL '027c9a0b92082dc735bfb6350861f7ee' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.8e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.38 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 7
## Chain 1: adapt_window = 38
## Chain 1: term_buffer = 5
## Chain 1:
## Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
## Chain 1: Iteration: 100 / 100 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.645456 seconds (Warm-up)
## Chain 1: 0.724525 seconds (Sampling)
## Chain 1: 1.36998 seconds (Total)
## Chain 1:
## Warning: There were 31 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.51, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## Inference for Stan model: 027c9a0b92082dc735bfb6350861f7ee.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## x[1,1] 1.70 0.32 1.16 0.11 0.91 1.55 2.11 4.54 14 1.02
## x[1,2] 1.36 0.24 0.96 0.17 0.54 1.04 2.10 3.23 16 1.09
## x[2,1] 1.40 0.17 0.83 0.05 0.76 1.32 1.98 3.15 25 0.98
## x[2,2] 1.04 0.17 0.76 0.03 0.39 0.85 1.52 2.54 20 1.02
## x[3,1] 1.46 0.15 0.72 0.36 0.97 1.34 1.66 3.25 24 0.98
## x[3,2] 1.33 0.13 0.80 0.31 0.79 1.21 1.63 3.30 36 0.99
## x[4,1] 1.65 0.32 1.11 0.24 1.01 1.49 2.00 4.37 12 1.06
## x[4,2] 1.50 0.25 1.00 0.29 0.72 1.26 2.04 3.68 16 1.22
## x[5,1] 1.21 0.21 0.77 0.23 0.60 1.01 1.80 2.67 14 1.19
## x[5,2] 0.98 0.15 0.81 0.09 0.33 0.69 1.58 2.48 31 0.98
## y[1] 22.18 9.39 24.04 -15.51 5.02 17.74 38.62 68.42 7 1.28
## y[2] 14.92 6.30 17.30 -35.90 10.08 17.12 23.87 42.64 8 1.14
## y[3] 21.47 5.53 16.05 -3.37 8.59 23.05 33.59 48.79 8 1.10
## y[4] 17.65 9.52 22.42 -22.42 3.22 17.35 27.23 61.54 6 1.48
## y[5] 7.17 7.67 16.53 -30.15 -4.25 8.73 21.07 29.44 5 1.32
## w[1,1] 0.66 0.17 0.38 0.01 0.19 0.91 0.96 1.00 5 1.05
## w[1,2] 0.34 0.17 0.38 0.00 0.04 0.09 0.81 0.99 5 1.05
## w[2,1] 0.61 0.14 0.33 0.03 0.31 0.75 0.89 1.00 5 1.05
## w[2,2] 0.39 0.14 0.33 0.00 0.11 0.25 0.69 0.97 5 1.05
## w[3,1] 0.64 0.16 0.38 0.00 0.29 0.84 0.96 0.99 5 1.06
## w[3,2] 0.36 0.16 0.38 0.01 0.04 0.16 0.71 1.00 5 1.06
## w[4,1] 0.62 0.15 0.36 0.01 0.20 0.80 0.91 1.00 6 1.06
## w[4,2] 0.38 0.15 0.36 0.00 0.09 0.20 0.80 0.99 6 1.06
## w[5,1] 0.64 0.15 0.35 0.01 0.30 0.80 0.95 0.99 5 1.06
## w[5,2] 0.36 0.15 0.35 0.01 0.05 0.20 0.70 0.99 5 1.06
## w[6,1] 0.64 0.16 0.38 0.01 0.21 0.85 0.95 1.00 5 1.06
## w[6,2] 0.36 0.16 0.38 0.00 0.05 0.15 0.79 0.99 5 1.06
## w[7,1] 0.65 0.17 0.38 0.01 0.20 0.85 0.97 1.00 5 1.06
## w[7,2] 0.35 0.17 0.38 0.00 0.03 0.15 0.80 0.99 5 1.06
## w[8,1] 0.66 0.16 0.37 0.00 0.24 0.89 0.94 1.00 5 1.08
## w[8,2] 0.34 0.16 0.37 0.00 0.06 0.11 0.76 1.00 5 1.08
## w[9,1] 0.65 0.18 0.40 0.01 0.18 0.90 0.98 1.00 5 1.06
## w[9,2] 0.35 0.18 0.40 0.00 0.02 0.10 0.82 0.99 5 1.06
## w[10,1] 0.64 0.16 0.36 0.04 0.29 0.83 0.94 0.99 5 1.07
## w[10,2] 0.36 0.16 0.36 0.01 0.06 0.17 0.71 0.96 5 1.07
## eta[1] 86.34 13.14 96.26 22.72 45.77 59.16 88.58 290.04 54 0.99
## eta[2] 85.78 12.59 63.46 33.04 48.14 66.29 90.28 240.62 25 1.05
## eta[3] 87.22 6.01 55.41 28.69 56.57 71.68 100.63 219.86 85 0.99
## eta[4] 66.51 8.88 40.74 20.36 44.61 60.32 78.44 191.69 21 1.12
## eta[5] 116.24 11.21 84.55 34.96 56.82 94.13 144.99 315.12 57 1.05
## alpha[1] 1.13 0.24 0.74 0.06 0.57 1.22 1.66 2.63 10 1.06
## alpha[2] 0.51 0.19 0.55 0.06 0.13 0.23 0.81 1.77 9 0.98
## d_alpha 10.31 1.17 7.34 2.58 4.83 8.50 13.39 29.95 39 0.99
## lp__ 17330.43 2.49 8.25 17316.13 17325.34 17329.97 17335.56 17347.45 11 1.09
##
## Samples were drawn using NUTS(diag_e) at Tue Mar 3 21:19:26 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
How many cell types?
How many genes?
How many samples?
How much noise in expression?
How many genes distinguish cell types, and by how much relative to expression?
How many “noisy” genes? How many “noisy” samples?
How much variation in mixture proportions?
num_genes <- 500
num_cell_types <- 5
num_samples <- 100
num_diff_genes <- 400
mean_total_expression <- 50e6
sd_total_expression <- 5e6
gene_names <- paste0("gene_", apply(matrix(sample(letters, 5*num_genes, replace=TRUE), nrow=num_genes), 1, paste, collapse=''))
stopifnot(length(unique(gene_names)) == num_genes)
# mean expression profiles by cell type:
# x[k,j] is mean expression of gene j in cell type k.
# we will have roughly num_diff_genes / num_cell_types genes specific to each cell type,
# and each of these differing genes will have some randomly chosen expression level
x <- matrix(0, nrow=num_cell_types, ncol=num_genes)
colnames(x) <- gene_names
# vector of typical expression levels across *all* cell types
x_mean <- rgamma(num_genes, shape=0.5, scale=20000)
# which cell type (if any) each gene is specific to: 0 means none (not differentially expressed)
diff_gene <- sample(0:num_cell_types, num_genes, replace=TRUE)
for (k in 1:num_cell_types) {
    diffex <- which(diff_gene == k)
    x[k,] <- x_mean
    x[k,diffex] <- x[k,diffex] * runif(length(diffex), min=0, max=3)
    # x[k,diffex] <- 10 * x[k,diffex]
}
# cell type proportions
# w[i,k] is proportion of sample i of cell type k.
# use gammas with shapes=vector of low integers;
# normalize to get Dirichlet
w <- matrix(0, nrow=num_samples, ncol=num_cell_types)
for (i in 1:num_samples) {
    dirich <- rgamma(num_cell_types, rep(1, num_cell_types))
    w[i,] <- dirich/sum(dirich)
}
stopifnot(all(abs(rowSums(w) - 1) < 1e-14))
# put these together to make total expression level
# Z[i,j] will be read counts of sample i for gene j
# 1. scale mean expression based on total expression level
# 2. matrix multiply mean scaled levels by proportions (x * w)
# 3. actual expression levels should be (poisson? normal?)
# this will have mean_Z[i,j] = (sum_k w[i,k] x[k,j])
mean_Z <- (w %*% x)
# but we want the total sequencing depth of sample i to be y[i], so rescale
# (y is not defined above; presumably it is the per-sample total read count,
#  drawn here using the totals set at the top)
y <- rnorm(num_samples, mean=mean_total_expression, sd=sd_total_expression)
mean_Z <- y * mean_Z / rowSums(mean_Z)
stopifnot(all(abs(y - rowSums(mean_Z)) < 1e-8))
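A minimal way to finish step 3 above and generate the simulated counts (assuming Poisson noise, one of the options mentioned in the comments):
# simulated read counts: Poisson noise around the scaled mean expression levels
Z <- matrix(rpois(num_samples * num_genes, lambda=mean_Z),
            nrow=num_samples, ncol=num_genes)
colnames(Z) <- gene_names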
class_model_code2 <- "
parameters {
  simplex[num_cell_types] w[num_samples];
  // w[i][k] is proportion of sample i that is cell type k
  matrix<lower=0>[num_cell_types, num_genes] x;
  // x[k,j] is typical expression level of gene j in cell type k
  vector<lower=0>[num_samples] y;
  // y[i] is total expression level of sample i
  vector<lower=0>[num_genes] x_mean;
  // x_mean[j] is 'typical' expression level of gene j across all cell types
  real<lower=0> eta; // SD of x about x_mean
  vector<lower=0>[num_cell_types] alpha; // mean of dirichlet
}"
class_model_code3 <- "
model {
  matrix[num_samples, num_genes] mean_Z; // mean_Z[i,j] is 'expected' expression of gene j in sample i
  for (i in 1:num_samples) {
    for (j in 1:num_genes) {
      // matrix mult version?
      // mean_Z[i,j] = y[i] * (w[i] * x[,j]); // the first * is normal; the second is matrix mult
      // for loop version:
      mean_Z[i,j] = 0.0;
      for (k in 1:num_cell_types) {
        mean_Z[i,j] += y[i] * w[i][k] * x[k,j];
      }
    }
    Z[i,] ~ poisson(mean_Z[i,]);
  }
  y ~ normal(0, 10);
  for (k in 1:num_cell_types) {
    x[k,] ~ normal(x_mean, eta * x_mean); // maybe this should be lognormal
  }
  for (i in 1:num_samples) {
    w[i] ~ dirichlet(alpha);
  }
  alpha ~ exponential(1);
  eta ~ normal(0, 4);
  x_mean ~ gamma(0.5, 0.5e-4);
}
"
We do a very good job of estimating mixture proportions, \(w\):
opt_w <- class_fit$par[grepl("^w", names(class_fit$par))]
dim(opt_w) <- c(num_samples, num_cell_types)
cor(w, opt_w)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.55251630 -0.1093521 -0.73624936 -0.35751026 0.98772093
## [2,] 0.92630791 -0.5112326 0.09646298 -0.10969093 -0.28923879
## [3,] -0.08445503 -0.5009004 -0.33497043 0.95348430 -0.07801488
## [4,] -0.42205844 0.8740198 -0.07369501 -0.44996998 0.04533575
## [5,] 0.02055711 0.2733963 0.88992645 -0.07613433 -0.51310686
We are estimating overall mean expression less well:
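For instance, a sketch of how one might pull out the estimated overall expression levels and compare them to the simulated truth (following the grepl() extraction pattern used below):
opt_x_mean <- class_fit$par[grepl("^x_mean", names(class_fit$par))]
plot(x_mean, opt_x_mean, xlab="true x_mean", ylab="estimated x_mean")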
And, similarly, we estimate cell-type-specific expression less well:
opt_x <- class_fit$par[grepl("^x\\[", names(class_fit$par))]
dim(opt_x) <- c(num_cell_types, num_genes)
cor(t(x), t(opt_x))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2826979 0.4820000 0.3362926 0.2461818 0.5468449
## [2,] 0.6127233 0.4143438 0.3468935 0.2641375 0.2284147
## [3,] 0.3908400 0.4193718 0.3373547 0.4889322 0.2725535
## [4,] 0.3134202 0.6438825 0.3642748 0.2071764 0.2544726
## [5,] 0.3798235 0.4908773 0.4712599 0.2872747 0.2357471