Peter Ralph
25 February 2020 – Advanced Biological Statistics
There are many dimension reduction methods, e.g.:
PCA uses the covariance matrix, which measures similarity.
t-SNE begins with the matrix of distances, measuring dissimilarity.
Are distances interpretable?
metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.
non-metric: In t-SNE, distances within different clusters are not comparable.
From ordination.okstate.edu, about ordination in ecology:
Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.
Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brain likes to categorize things. This doesn’t mean those clusters are well-separated in reality.
In data/passages.txt we have a number of short passages from a few different books.
Can we identify the authors of each passage?
The true sources of the passages are in data/passage_sources.tsv.
passages <- readLines("data/passages.txt")
sources <- read.table("data/passage_sources.tsv", header=TRUE)
# all distinct words appearing anywhere in the passages
words <- sort(unique(strsplit(paste(passages, collapse=" "), " +")[[1]]))
# count how many times each word in w occurs in the string x
tabwords <- function (x, w) { tabulate(match(strsplit(x, " ")[[1]], w), nbins=length(w)) }
# word-by-passage count matrix: rows are words, columns are passages
wordmat <- sapply(passages, tabwords, words)
dimnames(wordmat) <- list(words, NULL)
# sanity check: every word appears at least once somewhere
stopifnot( min(rowSums(wordmat)) > 0 )
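For example, tabwords() counts how many times each word of w appears in the string x (a quick check with a made-up sentence):
tabwords("the cat sat on the mat", c("the", "cat", "dog"))
## [1] 2 1 0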
wordmat[1:20, 1:20]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## a 14 12 9 11 11 15 10 3 8 17 7 17 13 5 6 5 12 10 23 18
## abaft 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandoned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandonedly 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abased 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abasement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abashed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abated 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abatement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abating 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbeyland 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abednego 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhor 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhorred 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhorrence 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
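The four tables below list words ordered by their scores on PC2 (most negative, then most positive) and on PC3 (most negative, then most positive). These scores could be computed, for instance, by running PCA on the word-count matrix; a minimal sketch (the exact code used to produce these tables is not shown):
word_pca <- prcomp(wordmat)   # rows = words, columns = passages
scores <- word_pca$x[, 1:3]   # word scores on the first three PCs
# e.g., the words with the most negative PC2 scores:
head(scores[order(scores[, "PC2"]), ], 50)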
## PC1 PC2 PC3
## her -556.24697 -376.66196 283.396687
## to -1270.20317 -336.47252 4.169367
## she -372.74644 -269.55284 163.877453
## i -514.91246 -245.13026 -394.853974
## was -513.83419 -152.39714 176.601851
## not -351.47853 -121.32016 -44.912289
## be -341.68255 -119.99988 -59.975481
## you -300.09087 -118.90884 -230.908876
## had -296.31303 -100.38637 75.452498
## my -171.84572 -80.64843 -127.486206
## have -222.31283 -79.25235 -70.780515
## could -136.49462 -72.98578 42.671723
## your -104.28700 -59.78119 -83.419656
## for -372.71061 -59.60076 -33.283563
## elinor -72.15892 -57.66297 10.588410
## very -125.58949 -57.24809 -14.892789
## he -361.28662 -56.14820 9.890203
## as -389.82491 -52.10055 -9.719115
## been -132.36123 -50.83112 -1.425306
## it -478.64137 -50.35323 -91.722138
## would -139.21705 -49.48644 -6.323483
## him -215.14216 -47.74894 -9.871702
## me -123.09605 -43.78291 -101.675112
## am -60.50262 -43.60555 -49.158100
## said -93.08164 -43.15431 -37.542601
## herself -53.32406 -38.80712 34.261952
## marianne -52.25503 -38.26098 19.176196
## elizabeth -54.97617 -36.57203 14.782603
## no -138.64966 -35.48866 -14.964627
## miss -54.50430 -34.79724 5.829458
## sister -41.60392 -34.61997 5.844703
## them -137.06636 -34.17118 23.712412
## much -80.45020 -34.10671 4.931646
## mrs -45.20696 -33.69341 16.753606
## than -92.32421 -31.29817 4.833951
## such -102.18795 -30.28968 -11.610290
## which -172.77462 -30.17763 25.637991
## will -93.89414 -30.10624 -52.565814
## what -108.30480 -30.07389 -20.049484
## know -48.93996 -29.61414 -24.294954
## think -50.18130 -29.21047 -23.983101
## mr -46.19352 -28.27125 8.627980
## should -64.09034 -28.15082 -19.489807
## every -79.87694 -27.05275 14.036726
## did -68.23860 -25.24442 -4.521364
## mother -34.73709 -24.32705 11.336758
## so -197.55549 -23.69154 -13.565798
## must -75.70688 -23.18022 -16.982349
## more -113.30870 -22.23955 2.774334
## soon -48.73482 -22.10919 8.360629
## PC1 PC2 PC3
## the -1916.652323 644.17430 49.1763152
## whale -52.891925 71.73432 -13.0230081
## a -755.404311 62.27914 -65.2045656
## in -722.824139 56.81296 -2.9339225
## this -188.221698 55.26514 -44.0778875
## sea -24.183753 33.77098 -3.9215106
## whales -22.694393 33.18955 -3.4244265
## ahab -25.206923 32.79111 -4.3134355
## like -50.714502 31.59465 -10.2154201
## project -22.742369 31.25730 -13.2635817
## of -1269.014052 30.64074 54.2754589
## is -242.853145 30.61007 -128.4552553
## or -144.225179 30.16771 -21.7889034
## upon -54.271357 27.94857 -13.2215291
## one -126.850098 27.80594 -23.2263139
## his -419.112175 26.42307 44.6188432
## into -59.746708 25.35032 -5.3742671
## ship -19.381626 24.95806 -1.7182029
## old -29.523296 24.38154 -8.6070208
## boat -14.219788 22.29072 0.2084533
## these -39.427626 21.79437 -5.5106702
## all -236.300641 21.65010 -10.3527562
## captain -17.720521 20.63213 -3.8314659
## gutenbergtm -14.116430 20.63043 -9.4765720
## up -54.232661 19.93845 -4.4427531
## work -18.565428 19.14096 -10.6252198
## ye -14.954290 18.97349 -6.4293591
## from -191.072800 18.74055 20.0424965
## white -13.387233 18.53059 -0.8166223
## its -55.796354 18.35525 0.4599616
## then -71.691915 17.94492 -14.1305972
## out -68.072694 17.84982 -10.4037919
## round -18.302379 17.75297 -0.5098285
## thou -13.983667 17.19486 -8.7470726
## sperm -10.761954 17.15970 -2.9207274
## boats -10.385095 16.91503 -0.1120691
## head -21.715054 16.03585 -6.0853671
## down -38.166398 15.66476 -3.2617045
## yet -34.040677 15.41227 -4.3243774
## now -85.720599 14.95566 -8.1757894
## those -27.166364 14.64213 -1.9675383
## over -42.775127 14.46286 -1.2238253
## water -9.885997 13.82870 -1.8877527
## stubb -10.768649 13.46884 -3.6896725
## three -27.952696 13.39254 1.9711720
## some -85.864399 13.27762 -5.4547689
## through -21.897151 12.49487 -2.3030856
## ships -7.887251 12.22273 -1.3926845
## men -17.173346 12.15763 -0.2554109
## deck -6.975571 12.07884 -0.1660215
## PC1 PC2 PC3
## i -514.91246 -245.1302600 -394.85397
## you -300.09087 -118.9088378 -230.90888
## is -242.85314 30.6100710 -128.45526
## my -171.84572 -80.6484270 -127.48621
## me -123.09605 -43.7829134 -101.67511
## it -478.64137 -50.3532256 -91.72214
## your -104.28700 -59.7811913 -83.41966
## that -531.14942 -16.8334810 -73.80576
## have -222.31283 -79.2523538 -70.78052
## a -755.40431 62.2791376 -65.20457
## be -341.68255 -119.9998843 -59.97548
## will -93.89414 -30.1062364 -52.56581
## are -94.25073 10.5288597 -50.90685
## am -60.50262 -43.6055513 -49.15810
## not -351.47853 -121.3201614 -44.91229
## but -299.62259 -14.9274654 -44.37056
## this -188.22170 55.2651388 -44.07789
## if -101.45680 -6.3350358 -41.68779
## can -50.48943 -11.1903031 -38.03132
## do -75.56163 -20.5992068 -37.93423
## said -93.08164 -43.1543107 -37.54260
## we -73.71796 -2.3519195 -37.43557
## may -55.30176 -6.5217965 -35.40211
## has -61.01115 -14.1493917 -33.52528
## for -372.71061 -59.6007600 -33.28356
## know -48.93996 -29.6141445 -24.29495
## think -50.18130 -29.2104663 -23.98310
## one -126.85010 27.8059425 -23.22631
## or -144.22518 30.1677123 -21.78890
## shall -34.48396 -16.1281710 -21.67439
## what -108.30480 -30.0738928 -20.04948
## should -64.09034 -28.1508156 -19.48981
## us -36.17303 -3.1244047 -19.28163
## myself -23.74875 -15.0035251 -18.15643
## say -47.40917 -16.6219980 -17.94563
## must -75.70688 -23.1802189 -16.98235
## there -108.07781 8.9534402 -16.41797
## no -138.64966 -35.4886628 -14.96463
## very -125.58949 -57.2480930 -14.89279
## sure -27.11296 -20.5349678 -14.27811
## then -71.69192 17.9449190 -14.13060
## cannot -22.18292 -8.1427644 -13.96285
## here -29.06951 10.3644003 -13.95399
## so -197.55549 -23.6915411 -13.56580
## project -22.74237 31.2572971 -13.26358
## upon -54.27136 27.9485689 -13.22153
## whale -52.89193 71.7343208 -13.02301
## replied -19.88805 -18.1660195 -11.84924
## our -27.26869 -0.3535046 -11.65976
## such -102.18795 -30.2896785 -11.61029
## PC1 PC2 PC3
## her -556.246972 -376.661961 283.396687
## was -513.834189 -152.397142 176.601851
## she -372.746435 -269.552842 163.877453
## had -296.313029 -100.386371 75.452498
## of -1269.014052 30.640737 54.275459
## and -1236.908556 -13.864700 51.947817
## their -152.508489 -11.880643 51.480039
## the -1916.652323 644.174305 49.176315
## his -419.112175 26.423066 44.618843
## could -136.494621 -72.985780 42.671723
## were -156.375686 -15.088511 36.137252
## herself -53.324062 -38.807121 34.261952
## they -162.783297 -21.665918 25.891373
## which -172.774622 -30.177627 25.637991
## them -137.066360 -34.171177 23.712412
## by -238.708656 -2.633784 20.905141
## from -191.072800 18.740545 20.042497
## marianne -52.255035 -38.260976 19.176196
## mrs -45.206955 -33.693408 16.753606
## elizabeth -54.976167 -36.572034 14.782603
## every -79.876944 -27.052755 14.036726
## though -75.096366 -5.387347 12.358646
## on -225.699345 -17.010502 12.186196
## with -351.596172 -6.711234 12.014149
## mother -34.737090 -24.327053 11.336758
## elinor -72.158924 -57.662975 10.588410
## himself -40.103069 -3.197128 9.897340
## he -361.286622 -56.148199 9.890203
## dashwood -23.080458 -16.948564 9.585248
## lady -36.988480 -21.984896 9.570461
## house -30.031621 -10.141911 8.998615
## who -83.496441 -14.021101 8.672551
## mr -46.193515 -28.271252 8.627980
## still -37.592911 10.127417 8.567423
## soon -48.734824 -22.109185 8.360629
## an -113.603979 -6.283913 8.155292
## might -56.944947 -18.040714 7.811206
## jennings -20.539493 -17.547511 7.571813
## daughter -11.259362 -8.972879 7.396481
## home -15.680722 -5.854447 7.172435
## at -274.445973 -16.376403 6.589121
## before -63.920522 -6.681053 6.582418
## john -16.418885 -9.354251 6.528224
## sisters -21.029068 -16.415105 6.396043
## attention -15.247172 -11.031729 6.386811
## jane -23.591496 -18.286776 6.224460
## without -53.519726 -13.183584 6.125022
## mariannes -6.613075 -6.442404 6.093399
## felt -22.366896 -14.025896 6.046527
## object -11.237776 -3.892136 6.012015
Let’s build a conceptual model for descriptive analysis of “mixture” expression data.
Data: expression data from tissue samples that consist of various mixtures of different cell types.
Goal: identify shared coexpression patterns corresponding to cell type.
Similar situations: identify different developmental stages from whole-organism expression; common community structures from metagenomic data.
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
We are decomposing \(Z\) into the product of two lower-dimensional, nonnegative factors:
\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]
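For instance, with \(n\) samples, \(p\) genes, and \(K\) cell types, the factorization has these shapes in R (a dimensions-only sketch with made-up sizes):
n <- 100; p <- 500; K <- 5
w <- matrix(rgamma(n * K, 1), nrow=n)
w <- w / rowSums(w)                   # n x K: mixture proportions; rows sum to 1
x <- matrix(rgamma(K * p, 2), nrow=K) # K x p: mean expression by cell type
Z_mean <- w %*% x                     # n x p: the approximation to the data
dim(Z_mean)
## [1] 100 500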
PCA finds \(w\) and \(x\) to minimize \[\begin{aligned} \sum_{ij} \| Z_{ij} - \sum_k w_{ik} x_{kj} \|^2 . \end{aligned}\]
In other words, it is the maximum-likelihood solution to \[\begin{aligned} Z_{ij} &\sim \Normal(\sum_k w_{ik} x_{kj}, \sigma^2) . \end{aligned}\] (The eigenvectors are the columns of \(x\), and the eigenvalues are related to the sizes of \(w\) and \(x\).)
stan_pca <- stan_model(model_code="
data {
  int N; // samples
  int L; // variables
  int K; // factors
  real Z[L,N];
}
parameters {
  matrix[L,K] x;       // columns of x are the factors (loadings on the L variables)
  matrix[K,N] w;       // w[,i]: factor scores of sample i
  real<lower=0> sigma; // residual SD
}
model {
  for (j in 1:L) {
    Z[j] ~ normal(x[j] * w, sigma);
  }
}
")
(note: needs some priors to work well; see here.)
A random set of \(k\) proportions \(0 \le P_i \le 1\) has a \(\Dirichlet(\alpha_1, \ldots, \alpha_k)\) distribution if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]
This is useful as a prior on proportions.
The mean is \[ \left( \frac{\alpha_1}{\sum_j \alpha_j}, \frac{\alpha_2}{\sum_j \alpha_j}, \cdots, \frac{\alpha_k}{\sum_j \alpha_j} \right) . \]
This generalizes the Beta: if \(X \sim \Beta(a, b)\) then \((X, 1-X) \sim \Dirichlet(a, b)\).
Marginal distributions are Beta distributed: \(P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)\).
If \(X_i \sim \Gam(\text{shape}=\alpha_i)\), and \[\begin{aligned} P_i = X_i / \sum_{j=1}^k X_j \end{aligned}\] then \(P \sim \Dirichlet(\alpha)\).
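For example, this Gamma construction gives a quick way to simulate Dirichlet draws in R (a sketch; packages such as gtools also provide an rdirichlet()):
# simulate n draws from Dirichlet(alpha) by normalizing independent Gammas
rdirich <- function (n, alpha) {
    x <- matrix(rgamma(n * length(alpha), shape=alpha),
                ncol=length(alpha), byrow=TRUE)
    x / rowSums(x)
}
p <- rdirich(1000, alpha=c(1, 2, 3))
colMeans(p)  # should be close to alpha / sum(alpha) = (1/6, 2/6, 3/6)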
“The \(k\)-simplex” is the set of proportions, i.e., nonnegative numbers \(p\) satisfying \[\begin{aligned} p_1 + \cdots + p_k = 1 . \end{aligned}\]
parameters {
  simplex[K] p;
}
model {
  p ~ dirichlet(alpha);
}
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
Mean expression by cell type.
Cell type proportions by sample.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
\(y_j\), \(\eta_j\) : mean and SD of expression of gene \(j\) across all cell types; shrink \(x_{kj}\) towards \(y_j\).
data {
  int N; // # samples
  int L; // # genes
  int K; // # cell types
  int Z[N,L]; // Z[i,j]: read counts of gene j in sample i
}
parameters {
  matrix<lower=0>[L,K] x;   // x[j,k]: (scaled) expression of gene j in cell type k
  vector[L] y;              // y[j]: overall mean expression of gene j
  simplex[K] w[N];          // w[i]: cell type proportions of sample i
  vector<lower=0>[L] eta;   // eta[j]: per-gene scale (SD of expression)
  vector<lower=0>[K] alpha; // mean cell type proportions across samples
  real<lower=0> d_alpha;    // concentration of proportions about alpha
}
model {
  for (i in 1:N) {
    Z[i] ~ poisson(eta .* (x * w[i]));
    w[i] ~ dirichlet(d_alpha * alpha);
  }
  for (j in 1:K) {
    x[,j] ~ normal(y ./ eta, 1);  // shrink expression in each cell type towards y
  }
  y ~ normal(0, 20);
  alpha ~ normal(0, 1);
  d_alpha ~ exponential(0.2);
  eta ~ cauchy(0, 10);
}
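The sampling() call below uses a compiled model object, nmf_model, which isn't created in the code shown; presumably something like the following, where nmf_code is a hypothetical name for a string holding the Stan program above:
library(rstan)
nmf_model <- stan_model(model_code=nmf_code)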
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]
sampling(nmf_model,
data=list(N=10,
L=5,
K=2,
Z=matrix(rpois(50, 100), ncol=5)),
chains=1, iter=100)
##
## SAMPLING FOR MODEL '027c9a0b92082dc735bfb6350861f7ee' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.8e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.38 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 7
## Chain 1: adapt_window = 38
## Chain 1: term_buffer = 5
## Chain 1:
## Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
## Chain 1: Iteration: 100 / 100 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.645456 seconds (Warm-up)
## Chain 1: 0.724525 seconds (Sampling)
## Chain 1: 1.36998 seconds (Total)
## Chain 1:
## Warning: There were 31 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.51, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## Inference for Stan model: 027c9a0b92082dc735bfb6350861f7ee.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## x[1,1] 1.70 0.32 1.16 0.11 0.91 1.55 2.11 4.54 14 1.02
## x[1,2] 1.36 0.24 0.96 0.17 0.54 1.04 2.10 3.23 16 1.09
## x[2,1] 1.40 0.17 0.83 0.05 0.76 1.32 1.98 3.15 25 0.98
## x[2,2] 1.04 0.17 0.76 0.03 0.39 0.85 1.52 2.54 20 1.02
## x[3,1] 1.46 0.15 0.72 0.36 0.97 1.34 1.66 3.25 24 0.98
## x[3,2] 1.33 0.13 0.80 0.31 0.79 1.21 1.63 3.30 36 0.99
## x[4,1] 1.65 0.32 1.11 0.24 1.01 1.49 2.00 4.37 12 1.06
## x[4,2] 1.50 0.25 1.00 0.29 0.72 1.26 2.04 3.68 16 1.22
## x[5,1] 1.21 0.21 0.77 0.23 0.60 1.01 1.80 2.67 14 1.19
## x[5,2] 0.98 0.15 0.81 0.09 0.33 0.69 1.58 2.48 31 0.98
## y[1] 22.18 9.39 24.04 -15.51 5.02 17.74 38.62 68.42 7 1.28
## y[2] 14.92 6.30 17.30 -35.90 10.08 17.12 23.87 42.64 8 1.14
## y[3] 21.47 5.53 16.05 -3.37 8.59 23.05 33.59 48.79 8 1.10
## y[4] 17.65 9.52 22.42 -22.42 3.22 17.35 27.23 61.54 6 1.48
## y[5] 7.17 7.67 16.53 -30.15 -4.25 8.73 21.07 29.44 5 1.32
## w[1,1] 0.66 0.17 0.38 0.01 0.19 0.91 0.96 1.00 5 1.05
## w[1,2] 0.34 0.17 0.38 0.00 0.04 0.09 0.81 0.99 5 1.05
## w[2,1] 0.61 0.14 0.33 0.03 0.31 0.75 0.89 1.00 5 1.05
## w[2,2] 0.39 0.14 0.33 0.00 0.11 0.25 0.69 0.97 5 1.05
## w[3,1] 0.64 0.16 0.38 0.00 0.29 0.84 0.96 0.99 5 1.06
## w[3,2] 0.36 0.16 0.38 0.01 0.04 0.16 0.71 1.00 5 1.06
## w[4,1] 0.62 0.15 0.36 0.01 0.20 0.80 0.91 1.00 6 1.06
## w[4,2] 0.38 0.15 0.36 0.00 0.09 0.20 0.80 0.99 6 1.06
## w[5,1] 0.64 0.15 0.35 0.01 0.30 0.80 0.95 0.99 5 1.06
## w[5,2] 0.36 0.15 0.35 0.01 0.05 0.20 0.70 0.99 5 1.06
## w[6,1] 0.64 0.16 0.38 0.01 0.21 0.85 0.95 1.00 5 1.06
## w[6,2] 0.36 0.16 0.38 0.00 0.05 0.15 0.79 0.99 5 1.06
## w[7,1] 0.65 0.17 0.38 0.01 0.20 0.85 0.97 1.00 5 1.06
## w[7,2] 0.35 0.17 0.38 0.00 0.03 0.15 0.80 0.99 5 1.06
## w[8,1] 0.66 0.16 0.37 0.00 0.24 0.89 0.94 1.00 5 1.08
## w[8,2] 0.34 0.16 0.37 0.00 0.06 0.11 0.76 1.00 5 1.08
## w[9,1] 0.65 0.18 0.40 0.01 0.18 0.90 0.98 1.00 5 1.06
## w[9,2] 0.35 0.18 0.40 0.00 0.02 0.10 0.82 0.99 5 1.06
## w[10,1] 0.64 0.16 0.36 0.04 0.29 0.83 0.94 0.99 5 1.07
## w[10,2] 0.36 0.16 0.36 0.01 0.06 0.17 0.71 0.96 5 1.07
## eta[1] 86.34 13.14 96.26 22.72 45.77 59.16 88.58 290.04 54 0.99
## eta[2] 85.78 12.59 63.46 33.04 48.14 66.29 90.28 240.62 25 1.05
## eta[3] 87.22 6.01 55.41 28.69 56.57 71.68 100.63 219.86 85 0.99
## eta[4] 66.51 8.88 40.74 20.36 44.61 60.32 78.44 191.69 21 1.12
## eta[5] 116.24 11.21 84.55 34.96 56.82 94.13 144.99 315.12 57 1.05
## alpha[1] 1.13 0.24 0.74 0.06 0.57 1.22 1.66 2.63 10 1.06
## alpha[2] 0.51 0.19 0.55 0.06 0.13 0.23 0.81 1.77 9 0.98
## d_alpha 10.31 1.17 7.34 2.58 4.83 8.50 13.39 29.95 39 0.99
## lp__ 17330.43 2.49 8.25 17316.13 17325.34 17329.97 17335.56 17347.45 11 1.09
##
## Samples were drawn using NUTS(diag_e) at Tue Mar 3 21:19:26 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
How many cell types?
How many genes?
How many samples?
How much noise in expression?
How many genes distinguish cell types, and by how much relative to expression?
How many “noisy” genes? How many “noisy” samples?
How much variation in mixture proportions?
num_genes <- 500
num_cell_types <- 5
num_samples <- 100
num_diff_genes <- 400
mean_total_expression <- 50e6
sd_total_expression <- 5e6
gene_names <- paste0("gene_", apply(matrix(sample(letters, 5*num_genes, replace=TRUE), nrow=num_genes), 1, paste, collapse=''))
stopifnot(length(unique(gene_names)) == num_genes)
# mean expression profiles by cell type:
# x[k,j] is mean expression of gene j in cell type k.
# we will have roughly num_diff_genes / num_cell_types genes specific to each cell type,
# and each of these differing genes will have some randomly chosen expression level
x <- matrix(0, nrow=num_cell_types, ncol=num_genes)
colnames(x) <- gene_names
# vector of typical expression levels across *all* cell types
x_mean <- rgamma(num_genes, shape=0.5, scale=20000)
# which cell type (if any) each gene is specific to: 0 means none (not differentially expressed)
diff_gene <- sample(0:num_cell_types, num_genes, replace=TRUE)
for (k in 1:num_cell_types) {
    diffex <- which(diff_gene == k)
    x[k,] <- x_mean
    x[k,diffex] <- x[k,diffex] * runif(length(diffex), min=0, max=3)
    # x[k,diffex] <- 10 * x[k,diffex]
}
# cell type proportions
# w[i,k] is proportion of sample i of cell type k.
# use gammas with shapes=vector of low integers;
# normalize to get Dirichlet
w <- matrix(0, nrow=num_samples, ncol=num_cell_types)
for (i in 1:num_samples) {
    dirich <- rgamma(num_cell_types, rep(1, num_cell_types))
    w[i,] <- dirich/sum(dirich)
}
stopifnot(all(abs(rowSums(w) - 1) < 1e-14))
# put these together to make total expression level
# Z[i,j] will be read counts of sample i for gene j
# 1. scale mean expression based on total expression level
# 2. matrix multiply mean scaled levels by proportions (x * w)
# 3. actual expression levels should be (poisson? normal?)
# this will have mean_Z[i,j] = (sum_k w[i,k] x[k,j])
mean_Z <- (w %*% x)
# but we want the total sequencing depth of sample i to be y[i], so rescale
# (y is not defined above; presumably it is the per-sample total read count,
#  drawn here using the totals set at the top)
y <- rnorm(num_samples, mean=mean_total_expression, sd=sd_total_expression)
mean_Z <- y * mean_Z / rowSums(mean_Z)
stopifnot(all(abs(y - rowSums(mean_Z)) < 1e-8))
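A minimal way to finish step 3 above and generate the simulated counts (assuming Poisson noise, one of the options mentioned in the comments):
# simulated read counts: Poisson noise around the scaled mean expression levels
Z <- matrix(rpois(num_samples * num_genes, lambda=mean_Z),
            nrow=num_samples, ncol=num_genes)
colnames(Z) <- gene_names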
class_model_code2 <- "
parameters {
  simplex[num_cell_types] w[num_samples];
  // w[i][k] is proportion of sample i that is cell type k
  matrix<lower=0>[num_cell_types, num_genes] x;
  // x[k,j] is typical expression level of gene j in cell type k
  vector<lower=0>[num_samples] y;
  // y[i] is total expression level of sample i
  vector<lower=0>[num_genes] x_mean;
  // x_mean[j] is 'typical' expression level of gene j across all cell types
  real<lower=0> eta; // SD of x about x_mean
  vector<lower=0>[num_cell_types] alpha; // mean of dirichlet
}"
class_model_code3 <- "
model {
  matrix[num_samples, num_genes] mean_Z; // mean_Z[i,j] is 'expected' expression of gene j in sample i
  for (i in 1:num_samples) {
    for (j in 1:num_genes) {
      // matrix mult version?
      // mean_Z[i,j] = y[i] * (w[i] * x[,j]); // the first * is normal; the second is matrix mult
      // for loop version:
      mean_Z[i,j] = 0.0;
      for (k in 1:num_cell_types) {
        mean_Z[i,j] += y[i] * w[i][k] * x[k,j];
      }
    }
    Z[i,] ~ poisson(mean_Z[i,]);
  }
  y ~ normal(0, 10);
  for (k in 1:num_cell_types) {
    x[k,] ~ normal(x_mean, eta * x_mean); // maybe this should be lognormal
  }
  for (i in 1:num_samples) {
    w[i] ~ dirichlet(alpha);
  }
  alpha ~ exponential(1);
  eta ~ normal(0, 4);
  x_mean ~ gamma(0.5, 0.5e-4);
}
"
We do a very good job of estimating mixture proportions, \(w\):
opt_w <- class_fit$par[grepl("^w", names(class_fit$par))]
dim(opt_w) <- c(num_samples, num_cell_types)
cor(w, opt_w)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.55251630 -0.1093521 -0.73624936 -0.35751026 0.98772093
## [2,] 0.92630791 -0.5112326 0.09646298 -0.10969093 -0.28923879
## [3,] -0.08445503 -0.5009004 -0.33497043 0.95348430 -0.07801488
## [4,] -0.42205844 0.8740198 -0.07369501 -0.44996998 0.04533575
## [5,] 0.02055711 0.2733963 0.88992645 -0.07613433 -0.51310686
We are estimating overall mean expression less well:
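For instance, a sketch of how one might pull out the estimated overall expression levels and compare them to the simulated truth (following the grepl() extraction pattern used below):
opt_x_mean <- class_fit$par[grepl("^x_mean", names(class_fit$par))]
plot(x_mean, opt_x_mean, xlab="true x_mean", ylab="estimated x_mean")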
And, similarly, we estimate cell-type-specific expression less well:
opt_x <- class_fit$par[grepl("^x\\[", names(class_fit$par))]
dim(opt_x) <- c(num_cell_types, num_genes)
cor(t(x), t(opt_x))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.2826979 0.4820000 0.3362926 0.2461818 0.5468449
## [2,] 0.6127233 0.4143438 0.3468935 0.2641375 0.2284147
## [3,] 0.3908400 0.4193718 0.3373547 0.4889322 0.2725535
## [4,] 0.3134202 0.6438825 0.3642748 0.2071764 0.2544726
## [5,] 0.3798235 0.4908773 0.4712599 0.2872747 0.2357471