Week 17 - MANOVA and DFA

February 19 & 21, 2019

TUESDAY - Finish PCoA and NMDS

Goals for this week

  • Finish PCoA and NMDS
  • Introduce Clustering
  • Multivariate Analysis of Variance (MANOVA)
  • Discriminant Function Analysis (DFA)

Principal Coordinate Analysis (PCoA) and Multidimensional Scaling (MDS)

How does PCoA differ from PCA

  • PCoA uses dissimilarity measures among objects (not variables) to perform the eigenanalysis
    • dissimilarity metric if the measure is metric (the distance scale has meaning)
    • dissimilarity index or dissimilarity measure if non-metric (only the ordering matters)
    • always good to know and specify which you’re using
  • Dissimilarity measures can still have distributional assumptions about the data
    • when using metrics (e.g. Euclidean distance)
    • because we’re using linear models as we did in PCA
  • Dissimilarity indices are useful when variables are
    • measured on very different scales
    • when some of the variables include a lot of zero values

Dissimilarity measures and metrics

  • PCoA is also called
    • classical multidimensional scaling or
    • metric multidimensional scaling.
  • The major benefits of PCoA are
    • the ability to choose among different distance measures
    • good performance when you have many more variables than observations
    • you can pick the measure that most closely represents biologically meaningful differences between objects
  • When Euclidean distance is used, PCoA is the same as PCA.
  • When a non-metric dissimilarity is used
    • we call it an NMDS
    • need to use different downstream analyses (e.g. PerMANOVA)

Dissimilarity measures and metrics

  • Metric coefficients satisfy four criteria:
    • the distance between identical objects is “0”, which is the minimum of the coefficient;
    • if objects are not identical, a metric coefficient will have a positive value;
    • symmetry, whereby the distance between A and B is the same as that between B and A;
    • conformance to the triangle inequality, which specifies that for any triangle with vertices A, B, and C, the direct path AB is never longer than the path AC + CB (equivalently BC + CA).

Dissimilarity metrics

Dissimilarity indices that are semi-metric


  • Binary data
    • Jaccard
    • Simple matching coefficient
    • Sorensen
  • Raw count data
    • Bray-Curtis
    • Steinhaus
  • Mixed data
    • Gower’s

PCoA Analysis steps

  • Closely related to PCA when a metric dissimilarity is used
  • Starts with an n-by-n matrix of object dissimilarities
  • Transformed and then subjected to eigenanalysis
  • As in PCA,
    • most of the information will be in the first few dimensions
    • the eigenvectors are scaled to obtain weightings, but it’s difficult to relate these back to the original variables
    • However, the coefficients of these eigenvectors are then used to position objects on each PCoA via their new derived scores
  • If Euclidean distance was used for the dissimilarity matrix, PCA and PCoA will be very similar

R Interlude - Principal Coordinate Analysis (PCoA)



  • PCoA is a distance-based ordination method that can be performed via the capscale() function in the package VEGAN.
  • You can also use cmdscale() in the base installation, but you would need to produce a distance matrix from the original data.
  • The capscale() function is designed for another purpose, so the syntax is a bit different than the other ordination methods, but it can be used to perform PCoA (see the sketch below).

R Interlude - PCoA

  • must specify dataframe~1 (where dataframe is the sample/variable data matrix) to perform PCoA
  • must specify a distance from distances provided in vegdist()
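
A minimal sketch of the capscale() call; the data frame name dat is a placeholder for your sample-by-variable matrix:

    library(vegan)
    # dat ~ 1 means "no constraining variables", so the result is an
    # unconstrained ordination (PCoA) of the chosen dissimilarities
    pcoa_fit <- capscale(dat ~ 1, distance = "bray")
    summary(pcoa_fit)    # eigenvalues and sample scores
    plot(pcoa_fit)       # ordination plot of the first two axes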

R Interlude - PCoA

  • The vegdist() function offers many distances, including some more applicable to (paleo)ecological data:
  • Distances available in vegdist() are: "manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial", or "chao"; the default is "bray" (Bray-Curtis).
  • Try using the different distances in vegdist() to see how it affects your results
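
One way to compare distances, as a sketch (dat is again a placeholder for the sample-by-variable data frame):

    library(vegan)
    for (m in c("bray", "jaccard", "gower", "euclidean")) {
      d   <- vegdist(dat, method = m)   # dissimilarity matrix under this measure
      fit <- capscale(d ~ 1)            # PCoA of that matrix
      plot(fit)
      title(main = m)                   # label the plot with the distance used
    }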

R Interlude - PCoA

  • If you use the cmdscale() function from the base R installation you will need a data frame containing only numerical data (there can be row names).
  • The default arrangement is to have the samples (or sites) in rows and the measured variables (or counts) in columns.
  • You can transpose a data frame (or matrix), swapping rows and columns, using the transpose function t()
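
A sketch with cmdscale(), assuming dat holds numeric counts with samples in rows (transpose first if your samples are in columns):

    dat_t <- t(dat)                              # transpose: rows become columns and vice versa
    d     <- dist(dat, method = "euclidean")     # base-R distance matrix
    pcoa  <- cmdscale(d, k = 2, eig = TRUE)      # classical (metric) MDS / PCoA
    plot(pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2")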

Multidimensional Scaling (MDS)

MDS goals and benefits

  • Data reduction - reduce many variables to a smaller number of axes that adequately summarize the original information.
  • Scaling - reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately; objects are scaled directly based on the dissimilarities between them.
  • Ordination plots can show these multivariate dissimilarities in lower dimensional space.
  • However, MDS is specifically designed to graphically represent relationships between objects in multidimensional space, so subsequent analysis is more difficult.
  • MDS is more flexible than PCA in being able to use just about any dissimilarity measure among objects, not just Euclidean Distance.

NMDS goals and benefits

  • Nonmetric multidimensional scaling (NMDS or NMS) is an ordination technique that differs in several ways from nearly all other ordination methods.
  • In NMDS, a small number of axes are explicitly chosen prior to the analysis and the data are fitted to those dimensions
  • Most other ordination methods are analytical, but NMDS is a numerical technique that iteratively seeks a solution and stops computation when a solution is found.

NMDS goals and benefits

  • NMDS is not an eigenvalue-eigenvector technique like PCA. As a result, an NMDS ordination can be rotated, inverted, or centered to any desired configuration.
  • Unlike other ordination methods, NMDS makes few assumptions about the nature of the data (e.g. PCA assumes linear relationships) so is well suited for a wide variety of data.
  • NMDS also allows the use of any distance measure of the samples, unlike other methods which specify particular measures (e.g. Euclidean via covariance or correlation in PCA).

NMDS goals and benefits

  • NMDS does suffer from two principal drawbacks, although these are becoming less important as computational power increases.
  • First, NMDS is slow, particularly for large data sets.
  • Second, because NMDS is a numerical optimization technique, it can fail to find the true best solution because it can become stuck on local minima.

MDS - one big ordination family

Finding the best ordination in MDS

Stress - the MDS analog of residuals in linear models

  • A Shepard diagram plots the relationship between the original dissimilarities and the ordination distances
  • Fit a linear or non-linear regression between the two
  • The differences between the ordination distances and the fitted values (the disparities) are the residuals from this model
  • These residuals are then summarized to see how well the new ordination captures the original information
  • One measure is called Kruskal’s Stress
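
For reference, Kruskal's stress-1 can be written in terms of the ordination distances d and the disparities (the fitted values) from the regression above; this is a standard formulation, not specific to any one software package:

\[ \mathrm{Stress}_1 = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \hat{d}_{ij}\right)^{2}}{\sum_{i<j} d_{ij}^{2}}} \]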


Stress - the MDS analog of residuals in linear models

  • The lower the stress value, the better the match
  • When the relationship is linear, the fit can be metric (MDS)
  • When it’s not, the fit is based on rank orders and the scaling is non-metric (NMDS)
  • NMDS is quite robust and is often used in areas such as ecology and microbiology
  • Stress values greater than 0.3 indicate that the fit is no better than arbitrary, and we’d really like a stress that is 0.15 or less

Scaling plots or ordinations

  • Relates the objects to one another in the derived variable space
  • Only the relative distances between objects are important for interpretation

Testing hypotheses in MDS

  • What if we have factor variables that we’d like to use in an analysis?

  • ANOVA on PC scores
  • MANOVA on the original variables
    • MANOVA on the derived axis scores from an MDS
    • ANOSIM or perMANOVA on the dissimilarity matrix underlying an NMDS

Analysis of Similarities (ANOSIM)

  • Very similar to ANOVA
  • Uses Bray-Curtis dissimilarities, but could use any measure
  • Calculates a test statistic comparing the rank dissimilarities within groups to those among groups
  • Uses a randomization procedure, so it’s pretty robust to assumption violation
  • Complex tests (nesting or factorial) are difficult to do in ANOSIM

Non-parametric MANOVA (perMANOVA)

  • Similar to MANOVA
  • Extends sum-of-squares (SS) partitioning to non-metric dissimilarity measures
  • Can be applied to any design structure
  • n-by-n matrix of dissimilarities for pairs of objects h and i
  • partition the sum-of-squares (SS) of the dissimilarities
  • perform F-tests as for ANOVA
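
A minimal perMANOVA sketch using vegan's adonis2(); the objects dat, meta, and the grouping variable group are placeholders:

    library(vegan)
    d <- vegdist(dat, method = "bray")                    # dissimilarities among samples
    adonis2(d ~ group, data = meta, permutations = 999)   # permutation-based pseudo-F test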

CLUSTERING


  • Goal - Partition a heterogeneous overall group of data into a set of relatively homogeneous smaller groups (subsets)
  • Natural extension of MDS approaches
  • Used (misused) extensively in many areas of biology
  • Sometimes also called network analysis


CLUSTERING

  • Mostly used as an exploratory or discovery approach in combination with other analyses.
  • Usually used without a priori ideas of group membership.
  • Often explore a range of possible clusters (K=1, 2, 3, …., n)
  • Sometimes called K-means clustering
  • Has four main steps
    • Acquire raw data of n objects measured on p variables
    • Derive an inter object association or dissimilarity matrix
    • Cluster formation using a clustering algorithm
    • Visualization and comparison of the clusters
  • There are numerous options, especially for step 3 - Neighbor joining, UPGMA, etc…

CLUSTERING

  • Could use any form of similarity
    • Covariance or correlation matrix
    • Euclidean distances (metric)
    • Dissimilarity measures (metric or nonmetric)
    • Matching-type measures of similarity (presence-absence)
  • In reality both objects and variables could be clustered simultaneously


Cluster Analysis - cluster formation

Hierarchical clustering process

Agglomerative and Divisive hierarchical clustering

  • Long history in phylogenetics
  • Start with all objects separated and then bring them together sequentially
  • Main difference among algorithms is how dissimilarities among clusters are recalculated as the objects are brought together
    • Single linkage (nearest neighbor) - minimum dissimilarity among all pairs of objects in the two clusters
    • Complete linkage (furthest neighbor) - maximum dissimilarity among all pairs of objects in the two clusters
    • Average linkage (group mean) - average dissimilarity among all pairs of objects in the two clusters; often used

Agglomerative and Divisive hierarchical clustering

  • UPGMA - Unweighted Pair-Group Method with Arithmetic mean.
    • Commonly used average linkage clustering algorithm
    • Also WPGMA (weighted) and UPGMC (based on centroids)
  • Divisive hierarchical clustering has a long history in ecological data analysis
  • Start with a large group and proceed by splitting into smaller groups
    • Twinspan - two-way indicator species analysis
  • Agglomerative methods are now being used more often in ecology as well as phylogenetics and genomics

Cluster Analysis - Non-hierarchical clustering

  • Non-hierarchical clustering does not assume a strict bifurcating (tree-like) structure
  • Instead, the clustering can be represented in a reticulating (network-like) structure
  • In contrast to hierarchical clustering, objects can be reassigned to clusters throughout the analysis
  • Also called fuzzy clustering in that objects can belong to different clusters with a defined probability
  • Becoming more common in population genetics, for example, with Bayesian posterior probabilities (PP) of belonging to a population.

Cluster Analysis - Non-hierarchical clustering

  • K-means non-hierarchical clustering is becoming a popular approach
    • Start with a pre-defined number of clusters (k), and
    • Next assign objects to clusters
    • Then iteratively re-evaluate membership to clusters
    • Lastly maximize the ratio of between to within dissimilarity
  • Can be represented as a network diagram or posterior probability plot
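
A quick base-R k-means sketch on a numeric data frame dat (a placeholder name), with a pre-defined k:

    set.seed(1)
    km <- kmeans(scale(dat), centers = 3, nstart = 25)  # k = 3 clusters, 25 random starts
    table(km$cluster)                                    # cluster sizes
    km$betweenss / km$totss                              # between- vs. total sum of squares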

R Interlude - MultiDimensional Scaling

R Interlude - MDS

  • Use VEGAN again
  • We’ll analyze the yeast RNAseq dataset again with samples as rows and genes as columns.

R Interlude - MDS

  • Generate a dissimilarity matrix for all samples using vegdist().
  • We use decostand() to “normalize,” which accounts for differing total read counts per sample.
  • If the expression data are already normalized (e.g. copies per million), this step is not needed.
  • The vegdist() function offers many distances.
  • Distances available in vegdist() are: "manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial", or "chao"; the default is "bray" (Bray-Curtis).
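
One way this might look, assuming the RNA-seq counts sit in a data frame called yeast with samples in rows (the object names are placeholders, reused in the later sketches):

    library(vegan)
    clus.dis1  <- vegdist(yeast, method = "bray")        # dissimilarities from raw counts
    yeast_norm <- decostand(yeast, method = "total")     # per-sample proportions ("normalized")
    clus.dis2  <- vegdist(yeast_norm, method = "bray")   # dissimilarities from normalized counts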

R Interlude - MDS

  • Turn the raw data matrix into a dissimilarity matrix
  • The decostand function is a form of normalization
  • Perform the ordination of the samples using multidimensional scaling.
  • Examine the ‘stress’ values when moving from a higher to a lower dimensionality
  • Usually a value of 0.15 or lower is considered acceptable
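
A sketch of the NMDS itself with vegan's metaMDS(), using the normalized dissimilarities from the sketch above:

    nmds2 <- metaMDS(clus.dis2, k = 2, trymax = 50)   # 2-dimensional NMDS
    nmds2$stress                                      # < 0.15 is usually considered acceptable
    nmds3 <- metaMDS(clus.dis2, k = 3, trymax = 50)   # compare the stress in 3 dimensions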

R Interlude - MDS

  • Let’s take a look at how the dissimilarities among samples map onto the ordination distances. Notice that there is a fit with the data, but we’re no longer assuming consistent linearity over the entire data set.
  • What does the R^2 value tell you? Is the model accurately predicting the observed dissimilarity?
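
vegan's stressplot() draws the Shepard diagram and reports fit statistics for the ordination (nmds2 comes from the sketch above):

    stressplot(nmds2)    # Shepard diagram: dissimilarity vs. ordination distance
    goodness(nmds2)      # per-sample goodness-of-fit values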

R Interlude - MDS

  • Now let’s look at the grouping of the samples in this lower dimensional space. Is there any clustering?
  • Rerun the ordination and add all of the data (genes) as well to the plot. How does this plot compare to your first plot? What’s all that red stuff?
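
A plotting sketch; when the NMDS is re-run on the normalized community matrix itself (yeast_norm from above) rather than on a pre-computed distance, metaMDS also keeps gene ("species") scores that can be overlaid in red:

    nmds_full <- metaMDS(yeast_norm, distance = "bray", k = 2, trymax = 50)
    fig <- ordiplot(nmds_full, type = "none")       # set up an empty ordination plot
    points(fig, "sites", pch = 19)                  # samples, in black
    text(fig, "species", col = "red", cex = 0.5)    # genes ("species" scores), in red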

R Interlude - MDS

  • We can run a PCA on our data as well, which is a metric analysis that utilizes Euclidean distances
  • What do you notice about the eigenvalues of the PCs?
  • How many original variables were there? How many eigenvectors will there be?
  • Show both the locations of the samples and the variables.

R Interlude - MDS

  • Try different plots that show one or the other or both
  • What are these plots showing? What does that scaling argument do?
  • What is in red? What is in black?
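
A PCA sketch using vegan's rda() on the normalized data (yeast_norm from above), plus biplots that show samples, variables, or both; the scaling argument controls whether distances among samples (scaling = 1) or relationships among variables (scaling = 2) are emphasized:

    pca <- rda(yeast_norm, scale = TRUE)       # PCA on the correlation matrix
    summary(eigenvals(pca))                    # eigenvalues and proportion of variance explained
    biplot(pca, scaling = 1, display = "sites")                  # samples only
    biplot(pca, scaling = 2, display = c("sites", "species"))    # samples as points, genes as arrows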

R Interlude - MDS

  • We can use the dissimilarity matrices to perform hierarchical clustering.
  • Try both the non-normalized (clus.dis1) and normalized (clus.dis2) distances.
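
A hierarchical clustering sketch using the two distance matrices from the earlier sketch (clus.dis1 raw, clus.dis2 normalized):

    hc1 <- hclust(clus.dis1, method = "average")   # UPGMA on raw-count distances
    hc2 <- hclust(clus.dis2, method = "average")   # UPGMA on normalized distances
    par(mfrow = c(1, 2))
    plot(hc1, main = "raw")
    plot(hc2, main = "normalized")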

R Interlude - MDS

  • Now, try these different versions of clustering. What is different about them?
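
For example, the agglomeration method can be varied (a sketch using the normalized distances from above):

    hc_single   <- hclust(clus.dis2, method = "single")     # nearest neighbor
    hc_complete <- hclust(clus.dis2, method = "complete")   # furthest neighbor
    hc_ward     <- hclust(clus.dis2, method = "ward.D2")    # Ward's minimum-variance method
    par(mfrow = c(1, 3))
    plot(hc_single); plot(hc_complete); plot(hc_ward)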

R Interlude - MDS

  • Lastly, let’s ask R to cut the tree into several clusters for us.
  • Try it with different numbers of clusters and the different types of clustering from above.
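
A sketch of cutting the dendrogram (hc2 from above) into k clusters and marking them on the plot:

    groups <- cutree(hc2, k = 4)                         # assign each sample to one of 4 clusters
    table(groups)                                        # cluster sizes
    plot(hc2)
    rect.hclust(hc2, k = 4, border = "red")              # outline the 4 clusters on the tree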

THURSDAY - MANOVA and DFA

Goals for this week


  • Finish PCoA and NMDS
  • Introduce Clustering
  • Multivariate Analysis of Variance (MANOVA)
  • Discriminant Function Analysis (DFA)

Multivariate Analysis of Variance (MANOVA) and Discriminant Function Analysis (DFA)

Conceptual overview of MANOVA & DFA

  • Both focus on the analysis of a factor variable but with multiple continuous response variables

  • MANOVA - explicitly tests for differences between the levels of a factor (or factors) using the first discriminant function

  • DFA - defines the set of linear discriminant functions that most clearly differentiate the levels of a factor variable in multivariate space

  • MANOVA focuses on hypothesis testing and DFA is used more for classification



Multivariate Analysis of Variance (MANOVA)

Why use MANOVA?

  • When we’re interested in the relationship of two or more response variables and one or more predictor variables
    • If we’re interested in each of the responses, then univariate ANOVA is more appropriate
    • If we are interested in group differences when all of the response variables are considered simultaneously, MANOVA is more appropriate
  • The hypothesis is now that there are group effects on the linear combination of variables, tested by comparing group centroids across two or more variables

Single Factor MANOVA

\[ z_{ik} = constant + c_1y_{i1} + c_2y_{i2} + c_3y_{i3} + ... + c_py_{ip}\]

  • MANOVA uses the linear combination (z) of the p response variables that maximizes the ratio of between and within group variances of z
  • This linear combination is also called a Discriminant Function (MANOVA uses DF1)
  • The between-groups, within-groups and total SS used in ANOVA are replaced by sums-of-squares and sums-of-cross-products matrices (S)
  • One matrix is formed for between groups (the H matrix), another matrix is formed for within groups (the E matrix), and the final matrix is the total (the T matrix)

Single Factor MANOVA

  • Multiple linear response variables and one predictor factor variable with two or more levels
  • Use matrix algebra to “divide” the H matrix by the E matrix (i.e., multiply H by the inverse of E)
  • The resulting matrix product is decomposed into eigenvalues and eigenvectors to determine how much of the between group variation is explained by the new variables
  • The linear combination that produces the largest eigenvalue is the eigenvector that maximizes the ratio of between-group and within-group variance

Single Factor MANOVA - linear model notation

Single Factor MANOVA - hypothesis tests

  • The null hypothesis is that the effect of the factor variable is zero with respect to the linear combination of response variables
  • The null can be tested using test statistics based on one of the measures of variance in the matrix (either the determinant or the trace)
  • Like in ANOVA we use the ratio of “between-groups” variance matrix to “within-groups” variance matrix.
  • If this ratio is high, the variance explained by grouping according to the factor levels is meaningful.
  • If low, we’re not explaining additional variation in our latent variable by including the predictor in the model.

Single Factor MANOVA - Statistics

  • Common statistics
    • Wilks’ lambda
    • Hotelling-Lawley trace
    • Pillai’s trace
    • Roy’s largest root
  • All of these are interpreted in the same general way as other test statistics
  • Smaller p-values indicate stronger evidence of group differences (note that for Wilks’ lambda it is smaller values of the statistic, not larger, that indicate greater separation)
  • Wilks’, Hotelling’s, and Pillai’s statistics produce identical F-tests when there are only two groups, where they reduce to Hotelling’s T² statistic

MANOVA results table



MANOVA contrasts

  • A priori contrasts can be performed as before
    • analogous to planned contrasts in single factor ANOVA
  • Unplanned, post-hoc comparisons of group means are much more difficult
  • What can you do?
    • Perform univariate ANOVA on each factor (or pseudofactor of combined factors)
    • Step-down analysis for model selection
    • Discriminant Function investigation by looking at the first eigenvector

Single Factor MANOVA - Assumptions

  • Normality - less important, especially for Pillai’s trace test statistic
  • Homogeneity of variance (and covariance) - important, but tricky
  • Data on very different scales can also be a problem - standardize
  • Outliers - multivariate tests are very sensitive to these
    • Pillai’s trace is the most robust to deviations from homogeneity of variance
    • Can use Mahalanobis distance to detect outliers and then decide how to handle them

Outliers and Missing Data

  • Outliers in multivariate space can be particularly misleading but difficult to detect
  • Can use the Mahalanobis distance - the distance of an object from the multivariate centroid, scaled by the covariances among the variables
  • Options
    • transform
    • remove
  • As with univariate analyses the removal of outliers should be justified, or both analyses should be presented

Identifying Outliers


Outliers and Missing Data


  • Missing observations are particularly vexing for multivariate analyses
    • MCAR - missing completely at random
    • MAR - missing at random
    • MBC - missing but correlated
  • Three approaches
    • Deletion - remove the entire object from the data set
    • Imputation - infer the value based upon other values
    • Maximum Likelihood and Expectation Maximization

R interlude - MANOVA using stickleback RNAseq data

R interlude | MANOVA stickleback RNAseq

  • This is the same dataset as you’ve used previously but with a few modifications
  • FACTORS - Genotype, microbiota, a combined genotype and microbiota
  • RESPONSE - 600 genes as counts
  • Do these variables co-vary? (Try some scatterplots; a sketch follows.)
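
A quick sketch for eyeballing covariation among a few genes; stickle is a placeholder for the data frame, and the first five columns are assumed to hold gene counts:

    genes <- stickle[, 1:5]    # assuming the first five columns are gene counts
    pairs(genes)               # scatterplot matrix
    cor(genes)                 # pairwise correlations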

R interlude | MANOVA stickleback RNAseq


  • Detect and then show list of outliers based on Mahalanobis distances
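
A sketch of Mahalanobis-distance screening on a few genes (stickle and the gene column names are placeholders); mahalanobis() returns squared distances, which are compared against a chi-square cutoff:

    X  <- as.matrix(stickle[, c("gene1", "gene2", "gene3")])   # a few response variables
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared distances from the centroid
    cutoff <- qchisq(0.999, df = ncol(X))                      # conservative chi-square cutoff
    which(d2 > cutoff)                                         # row numbers of candidate outliers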

R interlude | MANOVA stickleback RNAseq


  • Test for multivariate normality
  • Are our data multivariate normal?
  • Specify the first main effect (e.g., genotype) as the grouping variable
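
One option is the multivariate Shapiro-Wilk test from the mvnormtest package (a sketch; the package expects variables in rows, hence the transpose, and the object and column names are placeholders):

    library(mvnormtest)
    # H0: the responses are multivariate normal
    mshapiro.test(t(as.matrix(stickle[, c("gene1", "gene2", "gene3")])))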

R interlude | MANOVA stickleback RNAseq


  • Fit the MANOVA model
  • Are the F-ratios and p-values different?
  • What does summary.aov() do, and what does this tell you about the data?
  • Now change the main effect variable around so that it has the four microbiota&genotype levels.
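
A minimal MANOVA sketch with three genes and genotype as the factor (stickle and the gene/factor column names are placeholders):

    fit <- manova(cbind(gene1, gene2, gene3) ~ genotype, data = stickle)
    summary(fit, test = "Pillai")    # multivariate test (also try test = "Wilks")
    summary.aov(fit)                 # separate univariate ANOVA for each response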

R interlude | MANOVA stickleback RNAseq


  • Fit the MANOVA model with a new combined effect (“genotype-microbiota combination”):
  • Any differences relative to before?
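
A sketch of the combined four-level factor, built with interaction() (names are placeholders as before):

    stickle$geno_micro <- interaction(stickle$genotype, stickle$microbiota)   # 4 combined levels
    fit2 <- manova(cbind(gene1, gene2, gene3) ~ geno_micro, data = stickle)
    summary(fit2, test = "Pillai")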

R interlude | MANOVA stickleback RNAseq


  • Now change the model around so that it is a factorial MANOVA with microbiota and genotype as separate factors.
  • Any differences relative to before?
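
And a factorial sketch with both main effects plus their interaction:

    fit3 <- manova(cbind(gene1, gene2, gene3) ~ genotype * microbiota, data = stickle)
    summary(fit3, test = "Pillai")   # genotype, microbiota, and genotype-by-microbiota terms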

R interlude | MANOVA stickleback RNAseq


  • Now include 10 response variables, re-evaluate assumptions, and re-fit the factorial model.
  • Now include 100 response variables and re-fit the factorial model.

Discriminant Function Analysis (DFA)

Conceptual overview of MANOVA & DFA

  • Both focus on the analysis of a factor variable but with multiple continuous response variables

  • MANOVA - explicitly tests for differences between the levels of a factor (or factors) using the first discriminant function

  • DFA - defines the set of linear discriminant functions that most clearly differentiate the levels of a factor variable in multivariate space

  • MANOVA focuses on hypothesis testing and DFA is used more for classification



Steps involved in Discriminant Function Analysis (DFA)

  1. Run MANOVA to test for a group difference
    • If there is no significant difference, DFA is unlikely to be useful
    • Single factor ANOVA on each response variable
  2. Calculate Linear Discriminant Functions (LDF) scores
    • Evaluate the distributions of these scores
    • Do they have separate distributions for the groups?
  3. Classify observations into groups
    • Generate conditional probabilities for each observation belonging to a given group
    • Determine the percentage of ‘correct’ assignments
    • Possible to “train” LDF with one data set, and then assign individuals from a new data set to factor levels



Discriminant Function Scores

R INTERLUDE - Discriminant Function Analysis (using the same RNA-seq data)

R INTERLUDE - DFA stickleback RNAseq


  • This is the same dataset as you’ve used previously but with a few modifications
  • FACTORS - Genotype, microbiota, a combined genotype and microbiota
  • RESPONSE - 600 genes as counts

R INTERLUDE - DFA stickleback RNAseq


  • What information do the different components of clay_lda contain?
  • Show results focused on first column of the DFA weightings
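
A sketch of the lda() call (MASS package); clay_lda is the object name used in the exercise, while the data frame and column names are placeholders:

    library(MASS)
    clay_lda <- lda(genotype ~ gene1 + gene2 + gene3, data = stickle)
    names(clay_lda)           # components of the lda object (prior, means, scaling, ...)
    clay_lda$scaling[, 1]     # weights of the first linear discriminant function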

R INTERLUDE - DFA stickleback RNAseq


  • Now, repeat the analysis but use the first 100 genes instead of just 3
  • Don’t worry about the warning

R INTERLUDE - DFA stickleback RNAseq


  • Now, let’s predict the DF scores for the original objects, based on the 100 genes.
  • What information does the predict() output contain?
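
A sketch of predicting the discriminant scores for the original objects (clay_lda as refit above):

    pred <- predict(clay_lda)
    names(pred)                            # "class", "posterior", and "x"
    head(pred$x)                           # discriminant-function scores per sample
    table(stickle$genotype, pred$class)    # observed vs. predicted group membership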

R INTERLUDE | Prediction using biomarker data

R INTERLUDE | Biomarker DFA

  • Here we have 200 patients and their prior histology-based diagnoses. There are three diagnoses (“benign,” “sarcoma 1,” and “sarcoma 2”). Also included are data from a panel of 6 qPCR-based biomarkers. This set of biomarkers is considered largely congruent with histology for cancer diagnosis, but more accurate. Specifically, the histology approach incorrectly leads to a sarcoma diagnosis (when the tumor is actually benign) about 2% of the time, whereas the biomarker approach has no such problem.

  • The goal here is to use DFA to separate sarcoma from benign diagnoses via the 6 biomarkers, but using the prior histology diagnosis as an informative grouping variable. This way we can evaluate whether any of the prior diagnoses may be erroneous.

R INTERLUDE | Biomarker DFA

  • Read in the data and specify our 6 response variables
  • Specify “diagnosis” as our categorical predictor variable
  • First, run a quick MANOVA on the data (as before) to test the hypothesis that prior diagnosis significantly affects where patients lie in multi-dimensional (biomarker) space…
  • Next, use the lda() function to find the discriminant functions.
  • Remember, the algorithm is deriving the “latent” variable(s) that best separate our samples (based on the prior diagnosis) in 6-dimensional space.
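
A sketch of these steps, assuming the data sit in a data frame called biomarkers with a diagnosis column and six marker columns (the data frame and marker names are placeholders):

    library(MASS)
    Y <- as.matrix(biomarkers[, c("marker1", "marker2", "marker3",
                                  "marker4", "marker5", "marker6")])
    fit <- manova(Y ~ diagnosis, data = biomarkers)
    summary(fit, test = "Pillai")    # does prior diagnosis structure biomarker space?
    diag_lda <- lda(diagnosis ~ marker1 + marker2 + marker3 +
                                marker4 + marker5 + marker6, data = biomarkers)
    diag_lda$scaling[, 1]            # weights of the first linear discriminant function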

R INTERLUDE | Biomarker DFA

  • Show results focused on first column of the LDA weightings
  • Why are there only 2 Linear Discriminant Variables?

R INTERLUDE | Biomarker DFA

  • Now, let’s predict the LD scores for the original objects (our patients)
  • What other information does diag_lda contain?

R INTERLUDE | Biomarker DFA

  • Now store the LD1 and LD2 scores (the last item in diag_lda) as a new data frame.
  • Make a histogram of the LD scores for the different diagnosis groups.
  • Do you notice anything strange about any of the distributions?
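
A sketch that stores the LD scores and draws per-group histograms; predict() and diag_lda come from the sketches above, and ldahist() is from MASS:

    pred   <- predict(diag_lda)                        # scores, classes, posteriors
    scores <- as.data.frame(pred$x)                    # LD1 and LD2 for each patient
    scores$diagnosis <- biomarkers$diagnosis
    ldahist(data = scores$LD1, g = scores$diagnosis)   # histograms of LD1 by diagnosis group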

R INTERLUDE | Biomarker DFA

  • Now make a bivariate plot for DF1 and DF2 with diagnosis groups labeled uniquely.
  • Labeling by group is up to you! (Hint: can set col and/or pch params using ifelse() functional statements).
  • Notice anything weird about any observations in the plot? What might they mean?
  • Let’s say you had a handful of new patients with no prior diagnosis and only the qPCR data from the six biomarkers.
  • Could re-running the LDA with those samples help you make the appropriate diagnoses?
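
A sketch of the bivariate plot with groups distinguished by colour and symbol via ifelse(), followed by scoring hypothetical new patients with predict() (new_patients is an assumed data frame holding only the six biomarker columns):

    plot(scores$LD1, scores$LD2, xlab = "LD1", ylab = "LD2",
         col = ifelse(scores$diagnosis == "benign", "black", "red"),
         pch = ifelse(scores$diagnosis == "sarcoma 1", 17,
                      ifelse(scores$diagnosis == "sarcoma 2", 15, 19)))
    # 'new_patients': biomarker values only, no prior diagnosis (hypothetical)
    predict(diag_lda, newdata = new_patients)$class    # predicted diagnoses for new patients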