\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Dimension reduction and PCA

Peter Ralph

Advanced Biological Statistics

Multivariate data

We need multivariate statistics when we’re dealing with lots (more than one?) variables.

Goals: for this and next week:

  1. Describe and visualize multivariate data.

  2. Distinguish groups using many variables (e.g., cases from controls).

Both things involve dimension reduction, because we see things in \({} \le 3\) dimensions, and categorization is low-dimensional.

The setting

We have observations of many variables:

\[ \mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) . \]

so

\[ x_{ij} = (\text{measurement of $j^\text{th}$ variable in $i^\text{th}$ subject}) \]

head(diabetes, 20)
##      y     age    sex     bmi     map      tc      ldl      hdl     tch     ltg     glu
## 1  151  0.0381  0.051  0.0617  0.0219 -0.0442 -3.5e-02 -0.04340 -0.0026  0.0199 -0.0176
## 2   75 -0.0019 -0.045 -0.0515 -0.0263 -0.0084 -1.9e-02  0.07441 -0.0395 -0.0683 -0.0922
## 3  141  0.0853  0.051  0.0445 -0.0057 -0.0456 -3.4e-02 -0.03236 -0.0026  0.0029 -0.0259
## 4  206 -0.0891 -0.045 -0.0116 -0.0367  0.0122  2.5e-02 -0.03604  0.0343  0.0227 -0.0094
## 5  135  0.0054 -0.045 -0.0364  0.0219  0.0039  1.6e-02  0.00814 -0.0026 -0.0320 -0.0466
## 6   97 -0.0927 -0.045 -0.0407 -0.0194 -0.0690 -7.9e-02  0.04128 -0.0764 -0.0412 -0.0963
## 7  138 -0.0455  0.051 -0.0472 -0.0160 -0.0401 -2.5e-02  0.00078 -0.0395 -0.0629 -0.0384
## 8   63  0.0635  0.051 -0.0019  0.0666  0.0906  1.1e-01  0.02287  0.0177 -0.0358  0.0031
## 9  110  0.0417  0.051  0.0617 -0.0401 -0.0140  6.2e-03 -0.02867 -0.0026 -0.0150  0.0113
## 10 310 -0.0709 -0.045  0.0391 -0.0332 -0.0126 -3.5e-02 -0.02499 -0.0026  0.0677 -0.0135
## 11 101 -0.0963 -0.045 -0.0838  0.0081 -0.1034 -9.1e-02 -0.01395 -0.0764 -0.0629 -0.0342
## 12  69  0.0272  0.051  0.0175 -0.0332 -0.0071  4.6e-02 -0.06549  0.0712 -0.0964 -0.0591
## 13 179  0.0163 -0.045 -0.0288 -0.0091 -0.0043 -9.8e-03  0.04496 -0.0395 -0.0308 -0.0425
## 14 185  0.0054  0.051 -0.0019  0.0081 -0.0043 -1.6e-02 -0.00290 -0.0026  0.0384 -0.0135
## 15 118  0.0453 -0.045 -0.0256 -0.0126  0.0177 -6.1e-05  0.08177 -0.0395 -0.0320 -0.0756
## 16 171 -0.0527  0.051 -0.0181  0.0804  0.0892  1.1e-01 -0.03972  0.1081  0.0361 -0.0425
## 17 166 -0.0055 -0.045  0.0423  0.0494  0.0246 -2.4e-02  0.07441 -0.0395  0.0523  0.0279
## 18 144  0.0708  0.051  0.0121  0.0563  0.0342  4.9e-02 -0.03972  0.0343  0.0274 -0.0011
## 19  97 -0.0382 -0.045 -0.0105 -0.0367 -0.0373 -1.9e-02 -0.02867 -0.0026 -0.0181 -0.0176
## 20 168 -0.0273 -0.045 -0.0181 -0.0401 -0.0029 -1.1e-02  0.03760 -0.0395 -0.0089 -0.0549
head(lizards, 20)
##     ID Hurricane   Origin    Sex SVL Femur Tibia Metatarsal LongestToe Humerus Radius Metacarpal LongestFinger FingerCount ToeCount MeanFingerArea MeanToeArea
## 1  537     After Pine Cay   Male  49  10.4  11.9        7.5        7.4     8.7    8.0        2.2           3.2          10       12           1.33         2.4
## 2  539     After Pine Cay Female  40   8.7   9.8        6.2        6.2     8.0    6.5        2.4           3.5          10       13           0.96         1.5
## 3  540     After Pine Cay   Male  58  12.9  14.8        9.4        9.6    11.7    9.5        3.5           5.1          14       15           2.63         4.1
## 4  541     After Pine Cay Female  43   8.6  10.3        6.6        6.3     7.4    6.6        2.8           3.5          11       12           1.18         1.9
## 5  542     After Pine Cay Female  46  10.3  11.0        6.9        7.0     7.7    7.2        2.5           3.4          11       13           1.38         2.5
## 6  543     After Pine Cay Female  47  10.0  10.8        6.8        7.2     8.4    7.2        2.4           3.3          12       14           1.43         2.0
## 7  544     After Pine Cay   Male  53  12.7  12.4        7.9        8.2     9.9    8.4        3.1           4.3          11       14           1.86         3.0
## 8  545     After Pine Cay   Male  57  11.9  12.9        8.2        8.0    10.3    8.8        3.2           4.2          12       12           2.56         4.0
## 9  546     After Pine Cay Female  43  10.0  11.1        6.9        6.7     7.8    7.1        2.8           3.4          12       13           1.08         1.8
## 10 547     After Pine Cay   Male  54  11.3  13.1        7.8        7.7    10.2    8.7        3.0           4.1          12       16           2.30         3.9
## 11 548     After Pine Cay Female  41   9.3  10.5        6.7        6.0     7.2    6.8        2.4           2.9          12       13           1.08         1.5
## 12 549     After Pine Cay Female  45   9.2  10.1        6.7        6.5     7.6    6.5        2.7           3.2          11       13           1.31         2.3
## 13 550     After Pine Cay   Male  61  13.9  14.9        8.8        8.4    11.6    9.5        3.0           4.4          13       15           2.47         4.3
## 14 553     After Pine Cay Female  47   9.4  10.6        6.7        6.3     7.8    7.2        2.3           3.3          11       14           1.37         1.9
## 15 554     After Pine Cay   Male  58  11.7  13.8        8.3        8.4    10.8    8.6        3.3           4.1          11       15           2.21         3.6
## 16 555     After Pine Cay   Male  49  10.8  12.2        7.8        7.2     9.0    7.8        2.7           3.6          12       14           1.37         2.2
## 17 558     After Pine Cay   Male  54  12.0  13.3        8.5        8.7     9.8    8.6        2.9           4.1          14       15           2.07         3.6
## 18 559     After Pine Cay   Male  53  11.8  13.0        7.8        7.2     9.6    8.5        2.8           3.8          13       15           1.99         2.9
## 19 560     After Pine Cay   Male  54  12.2  12.9        8.1        8.1     9.9    8.3        2.4           4.1          12       15           1.83         3.2
## 20 561     After Pine Cay Female  41   9.1   9.6        5.9        5.4     6.8    6.4        2.0           3.2          11       11           0.95         1.3

Principal component analysis (PCA)

Primary aims of PCA

  • Variable reduction - reduce to a small number of “composite” variables that adequately summarize the original information (dimension reduction).

  • Ordination - Reveal patterns in the data that could not be found by analyzing each variable separately.

Quick example: diabetes data

plot of chunk r prcomp_dia

Quick example: diabetes data

PC1 is correlated with outcome: plot of chunk r dia2

Quick example: lizards

plot of chunk r prcomp_cc

Quick example: lizards

PC1 is size: plot of chunk r cc2

PCA: how it works

Reducing dimesions

Say we’ve got a data matrix \(x_{ij}\) of \(n\) observations in \(m\) variables,

but we want to make do with fewer variables - say, only \(k\) of them.

Idea: Let’s pick a few linear combinations of the variables that best captures variability within the dataset. For instance, strongly correlated variables will be combined into one.

Notation:

  1. These new variables are the principal components, \[u^{(1)}, \ldots, u^{(k)} . \]

  2. The importance of each variable to the principal components - i.e., the coefficients of the linear combination - are the loadings, \[v^{(1)}, \ldots, v^{(k)} . \]

  3. So, the position of the \(i\)th data point on the \(\ell\)th PC is \[ u_i^{(\ell)} = v_1^{(\ell)} x_{i1} + v_2^{(\ell)} x_{i2} + \cdots v_m^{(\ell)} x_{im} . \]

Geometric interpretation

  1. The loadings are the directions in multidimensional space that explain most variation in the data.

  2. The PCs are the coordinates of the data along these directions.

  3. These directions are orthogonal, and the resulting variables are uncorrelated.

A model for PCA

The approximation to the data matrix using only \(k\) PCs is: \[ x_{ij} \approx u_i^{(1)} v_j^{(1)} + u_i^{(2)} v_j^{(2)} + \cdots + u_i^{(k)} v_j^{(k)} . \]

This is the best possible \(k\)-dimensional approximation, in the least-squares sense, so is the MLE for the low-dimensional model with Gaussian noise.

Also known as

PCA is also called “eigenanalysis” because (as usually computed), the PCs are eigenvectors of the covariance matrix.

The eigenvalues are \[ \lambda_\ell = \sum_{i=1}^n \left(u_i^{(\ell)}\right)^2, \] and they partition the total variance: \[ \lambda_1 + \lambda_2 + \cdots + \lambda_m = \sum_{ij} (x_{ij} - \bar x_j)^2 . \]

Interpretation

  • PCs: Observations that are close in PC space are similar.
  • Loadings: high values indicate a strong correlation with that PC (positive or negative).

Sometimes PCs are rotated to improve interpretability.

What next?

The PCs are nice new variables you can use in any analysis!

Example: Does mouse body size correlate with the top three climate PCs?

Beer

Example: local beer

beer <- read.csv("data/Beer_Specs.csv", stringsAsFactors=TRUE)
table(beer$Beer_Type)
## 
##           Ale    Citrus_IPA    Double_IPA        Helles        NW_IPA Oatmeal_Stout   Oktoberfest      Pale_Ale       Red_IPA Vanilla_Stout    Winter_Ale 
##            21            14            23            20            94            12             4             8            39            12            10
beer_vars <- c("Volume", "CO2", "Color", "DO", "pH", "Bitterness_Units", "ABV", "Real_Extract", "Real_Degrees_Fermentation", "Final_Gravity")
beer[,c("Beer_Type", beer_vars)]
##         Beer_Type Volume CO2 Color  DO  pH Bitterness_Units ABV Real_Extract Real_Degrees_Fermentation Final_Gravity
## 1             Ale    250 2.4   4.7  80 4.1               56 4.9          3.6                        68             1
## 2             Ale    178 2.4   4.7  86 4.2               53 5.0          3.4                        70             1
## 3             Ale     70 2.4   4.9  89 4.3               61 4.8          3.6                        68             1
## 4             Ale    102 2.4   5.0  89 4.3               58 4.7          3.6                        68             1
## 5             Ale    173 2.4   4.4  82 4.4               59 4.6          3.7                        67             1
## 6             Ale    254 2.4   4.8  94 4.4               57 4.8          3.5                        69             1
## 7             Ale    347 2.4    NA  93 4.2               58 4.7          3.6                        68             1
## 8             Ale    167 2.4   5.2  95 4.2               57 5.2          3.5                        70             1
## 9             Ale    175 2.4   4.8  98 4.2               54 4.8          3.6                        68             1
## 10            Ale    163 2.4   5.0  96 4.3               62 4.9          3.5                        69             1
## 11            Ale    250 2.4   4.7  91 4.3               64 4.7          4.0                        65             1
## 12            Ale    164 2.4   4.7  95 4.3               58 5.1          3.6                        70             1
## 13            Ale    166 2.5   4.5  95 4.2               54 5.0          3.3                        71             1
## 14            Ale    168 2.4   4.9  92 4.3               62 5.0          3.5                        70             1
## 15            Ale    328 2.4   4.3  94 4.1               54 4.9          3.5                        69             1
## 16            Ale    316 2.4   4.9  95 4.2               64 5.1          3.6                        70             1
## 17            Ale    333 2.4   4.4  96 4.2               52 5.0          3.4                        70             1
## 18            Ale    527 2.4   4.4  86 4.3               59 4.7          3.5                        68             1
## 19            Ale    168 2.4   4.4  90 4.2               55 4.9          3.6                        69             1
## 20            Ale    272 2.4   4.1  NA 4.4               55 4.6          3.8                        66             1
## 21            Ale    231 2.4   4.8  37 4.2               61 4.8          3.6                        68             1
## 22         Helles    183 2.6   2.7  79 4.3               29 5.4          3.6                        71             1
## 23         Helles    182 2.7   2.6  85 4.4               27 5.6          3.5                        73             1
## 24         Helles    187 2.6   2.7  94 4.2               26 5.5          3.5                        72             1
## 25         Helles    189 2.6   2.7  92 4.5               34 5.4          3.4                        72             1
## 26         Helles    274 2.6   2.5  98 4.3               31 5.4          3.3                        72             1
## 27         Helles    383 2.6   2.3 105 4.3               26 5.4          3.3                        72             1
## 28         Helles    377 2.6   2.5  96 4.3               23 5.4          3.3                        73             1
## 29         Helles     98 2.6   2.9  97 4.4               31 5.2          3.9                        68             1
## 30         Helles    374 2.7   2.4  92 4.3               24 5.4          3.4                        72             1
## 31         Helles    379 2.6   2.7  95 4.6               29 5.2          3.6                        70             1
## 32         Helles    375 2.6   2.3  94 4.4               26 5.4          3.3                        73             1
## 33         Helles    217 2.6   2.7  92 4.4               27 5.5          3.3                        73             1
## 34         Helles     75 2.7   2.7  94 4.4               25 5.4          3.5                        72             1
## 35         Helles    104 2.6   2.8  94 4.3               28 5.5          3.5                        72             1
## 36         Helles    120 2.6   2.7  97 4.3               19 5.5          3.3                        73             1
## 37         Helles    122 2.6   2.6  91 4.3               17 5.5          3.3                        73             1
## 38         Helles    180 2.6   2.9  85 4.4               22 5.5          3.4                        73             1
## 39         Helles    110 2.6   2.8  NA 4.6               27 5.5          3.4                        73             1
## 40         Helles     54 2.6   3.2  NA 4.5               25 5.5          3.4                        73             1
## 41         Helles    200 2.6   2.8  NA 4.5               26 5.5          3.4                        72             1
## 42         NW_IPA    468 2.4  10.4  80 4.3               74 6.9          5.9                        65             1
## 43         NW_IPA    471 2.4  10.4  80 4.3               72 6.9          6.0                        65             1
## 44         NW_IPA    433 2.4  11.2  76 4.4               74 7.0          5.6                        67             1
## 45         NW_IPA    436 2.4  11.3  80 4.4               73 7.0          5.7                        67             1
## 46         NW_IPA    376 2.4  13.0  76 4.4               78 6.7          5.9                        65             1
## 47         NW_IPA    396 2.4  10.8  76 4.4               82 7.1          5.9                        66             1
## 48         NW_IPA    471 2.4   9.8  77 4.4               87 6.9          5.7                        66             1
## 49         NW_IPA    399 2.4  10.7  81 4.4               85 7.0          5.6                        67             1
## 50         NW_IPA    463 2.4  10.1  87 4.4               80 6.8          5.6                        66             1
## 51         NW_IPA    469 2.4   9.7  94 4.5               73 6.7          5.7                        66             1
## 52         NW_IPA    476 2.4  10.2  92 4.5               74 6.5          6.0                        64             1
## 53         NW_IPA    407 2.4   9.5  83 4.5               85 6.6          5.8                        65             1
## 54         NW_IPA    392 2.4   9.3  85 4.5               72 6.6          5.8                        65             1
## 55         NW_IPA    403 2.4   9.5  91 4.4               79 6.6          5.8                        64             1
## 56         NW_IPA    458 2.4   9.9  84 4.4               78 6.8          5.5                        67             1
## 57         NW_IPA    476 2.4   9.4  84 4.5               81 6.5          5.8                        65             1
## 58         NW_IPA    293 2.4  10.0  94 4.5               78 6.7          5.9                        65             1
## 59         NW_IPA    468 2.4   8.8  93 4.5               79 6.8          5.7                        66             1
## 60         NW_IPA    473 2.4   8.9  96 4.5               83 6.7          5.8                        65             1
## 61         NW_IPA    466 2.4   9.0  92 4.5               72 6.7          5.7                        66             1
## 62         NW_IPA    467 2.4   9.8  95 4.4               82 6.6          5.7                        65             1
## 63         NW_IPA    442 2.4   9.2  87 4.4               74 6.8          5.6                        66             1
## 64         NW_IPA    386 2.4   9.1  85 4.3               83 6.7          5.7                        66             1
## 65         NW_IPA    376 2.4   8.9  98 4.4               85 6.6          5.7                        65             1
## 66         NW_IPA    394 2.4   8.2  96 4.4               81 6.7          5.6                        66             1
## 67         NW_IPA    401 2.4   8.9  98 4.4               81 6.7          5.6                        66             1
## 68         NW_IPA    454 2.4   9.5  95 4.5               82 6.7          5.7                        65             1
## 69         NW_IPA    453 2.4   9.3  91 4.4               86 6.7          5.7                        66             1
## 70         NW_IPA    389 2.4   9.7  98 4.4               83 6.8          5.8                        66             1
## 71         NW_IPA    431 2.4  10.0  97 4.3               83 6.8          5.8                        66             1
## 72         NW_IPA    469 2.4   9.7  98 4.4               81 6.8          5.9                        65             1
## 73         NW_IPA    458 2.4   9.7  91 4.3               85 6.9          5.7                        66             1
## 74         NW_IPA    459 2.4   9.7  96 4.5               88 6.7          5.8                        65             1
## 75         NW_IPA    452 2.4   9.6  96 4.4               83 7.0          5.5                        67             1
## 76         NW_IPA    469 2.4   9.8  96 4.4               88 7.0          5.5                        67             1
## 77         NW_IPA    467 2.4   9.5  91 4.4               92 7.0          5.6                        67             1
## 78         NW_IPA    464 2.4   9.2  91 4.4               79 6.8          5.7                        66             1
## 79         NW_IPA    454 2.4   9.6  90 4.4               88 6.9          5.5                        67             1
## 80         NW_IPA    450 2.4  10.4  93 4.4               93 7.1          5.5                        67             1
## 81         NW_IPA    463 2.4  10.0  92 4.4               79 7.0          5.4                        68             1
## 82         NW_IPA    299 2.4   9.9  92 4.4               91 6.3          6.3                        62             1
## 83         NW_IPA    312 2.4   9.9  94 4.3               84 6.9          5.5                        67             1
## 84         NW_IPA     94 2.4   9.7  98 4.5               94 6.5          6.2                        63             1
## 85         NW_IPA    464 2.4   9.7  90 4.4               72 6.9          5.7                        66             1
## 86         NW_IPA    392 2.4  10.1  92 4.4               94 7.3          5.3                        67             1
## 87         NW_IPA    442 2.4   9.6  94 4.4               84 6.9          5.7                        66             1
## 88         NW_IPA    449 2.4  10.1  98 4.3               70 6.9          5.8                        66             1
## 89         NW_IPA    229 2.4   9.9  94 4.4               79 6.9          5.7                        66             1
## 90         NW_IPA    377 2.4  10.2  91 4.4               84 6.9          5.6                        67             1
## 91         NW_IPA    309 2.4   9.2  93 4.4               79 6.9          5.5                        67             1
## 92         NW_IPA    393 2.5  10.1  92 4.5               84 6.7          5.9                        65             1
## 93         NW_IPA    315 2.4   9.8  95 4.5               80 7.0          5.6                        67             1
## 94         NW_IPA    469 2.4   9.9  98 4.4               99 6.8          5.8                        65             1
## 95         NW_IPA    458 2.4  10.2  96 4.5               91 7.1          5.3                        68             1
## 96         NW_IPA    305 2.4  10.1  99 4.3               85 6.8          5.9                        65             1
## 97         NW_IPA    461 2.4   9.8 102 4.2               72 6.9          5.8                        66             1
## 98         NW_IPA    394 2.4   9.3 108 4.2               83 7.1          5.3                        68             1
## 99         NW_IPA    226 2.4   9.7  87 4.2               82 6.5          6.3                        62             1
## 100        NW_IPA    473 2.4  10.1 102 4.3               86 7.0          5.5                        67             1
## 101        NW_IPA    460 2.4  10.0  95 4.2               87 7.0          5.5                        67             1
## 102        NW_IPA    310 2.4  10.7  98 4.3               82 6.8          5.5                        67             1
## 103        NW_IPA     67 2.4  10.0  97 4.2               83 6.7          5.8                        65             1
## 104        NW_IPA    457 2.4  10.6  98 4.4               86 6.8          5.9                        65             1
## 105        NW_IPA    480 2.4  10.3 110 4.4               87 6.9          5.7                        66             1
## 106        NW_IPA    321 2.4  10.4  98 4.4               78 6.7          5.9                        65             1
## 107        NW_IPA     71 2.4  10.4  95 4.3               68 7.0          6.2                        64             1
## 108        NW_IPA    471 2.4   9.8  99 4.4               78 7.0          5.5                        67             1
## 109        NW_IPA    467 2.4   9.9 102 4.4               82 6.9          5.7                        66             1
## 110        NW_IPA    481 2.4  10.2  98 4.3               74 7.1          5.7                        67             1
## 111        NW_IPA    460 2.4  10.1 102 4.3               76 7.2          5.3                        68             1
## 112        NW_IPA    318 2.4  10.2  94 4.3               75 7.0          5.2                        68             1
## 113        NW_IPA    239 2.4  12.1  95 4.4               76 6.9          5.3                        68             1
## 114        NW_IPA    482 2.4  10.2 102 4.4               83 6.7          5.6                        66             1
## 115        NW_IPA    477 2.4  10.3 101 4.3               74 6.8          5.2                        68             1
## 116        NW_IPA     75 2.4  10.3  91 4.3               69 6.8          5.8                        65             1
## 117        NW_IPA    474 2.4  10.0  98 4.4               84 6.8          5.4                        67             1
## 118        NW_IPA    470 2.4   9.5 105 4.4               66 6.8          5.3                        68             1
## 119        NW_IPA    320 2.4  10.2  98 4.4               73 6.8          5.1                        68             1
## 120        NW_IPA    309 2.4   9.6  82 4.4               71 6.9          5.3                        68             1
## 121        NW_IPA    480 2.4   9.4  86 4.4               72 7.0          5.2                        68             1
## 122        NW_IPA    396 2.4   9.8  83 4.5               77 7.2          5.2                        69             1
## 123        NW_IPA    319 2.4  10.7  89 4.6               72 6.4          5.9                        64             1
## 124        NW_IPA    312 2.4  11.1  84 4.5               77 6.2          6.5                        60             1
## 125        NW_IPA    492 2.4   9.5  89 4.5               71 6.8          5.4                        67             1
## 126        NW_IPA    483 2.5   9.7  NA 4.5               76 6.6          5.8                        64             1
## 127        NW_IPA    452 2.4  10.3  NA 4.5               79 6.9          5.3                        68             1
## 128        NW_IPA    458 2.4  10.9  NA 4.5               76 7.0          5.3                        68             1
## 129        NW_IPA    440 2.4  14.5  NA 4.5               73 7.0          5.4                        68             1
## 130        NW_IPA    464 2.4  10.1  12 4.4               75 7.1          5.2                        69             1
## 131        NW_IPA    463 2.4  10.3  NA 4.4               79 6.9          5.2                        68             1
## 132        NW_IPA    471 2.5   9.6  NA 4.4               74 7.0          5.1                        69             1
## 133        NW_IPA    473 2.5  10.0  NA 4.4               74 7.0          5.0                        69             1
## 134        NW_IPA    471 2.4   9.5  NA 4.4               74 7.2          5.0                        70             1
## 135        NW_IPA    480 2.4   9.8  NA 4.5               81 7.0          5.1                        69             1
## 136       Red_IPA    328 2.5  13.6  74 4.4               75 7.2          5.8                        66             1
## 137       Red_IPA    104 2.5  13.6  81 4.4               71 7.4          5.8                        67             1
## 138       Red_IPA    495 2.4  12.0  82 4.5               65 7.4          5.6                        68             1
## 139       Red_IPA    177 2.4  11.9  80 4.5               69 7.3          5.6                        68             1
## 140       Red_IPA    255 2.5  12.3  94 4.5               65 7.0          5.5                        67             1
## 141       Red_IPA    343 2.4  11.4  87 4.3               63 7.0          5.6                        67             1
## 142       Red_IPA    324 2.4  12.0  92 4.4               65 6.9          5.9                        65             1
## 143       Red_IPA    332 2.4  11.6  91 4.4               65 7.1          5.7                        67             1
## 144       Red_IPA    339 2.4  11.7  93 4.5               65 6.9          5.6                        67             1
## 145       Red_IPA    155 2.4  11.6  95 4.5               66 6.9          5.7                        66             1
## 146       Red_IPA    160 2.4  12.5  96 4.4               72 7.0          5.9                        66             1
## 147       Red_IPA    421 2.4  11.0  92 4.5               61 7.0          6.0                        65             1
## 148       Red_IPA    330 2.4  12.2  98 4.3               69 7.0          5.7                        66             1
## 149       Red_IPA    476 2.4  12.9  93 4.5               69 6.9          6.0                        65             1
## 150       Red_IPA    384 2.4  11.3  98 4.3               73 7.3          5.4                        69             1
## 151       Red_IPA    425 2.4  11.8  92 4.5               69 7.0          5.7                        66             1
## 152       Red_IPA    489 2.4  11.4  95 4.4               63 7.2          5.6                        67             1
## 153       Red_IPA    337 2.4  11.8  91 4.5               63 7.3          5.5                        68             1
## 154       Red_IPA    489 2.5  12.0  93 4.5               65 6.7          6.1                        64             1
## 155       Red_IPA    424 2.5  10.4  90 4.2               61 7.3          5.3                        69             1
## 156       Red_IPA    235 2.4  11.4  87 4.5               60 7.1          5.8                        66             1
## 157       Red_IPA    422 2.4  12.1  98 4.5               75 7.2          5.6                        68             1
## 158       Red_IPA    248 2.4  11.6  95 4.4               62 7.2          5.5                        68             1
## 159       Red_IPA    350 2.4  11.5  95 4.5               67 7.3          5.6                        68             1
## 160       Red_IPA     76 2.4  12.3  99 4.4               50 7.6          5.8                        68             1
## 161       Red_IPA    421 2.4  10.9  95 4.3               61 6.9          5.3                        68             1
## 162       Red_IPA    178 2.4  15.3  73 4.5               54 7.1          5.6                        67             1
## 163       Red_IPA    164 2.4  12.1  75 4.4               61 6.1          6.9                        58             1
## 164       Red_IPA    240 2.4  10.9  89 4.4               57 6.8          5.7                        66             1
## 165       Red_IPA    303 2.4  11.9  96 4.5               61 6.9          6.0                        65             1
## 166       Red_IPA    256 2.4  11.6  90 4.3               54 7.4          4.0                        70             1
## 167       Red_IPA    439 2.4  11.7  91 4.5               58 6.4          6.4                        61             1
## 168       Red_IPA    173 2.4  13.3  89 4.5               58 6.2          6.9                        59             1
## 169       Red_IPA    121 2.4  11.5  NA 4.5               60 6.8          6.1                        64             1
## 170       Red_IPA     57 2.4  11.4  NA 4.5               60 6.8          6.2                        64             1
## 171       Red_IPA     NA  NA  13.4  NA  NA               63 6.8          5.8                        65             1
## 172       Red_IPA     NA  NA  13.6  NA  NA               63 6.8          5.8                        65             1
## 173       Red_IPA    357 2.4  12.9   7 4.5               61 6.8          6.0                        65             1
## 174       Red_IPA    473 2.5  12.8  NA 4.4               59 7.2          5.3                        68             1
## 175    Citrus_IPA    449 2.4   7.5  96 4.4               86 7.1          3.7                        70             1
## 176    Citrus_IPA    437 2.4   7.2  92 4.6               90 7.3          5.6                        68             1
## 177    Citrus_IPA    428 2.4   7.2  95 4.5               82 7.2          5.8                        67             1
## 178    Citrus_IPA    305 2.4   7.0  93 4.5               82 7.3          5.5                        68             1
## 179    Citrus_IPA    297 2.4   7.1  88 4.5               70 7.2          6.0                        66             1
## 180    Citrus_IPA    155 2.4   6.8  97 4.5               80 6.6          6.3                        63             1
## 181    Citrus_IPA    149 2.4   7.0  90 4.5               78 6.8          6.0                        65             1
## 182    Citrus_IPA    312 2.4   6.9  94 4.4               72 7.1          5.7                        67             1
## 183    Citrus_IPA    482 2.5   7.0  95 4.5               77 7.0          5.7                        66             1
## 184    Citrus_IPA    294 2.4   7.0  91 4.5               78 7.0          5.8                        66             1
## 185    Citrus_IPA    294 2.4   7.0  91 4.5               73 7.2          5.4                        68             1
## 186    Citrus_IPA    314 2.4   6.7  92 4.5               72 7.2          5.4                        69             1
## 187    Citrus_IPA    316 2.4   6.7 113 4.3               85 7.0          5.5                        68             1
## 188    Citrus_IPA    155 2.4   7.8  94 4.5               87 7.2          5.8                        67             1
## 189 Oatmeal_Stout    208 2.4 109.7  80 4.3               57 7.4          8.1                        60             1
## 190 Oatmeal_Stout    234 2.4 112.0  74 4.4               49 7.1          7.8                        59             1
## 191 Oatmeal_Stout    232 2.4 121.5  85 4.4               46 6.7          7.5                        58             1
## 192 Oatmeal_Stout    316 2.4 125.9  94 4.2               47 6.5          7.8                        57             1
## 193 Oatmeal_Stout    138 2.4 123.4  85 4.3               44 6.4          7.8                        57             1
## 194 Oatmeal_Stout    213 2.4 116.8  98 4.4               43 6.8          7.7                        58             1
## 195 Oatmeal_Stout     NA  NA 131.3  NA 4.3               41 6.8          7.7                        58             1
## 196 Oatmeal_Stout    166 2.4 137.9  93 4.2               53 6.7          8.1                        57             1
## 197 Oatmeal_Stout    236 2.4  96.5  91 4.4               55 7.2          7.2                        62             1
## 198 Oatmeal_Stout    165 2.4 115.6  98 4.3               55 6.5          7.8                        57             1
## 199 Oatmeal_Stout    235 2.4 114.8  87 4.3               53 7.3          7.4                        61             1
## 200 Oatmeal_Stout    218 2.4 118.1  91 4.4               53 6.6          7.9                        57             1
## 201   Oktoberfest    276 2.6   6.0  98 4.6               39 6.2          4.4                        70             1
## 202   Oktoberfest    407 2.6   6.9  90 4.8               39 6.1          4.7                        68             1
## 203   Oktoberfest    454 2.6   6.4  77 4.7               37 6.1          4.6                        68             1
## 204   Oktoberfest    130 2.6   6.7  98 4.8               42 6.0          4.6                        68             1
## 205      Pale_Ale     56 2.4  13.2  80 4.2               42 5.7          5.4                        63             1
## 206      Pale_Ale     53 2.4  12.8  90 4.3               53 5.2          5.5                        60             1
## 207      Pale_Ale     55 2.4  11.7  93 4.2               49 5.3          5.5                        61             1
## 208      Pale_Ale     58 2.4  11.1  91 4.2               51 5.2          5.6                        60             1
## 209      Pale_Ale     47 2.4  13.0  95 4.4               56 5.9          6.0                        61             1
## 210      Pale_Ale     48 2.4  13.2  93 4.2               49 5.7          5.6                        62             1
## 211      Pale_Ale     44 2.4  13.0  95 4.1               47 5.5          5.3                        62             1
## 212      Pale_Ale     52 2.4  13.1  NA 4.3               49 5.6          5.3                        63             1
## 213    Winter_Ale    303 2.4  34.9  77 4.5               56 7.2          6.0                        66             1
## 214    Winter_Ale    499 2.4  36.1  79 4.5               57 7.0          6.0                        65             1
## 215    Winter_Ale    435 2.4  35.8  93 4.6               52 7.1          5.9                        66             1
## 216    Winter_Ale    436 2.4  36.6  NA 4.6               52 7.1          6.0                        66             1
## 217    Winter_Ale     48 2.4  36.7  NA 4.5               49 7.1          5.9                        66             1
## 218    Winter_Ale    133 2.4  36.5   4 4.6               56 7.1          6.1                        65             1
## 219    Winter_Ale    128 2.4  36.6   7 4.6               57 7.2          6.1                        65             1
## 220    Winter_Ale    240 2.4  36.0  NA 4.7               58 7.2          5.9                        66             1
## 221    Winter_Ale     82 2.4  36.0  NA 4.7               58 7.2          5.9                        66             1
## 222    Winter_Ale    249 2.5  37.5  NA 4.5               60 7.3          5.7                        67             1
## 223    Double_IPA    155 2.4  12.9  75 4.4               79 8.2          7.2                        65             1
## 224    Double_IPA    156 2.4  13.1  79 4.5               84 8.5          6.8                        67             1
## 225    Double_IPA    156 2.4  12.2  88 4.5               81 8.1          7.2                        65             1
## 226    Double_IPA    251 2.4  11.3  90 4.6               78 7.8          7.3                        63             1
## 227    Double_IPA    152 2.4  11.7  86 4.6               77 7.8          7.4                        63             1
## 228    Double_IPA    238 2.4  11.3  92 4.5               77 8.1          6.9                        65             1
## 229    Double_IPA    148 2.4  10.8  92 4.5               78 8.2          7.0                        65             1
## 230    Double_IPA    318 2.4  10.3  99 4.7               77 8.2          6.9                        66             1
## 231    Double_IPA    238 2.4  11.6  94 4.5               85 8.3          7.0                        66             1
## 232    Double_IPA    149 2.4  12.8  94 4.5               81 8.3          7.0                        65             1
## 233    Double_IPA    155 2.4  15.2  97 4.5               91 8.6          6.2                        69             1
## 234    Double_IPA    156 2.4  12.0  84 4.6               86 8.2          7.3                        64             1
## 235    Double_IPA    237 2.4  12.0  93 4.5               84 7.7          7.7                        61             1
## 236    Double_IPA     72 2.4  13.7  98 4.4               87 8.9          6.0                        71             1
## 237    Double_IPA    283 2.4  12.0  95 4.5               85 8.8          6.4                        69             1
## 238    Double_IPA    229 2.4  10.8  91 4.5               81 7.7          7.5                        62             1
## 239    Double_IPA    236 2.4  10.9  93 4.5               75 7.7          7.6                        62             1
## 240    Double_IPA    316 2.4  12.2  89 4.4               75 8.2          6.8                        66             1
## 241    Double_IPA    394 2.4  11.8  88 4.4               63 8.2          6.7                        67             1
## 242    Double_IPA    365 2.4  14.0  78 4.5               70 8.0          7.2                        64             1
## 243    Double_IPA    464 2.4  12.2  91 4.7               72 8.2          6.8                        66             1
## 244    Double_IPA    391 2.4  12.1  NA 4.6               63 7.7          7.1                        64             1
## 245    Double_IPA    328 2.5  11.8  NA 4.5               76 7.9          6.9                        65             1
## 246 Vanilla_Stout    128 2.4 116.0  84 4.4               48 7.1          7.8                        59             1
## 247 Vanilla_Stout    117 2.4 116.8  90 4.4               46 6.7          7.8                        58             1
## 248 Vanilla_Stout    125 2.5 131.2  95 4.2               46 6.5          7.8                        57             1
## 249 Vanilla_Stout    100 2.4 114.8  95 4.3               53 7.3          7.4                        61             1
## 250 Vanilla_Stout     NA  NA 116.0  NA 4.3               50 7.9          6.9                        65             1
## 251 Vanilla_Stout    224 2.4 113.7  98 4.3               46 7.9          6.9                        65             1
## 252 Vanilla_Stout    230 2.4 120.0  83 4.4               37 8.0          7.1                        65             1
## 253 Vanilla_Stout    180 2.4 112.0  NA 4.6               47 6.5          8.0                        57             1
## 254 Vanilla_Stout    305 2.5 104.6  NA 4.6               46 6.8          7.7                        58             1
## 255 Vanilla_Stout    107 2.4 111.9  NA 4.4               45 7.2          6.7                        63             1
## 256 Vanilla_Stout    120 2.4 109.6  NA 4.4               46 7.1          6.8                        63             1
## 257 Vanilla_Stout    120 2.4 110.6  NA 4.4               46 7.1          6.8                        63             1

Goals:

  1. Describe major axes of variation between beer batches.
  2. Look for variation not related to taste/style.
  3. Make a pretty plot.

plot of chunk r beerpairs

PCA: how to do it

Practical considerations

  1. Only use numeric variables - omit factors.
  2. Variables should all be on the same scale, at least roughly.
  3. Variables should also probably be centered.

… however,

  1. Beware outliers.
  2. Transformations may be a good idea.
  3. Missing data must be removed or imputed.
  4. Consider replacing highly skewed variables with ranks.

?princomp

princomp                 package:stats                 R Documentation

Principal Components Analysis

Description:

     ‘princomp’ performs a principal components analysis on the given
     numeric data matrix and returns the results as an object of class
     ‘princomp’.

Usage:

     princomp(x, ...)
     
     ## S3 method for class 'formula'
     princomp(formula, data = NULL, subset, na.action, ...)
     
     ## Default S3 method:
     princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
              subset = rep_len(TRUE, nrow(as.matrix(x))), fix_sign = TRUE, ...)
     
     ## S3 method for class 'princomp'
     predict(object, newdata, ...)
     
Arguments:

 formula: a formula with no response variable, referring only to
          numeric variables.

    data: an optional data frame (or similar: see ‘model.frame’)
          containing the variables in the formula ‘formula’.  By
          default the variables are taken from ‘environment(formula)’.

  subset: an optional vector used to select rows (observations) of the
          data matrix ‘x’.

na.action: a function which indicates what should happen when the data
          contain ‘NA’s.  The default is set by the ‘na.action’ setting
          of ‘options’, and is ‘na.fail’ if that is unset. The
          ‘factory-fresh’ default is ‘na.omit’.

       x: a numeric matrix or data frame which provides the data for
          the principal components analysis.

     cor: a logical value indicating whether the calculation should use
          the correlation matrix or the covariance matrix.  (The
          correlation matrix can only be used if there are no constant
          variables.)

  scores: a logical value indicating whether the score on each
          principal component should be calculated.

  covmat: a covariance matrix, or a covariance list as returned by
          ‘cov.wt’ (and ‘cov.mve’ or ‘cov.mcd’ from package ‘MASS’).
          If supplied, this is used rather than the covariance matrix
          of ‘x’.

fix_sign: Should the signs of the loadings and scores be chosen so that
          the first element of each loading is non-negative?

     ...: arguments passed to or from other methods. If ‘x’ is a
          formula one might specify ‘cor’ or ‘scores’.

  object: Object of class inheriting from ‘"princomp"’.

 newdata: An optional data frame or matrix in which to look for
          variables with which to predict.  If omitted, the scores are
          used.  If the original fit used a formula or a data frame or
          a matrix with column names, ‘newdata’ must contain columns
          with the same names. Otherwise it must contain the same
          number of columns, to be used in the same order.

Details:

     ‘princomp’ is a generic function with ‘"formula"’ and ‘"default"’
     methods.

     The calculation is done using ‘eigen’ on the correlation or
     covariance matrix, as determined by ‘cor’.  This is done for
     compatibility with the S-PLUS result.  A preferred method of
     calculation is to use ‘svd’ on ‘x’, as is done in ‘prcomp’.

     Note that the default calculation uses divisor ‘N’ for the
     covariance matrix.

     The ‘print’ method for these objects prints the results in a nice
     format and the ‘plot’ method produces a scree plot (‘screeplot’).
     There is also a ‘biplot’ method.

     If ‘x’ is a formula then the standard NA-handling is applied to
     the scores (if requested): see ‘napredict’.

     ‘princomp’ only handles so-called R-mode PCA, that is feature
     extraction of variables.  If a data matrix is supplied (possibly
     via a formula) it is required that there are at least as many
     units as variables.  For Q-mode PCA use ‘prcomp’.

Value:

     ‘princomp’ returns a list with class ‘"princomp"’ containing the
     following components:

    sdev: the standard deviations of the principal components.

loadings: the matrix of variable loadings (i.e., a matrix whose columns
          contain the eigenvectors).  This is of class ‘"loadings"’:
          see ‘loadings’ for its ‘print’ method.

  center: the means that were subtracted.

   scale: the scalings applied to each variable.

   n.obs: the number of observations.

  scores: if ‘scores = TRUE’, the scores of the supplied data on the
          principal components.  These are non-null only if ‘x’ was
          supplied, and if ‘covmat’ was also supplied if it was a
          covariance list.  For the formula method, ‘napredict()’ is
          applied to handle the treatment of values omitted by the
          ‘na.action’.

    call: the matched call.

na.action: If relevant.

Note:

     The signs of the columns of the loadings and scores are arbitrary,
     and so may differ between different programs for PCA, and even
     between different builds of R: ‘fix_sign = TRUE’ alleviates that.

References:

     Mardia, K. V., J. T. Kent and J. M. Bibby (1979).  _Multivariate
     Analysis_, London: Academic Press.

     Venables, W. N. and B. D. Ripley (2002).  _Modern Applied
     Statistics with S_, Springer-Verlag.

See Also:

     ‘summary.princomp’, ‘screeplot’, ‘biplot.princomp’, ‘prcomp’,
     ‘cor’, ‘cov’, ‘eigen’.

beer_pc <- princomp(na.omit(beer[, beer_vars]), cor=TRUE, scores=TRUE)
str(beer_pc)
## List of 7
##  $ sdev    : Named num [1:10] 2.05 1.441 1.118 0.965 0.825 ...
##   ..- attr(*, "names")= chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ loadings: 'loadings' num [1:10, 1:10] 0.0194 -0.3338 0.2763 -0.0465 0.1402 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "Volume" "CO2" "Color" "DO" ...
##   .. ..$ : chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ center  : Named num [1:10] 306.47 2.43 18.12 89.8 4.4 ...
##   ..- attr(*, "names")= chr [1:10] "Volume" "CO2" "Color" "DO" ...
##  $ scale   : Named num [1:10] 134.589 0.0627 29.1922 13.5223 0.1113 ...
##   ..- attr(*, "names")= chr [1:10] "Volume" "CO2" "Color" "DO" ...
##  $ n.obs   : int 223
##  $ scores  : num [1:223, 1:10] -2.98 -3.35 -2.84 -2.77 -2.36 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:223] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ call    : language princomp(x = na.omit(beer[, beer_vars]), cor = TRUE, scores = TRUE)
##  - attr(*, "class")= chr "princomp"

What it means

> str(beer_pc)
List of 7
  • $ sdev: standard deviations of the PCs
  • $ loadings: loadings of the variables on each PC
  • $ center: subtracted from columns of the data
  • $ scale: divided from columns of the data
  • $ scores: the actual PCs

How many PCs should I use?

  • PCA gives you as many PCs as variables.

Depends - what for?

Some answers:

  1. Visualization: as many as make sense (but beware overinterpretation).
  2. Further analysis: as many as your method can handle.
  3. Until an obvious break in the scree plot.

The scree plot

Shows the variance explained by each PC (these are always decreasing).

plot(beer_pc)

plot of chunk r scree

The top four PCs (plot by biplot( ))

plot of chunk r plot_prcomp_beer

plot of chunk r plot_b1

plot of chunk r plot_b2

Gotchas

  • PCs are only well-defined up to sign (\(\pm\)) and scale.
  • Some programs report PCs normalized to have SD 1.
  • Many ways to do basically the same thing: e.g., prcomp and princomp.

Your turn: wine

Using this dataset of chemical concentrations in wine:

(wine <- read.table("data/wine.tsv", header=TRUE))
##     Vineyard C1   C2  C3 C4  C5   C6   C7   C8   C9  C10  C11 C12  C13
## 1          1 14 1.71 2.4 16 127 2.80 3.06 0.28 2.29  5.6 1.04 3.9 1065
## 2          1 13 1.78 2.1 11 100 2.65 2.76 0.26 1.28  4.4 1.05 3.4 1050
## 3          1 13 2.36 2.7 19 101 2.80 3.24 0.30 2.81  5.7 1.03 3.2 1185
## 4          1 14 1.95 2.5 17 113 3.85 3.49 0.24 2.18  7.8 0.86 3.5 1480
## 5          1 13 2.59 2.9 21 118 2.80 2.69 0.39 1.82  4.3 1.04 2.9  735
## 6          1 14 1.76 2.5 15 112 3.27 3.39 0.34 1.97  6.8 1.05 2.9 1450
## 7          1 14 1.87 2.5 15  96 2.50 2.52 0.30 1.98  5.2 1.02 3.6 1290
## 8          1 14 2.15 2.6 18 121 2.60 2.51 0.31 1.25  5.0 1.06 3.6 1295
## 9          1 15 1.64 2.2 14  97 2.80 2.98 0.29 1.98  5.2 1.08 2.9 1045
## 10         1 14 1.35 2.3 16  98 2.98 3.15 0.22 1.85  7.2 1.01 3.5 1045
## 11         1 14 2.16 2.3 18 105 2.95 3.32 0.22 2.38  5.8 1.25 3.2 1510
## 12         1 14 1.48 2.3 17  95 2.20 2.43 0.26 1.57  5.0 1.17 2.8 1280
## 13         1 14 1.73 2.4 16  89 2.60 2.76 0.29 1.81  5.6 1.15 2.9 1320
## 14         1 15 1.73 2.4 11  91 3.10 3.69 0.43 2.81  5.4 1.25 2.7 1150
## 15         1 14 1.87 2.4 12 102 3.30 3.64 0.29 2.96  7.5 1.20 3.0 1547
## 16         1 14 1.81 2.7 17 112 2.85 2.91 0.30 1.46  7.3 1.28 2.9 1310
## 17         1 14 1.92 2.7 20 120 2.80 3.14 0.33 1.97  6.2 1.07 2.6 1280
## 18         1 14 1.57 2.6 20 115 2.95 3.40 0.40 1.72  6.6 1.13 2.6 1130
## 19         1 14 1.59 2.5 16 108 3.30 3.93 0.32 1.86  8.7 1.23 2.8 1680
## 20         1 14 3.10 2.6 15 116 2.70 3.03 0.17 1.66  5.1 0.96 3.4  845
## 21         1 14 1.63 2.3 16 126 3.00 3.17 0.24 2.10  5.7 1.09 3.7  780
## 22         1 13 3.80 2.6 19 102 2.41 2.41 0.25 1.98  4.5 1.03 3.5  770
## 23         1 14 1.86 2.4 17 101 2.61 2.88 0.27 1.69  3.8 1.11 4.0 1035
## 24         1 13 1.60 2.5 18  95 2.48 2.37 0.26 1.46  3.9 1.09 3.6 1015
## 25         1 14 1.81 2.6 20  96 2.53 2.61 0.28 1.66  3.5 1.12 3.8  845
## 26         1 13 2.05 3.2 25 124 2.63 2.68 0.47 1.92  3.6 1.13 3.2  830
## 27         1 13 1.77 2.6 16  93 2.85 2.94 0.34 1.45  4.8 0.92 3.2 1195
## 28         1 13 1.72 2.1 17  94 2.40 2.19 0.27 1.35  4.0 1.02 2.8 1285
## 29         1 14 1.90 2.8 19 107 2.95 2.97 0.37 1.76  4.5 1.25 3.4  915
## 30         1 14 1.68 2.2 16  96 2.65 2.33 0.26 1.98  4.7 1.04 3.6 1035
## 31         1 14 1.50 2.7 22 101 3.00 3.25 0.29 2.38  5.7 1.19 2.7 1285
## 32         1 14 1.66 2.4 19 106 2.86 3.19 0.22 1.95  6.9 1.09 2.9 1515
## 33         1 14 1.83 2.4 17 104 2.42 2.69 0.42 1.97  3.8 1.23 2.9  990
## 34         1 14 1.53 2.7 20 132 2.95 2.74 0.50 1.35  5.4 1.25 3.0 1235
## 35         1 14 1.80 2.6 19 110 2.35 2.53 0.29 1.54  4.2 1.10 2.9 1095
## 36         1 13 1.81 2.4 20 100 2.70 2.98 0.26 1.86  5.1 1.04 3.5  920
## 37         1 13 1.64 2.8 16 110 2.60 2.68 0.34 1.36  4.6 1.09 2.8  880
## 38         1 13 1.65 2.5 18  98 2.45 2.43 0.29 1.44  4.2 1.12 2.5 1105
## 39         1 13 1.50 2.1 16  98 2.40 2.64 0.28 1.37  3.7 1.18 2.7 1020
## 40         1 14 3.99 2.5 13 128 3.00 3.04 0.20 2.08  5.1 0.89 3.5  760
## 41         1 14 1.71 2.3 16 117 3.15 3.29 0.34 2.34  6.1 0.95 3.4  795
## 42         1 13 3.84 2.1 19  90 2.45 2.68 0.27 1.48  4.3 0.91 3.0 1035
## 43         1 14 1.89 2.6 15 101 3.25 3.56 0.17 1.70  5.4 0.88 3.6 1095
## 44         1 13 3.98 2.3 18 103 2.64 2.63 0.32 1.66  4.4 0.82 3.0  680
## 45         1 13 1.77 2.1 17 107 3.00 3.00 0.28 2.03  5.0 0.88 3.4  885
## 46         1 14 4.04 2.4 19 111 2.85 2.65 0.30 1.25  5.2 0.87 3.3 1080
## 47         1 14 3.59 2.3 16 102 3.25 3.17 0.27 2.19  4.9 1.04 3.4 1065
## 48         1 14 1.68 2.1 16 101 3.10 3.39 0.21 2.14  6.1 0.91 3.3  985
## 49         1 14 2.02 2.4 19 103 2.75 2.92 0.32 2.38  6.2 1.07 2.8 1060
## 50         1 14 1.73 2.3 17 108 2.88 3.54 0.32 2.08  8.9 1.12 3.1 1260
## 51         1 13 1.73 2.0 12  92 2.72 3.27 0.17 2.91  7.2 1.12 2.9 1150
## 52         1 14 1.65 2.6 17  94 2.45 2.99 0.22 2.29  5.6 1.24 3.4 1265
## 53         1 14 1.75 2.4 14 111 3.88 3.74 0.32 1.87  7.0 1.01 3.3 1190
## 54         1 14 1.90 2.7 17 115 3.00 2.79 0.39 1.68  6.3 1.13 2.9 1375
## 55         1 14 1.67 2.2 16 118 2.60 2.90 0.21 1.62  5.8 0.92 3.2 1060
## 56         1 14 1.73 2.5 20 116 2.96 2.78 0.20 2.45  6.2 0.98 3.0 1120
## 57         1 14 1.70 2.3 16 118 3.20 3.00 0.26 2.03  6.4 0.94 3.3  970
## 58         1 13 1.97 2.7 17 102 3.00 3.23 0.31 1.66  6.0 1.07 2.8 1270
## 59         1 14 1.43 2.5 17 108 3.40 3.67 0.19 2.04  6.8 0.89 2.9 1285
## 60         2 12 0.94 1.4 11  88 1.98 0.57 0.28 0.42  1.9 1.05 1.8  520
## 61         2 12 1.10 2.3 16 101 2.05 1.09 0.63 0.41  3.3 1.25 1.7  680
## 62         2 13 1.36 2.0 17 100 2.02 1.41 0.53 0.62  5.8 0.98 1.6  450
## 63         2 14 1.25 1.9 18  94 2.10 1.79 0.32 0.73  3.8 1.23 2.5  630
## 64         2 12 1.13 2.2 19  87 3.50 3.10 0.19 1.87  4.5 1.22 2.9  420
## 65         2 12 1.45 2.5 19 104 1.89 1.75 0.45 1.03  3.0 1.45 2.2  355
## 66         2 12 1.21 2.6 18  98 2.42 2.65 0.37 2.08  4.6 1.19 2.3  678
## 67         2 13 1.01 1.7 15  78 2.98 3.18 0.26 2.28  5.3 1.12 3.2  502
## 68         2 12 1.17 1.9 20  78 2.11 2.00 0.27 1.04  4.7 1.12 3.5  510
## 69         2 13 0.94 2.4 17 110 2.53 1.30 0.55 0.42  3.2 1.02 1.9  750
## 70         2 12 1.19 1.8 17 151 1.85 1.28 0.14 2.50  2.9 1.28 3.1  718
## 71         2 12 1.61 2.2 20 103 1.10 1.02 0.37 1.46  3.0 0.91 1.8  870
## 72         2 14 1.51 2.7 25  86 2.95 2.86 0.21 1.87  3.4 1.36 3.2  410
## 73         2 13 1.66 2.2 24  87 1.88 1.84 0.27 1.03  3.7 0.98 2.8  472
## 74         2 13 1.67 2.6 30 139 3.30 2.89 0.21 1.96  3.4 1.31 3.5  985
## 75         2 12 1.09 2.3 21 101 3.38 2.14 0.13 1.65  3.2 0.99 3.1  886
## 76         2 12 1.88 1.9 16  97 1.61 1.57 0.34 1.15  3.8 1.23 2.1  428
## 77         2 13 0.90 1.7 16  86 1.95 2.03 0.24 1.46  4.6 1.19 2.5  392
## 78         2 12 2.89 2.2 18 112 1.72 1.32 0.43 0.95  2.6 0.96 2.5  500
## 79         2 12 0.99 1.9 15 136 1.90 1.85 0.35 2.76  3.4 1.06 2.3  750
## 80         2 13 3.87 2.4 23 101 2.83 2.55 0.43 1.95  2.6 1.19 3.1  463
## 81         2 12 0.92 2.0 19  86 2.42 2.26 0.30 1.43  2.5 1.38 3.1  278
## 82         2 13 1.81 2.2 19  86 2.20 2.53 0.26 1.77  3.9 1.16 3.1  714
## 83         2 12 1.13 2.5 24  78 2.00 1.58 0.40 1.40  2.2 1.31 2.7  630
## 84         2 13 3.86 2.3 22  85 1.65 1.59 0.61 1.62  4.8 0.84 2.0  515
## 85         2 12 0.89 2.6 18  94 2.20 2.21 0.22 2.35  3.0 0.79 3.1  520
## 86         2 13 0.98 2.2 18  99 2.20 1.94 0.30 1.46  2.6 1.23 3.2  450
## 87         2 12 1.61 2.3 23  90 1.78 1.69 0.43 1.56  2.5 1.33 2.3  495
## 88         2 12 1.67 2.6 26  88 1.92 1.61 0.40 1.34  2.6 1.36 3.2  562
## 89         2 12 2.06 2.5 22  84 1.95 1.69 0.48 1.35  2.8 1.00 2.8  680
## 90         2 12 1.33 2.3 24  70 2.20 1.59 0.42 1.38  1.7 1.07 3.2  625
## 91         2 12 1.83 2.3 18  81 1.60 1.50 0.52 1.64  2.4 1.08 2.3  480
## 92         2 12 1.51 2.4 22  86 1.45 1.25 0.50 1.63  3.6 1.05 2.6  450
## 93         2 13 1.53 2.3 21  80 1.38 1.46 0.58 1.62  3.0 0.96 2.1  495
## 94         2 12 2.83 2.2 18  88 2.45 2.25 0.25 1.99  2.1 1.15 3.3  290
## 95         2 12 1.99 2.3 18  98 3.02 2.26 0.17 1.35  3.2 1.16 3.0  345
## 96         2 12 1.52 2.2 19 162 2.50 2.27 0.32 3.28  2.6 1.16 2.6  937
## 97         2 12 2.12 2.7 22 134 1.60 0.99 0.14 1.56  2.5 0.95 2.3  625
## 98         2 12 1.41 2.0 16  85 2.55 2.50 0.29 1.77  2.9 1.23 2.7  428
## 99         2 12 1.07 2.1 18  88 3.52 3.75 0.24 1.95  4.5 1.04 2.8  660
## 100        2 12 3.17 2.2 18  88 2.85 2.99 0.45 2.81  2.3 1.42 2.8  406
## 101        2 12 2.08 1.7 18  97 2.23 2.17 0.26 1.40  3.3 1.27 3.0  710
## 102        2 13 1.34 1.9 18  88 1.45 1.36 0.29 1.35  2.5 1.04 2.8  562
## 103        2 12 2.45 2.5 21  98 2.56 2.11 0.34 1.31  2.8 0.80 3.4  438
## 104        2 12 1.72 1.9 20  86 2.50 1.64 0.37 1.42  2.1 0.94 2.4  415
## 105        2 13 1.73 2.0 20  85 2.20 1.92 0.32 1.48  2.9 1.04 3.6  672
## 106        2 12 2.55 2.3 22  90 1.68 1.84 0.66 1.42  2.7 0.86 3.3  315
## 107        2 12 1.73 2.1 19  80 1.65 2.03 0.37 1.63  3.4 1.00 3.2  510
## 108        2 13 1.75 2.3 22  84 1.38 1.76 0.48 1.63  3.3 0.88 2.4  488
## 109        2 12 1.29 1.9 19  92 2.36 2.04 0.39 2.08  2.7 0.86 3.0  312
## 110        2 12 1.35 2.7 20  94 2.74 2.92 0.29 2.49  2.6 0.96 3.3  680
## 111        2 11 3.74 1.8 20 107 3.18 2.58 0.24 3.58  2.9 0.75 2.8  562
## 112        2 13 2.43 2.2 21  88 2.55 2.27 0.26 1.22  2.0 0.90 2.8  325
## 113        2 12 2.68 2.9 20 103 1.75 2.03 0.60 1.05  3.8 1.23 2.5  607
## 114        2 11 0.74 2.5 21  88 2.48 2.01 0.42 1.44  3.1 1.10 2.3  434
## 115        2 12 1.39 2.5 22  84 2.56 2.29 0.43 1.04  2.9 0.93 3.2  385
## 116        2 11 1.51 2.2 22  85 2.46 2.17 0.52 2.01  1.9 1.71 2.9  407
## 117        2 12 1.47 2.0 21  86 1.98 1.60 0.30 1.53  1.9 0.95 3.3  495
## 118        2 12 1.61 2.2 22 108 2.00 2.09 0.34 1.61  2.1 1.06 3.0  345
## 119        2 13 3.43 2.0 16  80 1.63 1.25 0.43 0.83  3.4 0.70 2.1  372
## 120        2 12 3.43 2.0 19  87 2.00 1.64 0.37 1.87  1.3 0.93 3.0  564
## 121        2 11 2.40 2.4 20  96 2.90 2.79 0.32 1.83  3.2 0.80 3.4  625
## 122        2 12 2.05 3.2 28 119 3.18 5.08 0.47 1.87  6.0 0.93 3.7  465
## 123        2 12 4.43 2.7 26 102 2.20 2.13 0.43 1.71  2.1 0.92 3.1  365
## 124        2 13 5.80 2.1 22  86 2.62 2.65 0.30 2.01  2.6 0.73 3.1  380
## 125        2 12 4.31 2.4 21  82 2.86 3.03 0.21 2.91  2.8 0.75 3.6  380
## 126        2 12 2.16 2.2 21  85 2.60 2.65 0.37 1.35  2.8 0.86 3.3  378
## 127        2 12 1.53 2.3 22  86 2.74 3.15 0.39 1.77  3.9 0.69 2.8  352
## 128        2 12 2.13 2.8 28  92 2.13 2.24 0.58 1.76  3.0 0.97 2.4  466
## 129        2 12 1.63 2.3 24  88 2.22 2.45 0.40 1.90  2.1 0.89 2.8  342
## 130        2 12 4.30 2.4 22  80 2.10 1.75 0.42 1.35  2.6 0.79 2.6  580
## 131        3 13 1.35 2.3 18 122 1.51 1.25 0.21 0.94  4.1 0.76 1.3  630
## 132        3 13 2.99 2.4 20 104 1.30 1.22 0.24 0.83  5.4 0.74 1.4  530
## 133        3 13 2.31 2.4 24  98 1.15 1.09 0.27 0.83  5.7 0.66 1.4  560
## 134        3 13 3.55 2.4 22 106 1.70 1.20 0.17 0.84  5.0 0.78 1.3  600
## 135        3 13 1.24 2.2 18  85 2.00 0.58 0.60 1.25  5.5 0.75 1.5  650
## 136        3 13 2.46 2.2 18  94 1.62 0.66 0.63 0.94  7.1 0.73 1.6  695
## 137        3 12 4.72 2.5 21  89 1.38 0.47 0.53 0.80  3.9 0.75 1.3  720
## 138        3 13 5.51 2.6 25  96 1.79 0.60 0.63 1.10  5.0 0.82 1.7  515
## 139        3 13 3.59 2.2 20  88 1.62 0.48 0.58 0.88  5.7 0.81 1.8  580
## 140        3 13 2.96 2.6 24 101 2.32 0.60 0.53 0.81  4.9 0.89 2.1  590
## 141        3 13 2.81 2.7 21  96 1.54 0.50 0.53 0.75  4.6 0.77 2.3  600
## 142        3 13 2.56 2.4 20  89 1.40 0.50 0.37 0.64  5.6 0.70 2.5  780
## 143        3 14 3.17 2.7 24  97 1.55 0.52 0.50 0.55  4.3 0.89 2.1  520
## 144        3 14 4.95 2.4 20  92 2.00 0.80 0.47 1.02  4.4 0.91 2.0  550
## 145        3 12 3.88 2.2 18 112 1.38 0.78 0.29 1.14  8.2 0.65 2.0  855
## 146        3 13 3.57 2.1 21 102 1.50 0.55 0.43 1.30  4.0 0.60 1.7  830
## 147        3 14 5.04 2.2 20  80 0.98 0.34 0.40 0.68  4.9 0.58 1.3  415
## 148        3 13 4.61 2.5 22  86 1.70 0.65 0.47 0.86  7.7 0.54 1.9  625
## 149        3 13 3.24 2.4 22  92 1.93 0.76 0.45 1.25  8.4 0.55 1.6  650
## 150        3 13 3.90 2.4 22 113 1.41 1.39 0.34 1.14  9.4 0.57 1.3  550
## 151        3 14 3.12 2.6 24 123 1.40 1.57 0.22 1.25  8.6 0.59 1.3  500
## 152        3 13 2.67 2.5 22 112 1.48 1.36 0.24 1.26 10.8 0.48 1.5  480
## 153        3 13 1.90 2.8 26 116 2.20 1.28 0.26 1.56  7.1 0.61 1.3  425
## 154        3 13 3.30 2.3 18  98 1.80 0.83 0.61 1.87 10.5 0.56 1.5  675
## 155        3 13 1.29 2.1 20 103 1.48 0.58 0.53 1.40  7.6 0.58 1.6  640
## 156        3 13 5.19 2.3 22  93 1.74 0.63 0.61 1.55  7.9 0.60 1.5  725
## 157        3 14 4.12 2.4 20  89 1.80 0.83 0.48 1.56  9.0 0.57 1.6  480
## 158        3 12 3.03 2.6 27  97 1.90 0.58 0.63 1.14  7.5 0.67 1.7  880
## 159        3 14 1.68 2.7 25  98 2.80 1.31 0.53 2.70 13.0 0.57 2.0  660
## 160        3 13 1.67 2.6 22  89 2.60 1.10 0.52 2.29 11.8 0.57 1.8  620
## 161        3 12 3.83 2.4 21  88 2.30 0.92 0.50 1.04  7.7 0.56 1.6  520
## 162        3 14 3.26 2.5 20 107 1.83 0.56 0.50 0.80  5.9 0.96 1.8  680
## 163        3 13 3.27 2.6 22 106 1.65 0.60 0.60 0.96  5.6 0.87 2.1  570
## 164        3 13 3.45 2.4 18 106 1.39 0.70 0.40 0.94  5.3 0.68 1.8  675
## 165        3 14 2.76 2.3 22  90 1.35 0.68 0.41 1.03  9.6 0.70 1.7  615
## 166        3 14 4.36 2.3 22  88 1.28 0.47 0.52 1.15  6.6 0.78 1.8  520
## 167        3 13 3.70 2.6 23 111 1.70 0.92 0.43 1.46 10.7 0.85 1.6  695
## 168        3 13 3.37 2.3 20  88 1.48 0.66 0.40 0.97 10.3 0.72 1.8  685
## 169        3 14 2.58 2.7 24 105 1.55 0.84 0.39 1.54  8.7 0.74 1.8  750
## 170        3 13 4.60 2.9 25 112 1.98 0.96 0.27 1.11  8.5 0.67 1.9  630
## 171        3 12 3.03 2.3 19  96 1.25 0.49 0.40 0.73  5.5 0.66 1.8  510
## 172        3 13 2.39 2.3 20  86 1.39 0.51 0.48 0.64  9.9 0.57 1.6  470
## 173        3 14 2.51 2.5 20  91 1.68 0.70 0.44 1.24  9.7 0.62 1.7  660
## 174        3 14 5.65 2.5 20  95 1.68 0.61 0.52 1.06  7.7 0.64 1.7  740
## 175        3 13 3.91 2.5 23 102 1.80 0.75 0.43 1.41  7.3 0.70 1.6  750
## 176        3 13 4.28 2.3 20 120 1.59 0.69 0.43 1.35 10.2 0.59 1.6  835
## 177        3 13 2.59 2.4 20 120 1.65 0.68 0.53 1.46  9.3 0.60 1.6  840
## 178        3 14 4.10 2.7 24  96 2.05 0.76 0.56 1.35  9.2 0.61 1.6  560
  1. Look at the data.
  2. Do PCA, and plot the results colored by vineyard.
  3. What happens if you set cor=FALSE?
  4. Which variables most strongly differentiate the three vineyards?

IN CLASS

Text analysis

Identifying authors

In data/passages.txt we have a number of short passages from a few different books.

Can we identify the authors of each passage?

The true sources of the passages are in data/passage_sources.tsv.

Turn the data into a matrix

passages <- readLines("data/passages.txt")
sources <- read.table("data/passage_sources.tsv", header=TRUE, stringsAsFactors=TRUE)
words <- sort(unique(strsplit(paste(passages, collapse=" "), " +")[[1]]))
tabwords <- function (x, w) { tabulate(match(strsplit(x, " ")[[1]], w), nbins=length(w)) }
wordmat <- sapply(passages, tabwords, words)
dimnames(wordmat) <- list(words, NULL)
stopifnot( min(rowSums(wordmat)) > 0 )
wordmat[1:20, 1:20]
##              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## a              15    9   12    8    7    7    5   12    8     3    18    12    13    11     2     2    12    14    11    14
## aback           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abaft           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandon         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandoned       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandonment     0    0    0    0    0    1    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abased          0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abasement       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abashed         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abate           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abated          0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abatement       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abating         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbeyland       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbreviate      0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbreviation    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abeam           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abednego        0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abhor           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abhorred        0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0

PCA?

wordpcs <- prcomp(wordmat, scale.=TRUE)
layout(t(1:2))
plot(wordpcs$rotation[,1:2], col=sources$source, pch=20, xlab="PC1", ylab="PC2")
plot(wordpcs$rotation[,2:3], col=sources$source, pch=20, xlab="PC2", ylab="PC3")
legend("topright", pch=20, col=1:3, legend=levels(sources$source))

plot of chunk r wordpca

PC1 is shortness

plot(colSums(wordmat), wordpcs$rotation[,1], col=sources$source, xlab='length', ylab='PC1')

plot of chunk r wordlen

PC2 is book

##             PC1  PC2    PC3
## her        -594 -370 -320.6
## to        -1241 -324  -13.3
## i          -492 -267  369.3
## she        -387 -259 -165.8
## you        -283 -157  229.6
## not        -363 -137   43.3
## be         -348 -137   52.7
## was        -523 -136 -141.9
## had        -305  -99  -76.7
## my         -177  -93  134.8
## have       -225  -81   81.9
## it         -485  -74   93.3
## could      -142  -73  -40.5
## for        -372  -60   29.8
## your        -94  -57   80.8
## very       -123  -54   13.1
## elinor      -70  -53  -16.3
## he         -376  -50   15.6
## would      -141  -49    9.5
## me         -115  -49   95.2
## as         -403  -49   18.8
## been       -136  -48    6.6
## said        -93  -47   33.1
## am          -56  -44   42.1
## elizabeth   -61  -41  -10.9
## him        -230  -41   34.1
## no         -139  -40   15.6
## herself     -59  -38  -38.7
## do          -78  -37   41.4
## know        -54  -36   27.9
## marianne    -51  -36  -23.3
## what       -110  -36   29.3
## sister      -44  -35   -5.8
## such       -106  -34   11.4
## miss        -56  -34   -6.3
## think       -52  -32   25.8
## will        -94  -32   52.1
## much        -84  -31   -3.0
## which      -182  -31  -23.8
## any        -100  -30    8.4
## must        -77  -30   15.9
## mrs         -45  -30  -22.0
## than        -99  -29   -4.7
## should      -65  -29   17.1
## them       -139  -28  -15.4
## on         -231  -27  -10.9
## and       -1227  -27  -63.6
## mother      -38  -27  -15.2
## mr          -47  -26  -12.4
## did         -69  -25   10.9
##             PC1 PC2    PC3
## the     -1938.8 679 -15.86
## whale     -52.4  73  18.40
## in       -729.4  55   3.81
## this     -177.7  44  36.11
## a        -742.2  39  38.59
## like      -54.4  37  12.00
## upon      -59.3  37  19.91
## ahab      -24.1  34   5.49
## sea       -22.4  32   4.68
## ship      -21.8  30   4.80
## his      -446.0  30  -6.71
## old       -31.5  29  12.11
## whales    -20.0  28   7.07
## one      -124.8  28  25.79
## into      -67.1  26   4.94
## now       -92.0  23  11.09
## these     -37.9  22   5.96
## of      -1287.3  22 -61.95
## boat      -12.8  21   0.78
## white     -14.2  20   3.81
## all      -236.2  20   8.10
## ye        -16.5  20  10.40
## then      -71.9  20  18.88
## out       -60.4  19   5.68
## up        -52.1  18   3.87
## sperm     -10.9  17   4.21
## its       -55.0  17  -0.45
## down      -37.2  16   6.23
## through   -22.0  16   0.44
## round     -17.9  16   1.09
## is       -244.3  16 117.91
## captain   -12.4  16   3.89
## from     -190.3  16 -14.54
## yet       -37.1  16   6.32
## those     -30.7  16   4.79
## still     -43.7  15  -5.29
## some      -95.4  15  -1.00
## men       -18.2  15   3.79
## deck       -9.1  14   1.95
## boats      -8.0  14   0.78
## head      -20.8  14   3.29
## crew       -7.7  13   0.58
## air       -15.7  13   0.22
## thou      -10.0  13   7.44
## water      -8.5  13   3.29
## ships      -8.1  13   2.12
## seemed    -37.3  13  -4.83
## feet       -8.1  13   3.29
## stubb      -8.9  13   3.46
## over      -48.1  12   1.85

PC3 ???

##               PC1    PC2    PC3
## her        -594.3 -369.5 -320.6
## she        -386.9 -258.5 -165.8
## was        -522.8 -136.2 -141.9
## had        -304.5  -98.5  -76.7
## and       -1226.8  -27.0  -63.6
## of        -1287.3   21.7  -62.0
## their      -152.2   -3.8  -45.6
## could      -142.3  -73.1  -40.5
## herself     -59.4  -38.3  -38.7
## were       -158.0   -5.4  -30.5
## by         -239.8   -5.3  -24.0
## which      -182.1  -30.8  -23.8
## marianne    -51.0  -36.2  -23.3
## mrs         -45.2  -29.6  -22.0
## every       -82.9  -20.3  -20.0
## they       -166.9  -18.9  -17.0
## elinor      -70.5  -52.6  -16.3
## the       -1938.8  679.3  -15.9
## them       -139.3  -28.0  -15.4
## mother      -38.2  -27.0  -15.2
## from       -190.3   15.9  -14.5
## though      -73.3   -5.7  -14.5
## to        -1240.7 -323.9  -13.3
## mr          -47.2  -25.7  -12.4
## lady        -31.3  -14.4  -12.3
## jennings    -19.3  -13.6  -11.6
## being       -55.0   -5.5  -11.3
## on         -231.3  -27.4  -10.9
## elizabeth   -61.4  -40.8  -10.9
## who         -87.1  -11.2  -10.0
## house       -28.6  -11.1   -9.3
## after       -62.7  -10.0   -9.2
## day         -40.1   -5.3   -9.1
## with       -352.1  -12.6   -8.4
## mariannes    -7.4   -6.3   -8.0
## however     -36.9  -14.1   -8.0
## dashwood    -22.1  -16.0   -8.0
## gave        -16.1   -5.7   -7.6
## whom        -21.4   -6.8   -7.0
## daughter    -10.8   -8.8   -6.9
## arrival      -5.7   -4.1   -6.8
## jane        -25.1  -19.9   -6.8
## his        -446.0   29.5   -6.7
## collins     -10.3   -4.3   -6.5
## visit       -12.5   -8.6   -6.5
## sisters     -21.6  -15.1   -6.5
## miss        -56.4  -33.6   -6.3
## middleton    -8.4   -4.2   -6.2
## soon        -49.6  -21.0   -6.2
## saw         -29.9   -8.1   -6.1
##          PC1    PC2 PC3
## i       -492 -267.1 369
## you     -283 -157.0 230
## my      -177  -93.0 135
## is      -244   16.1 118
## me      -115  -48.8  95
## it      -485  -74.2  93
## have    -225  -81.3  82
## your     -94  -57.0  81
## that    -527  -21.6  71
## be      -348 -136.6  53
## but     -302  -14.7  53
## will     -94  -32.1  52
## are      -86    4.5  46
## if      -104  -20.1  45
## not     -363 -136.9  43
## am       -56  -43.6  42
## do       -78  -36.8  41
## can      -52  -16.7  39
## a       -742   39.1  39
## this    -178   44.2  36
## him     -230  -40.7  34
## said     -93  -46.6  33
## has      -57  -16.2  32
## so      -197  -21.6  30
## for     -372  -59.9  30
## we       -68    3.2  30
## may      -49   -9.7  29
## what    -110  -35.8  29
## know     -54  -36.4  28
## one     -125   28.2  26
## think    -52  -32.3  26
## myself   -24  -16.7  21
## say      -44  -15.0  20
## shall    -32  -16.4  20
## upon     -59   36.6  20
## then     -72   19.8  19
## as      -403  -48.8  19
## whale    -52   72.6  18
## should   -65  -29.2  17
## there   -110    3.9  17
## us       -32    2.5  16
## must     -77  -30.3  16
## he      -376  -49.7  16
## no      -139  -39.7  16
## cannot   -21  -10.7  15
## replied  -20  -17.4  14
## sure     -26  -20.3  13
## here     -30   11.2  13
## very    -123  -54.2  13
## our      -25    2.2  13

PC4 ???

##                PC1    PC2      PC3    PC4
## her         -594.3 -369.5 -3.2e+02 -117.4
## the        -1938.8  679.3 -1.6e+01  -93.0
## i           -491.9 -267.1  3.7e+02  -62.6
## she         -386.9 -258.5 -1.7e+02  -59.3
## you         -283.0 -157.0  2.3e+02  -54.4
## my          -177.1  -93.0  1.3e+02  -49.0
## of         -1287.3   21.7 -6.2e+01  -46.7
## it          -484.5  -74.2  9.3e+01  -30.4
## me          -114.8  -48.8  9.5e+01  -21.4
## will         -94.3  -32.1  5.2e+01  -19.3
## elinor       -70.5  -52.6 -1.6e+01  -18.9
## marianne     -51.0  -36.2 -2.3e+01  -18.8
## are          -86.0    4.5  4.6e+01  -15.9
## we           -68.2    3.2  3.0e+01  -14.9
## am           -55.9  -43.6  4.2e+01  -14.6
## they        -166.9  -18.9 -1.7e+01  -13.3
## their       -152.2   -3.8 -4.6e+01  -13.2
## jennings     -19.3  -13.6 -1.2e+01  -11.9
## is          -244.3   16.1  1.2e+02  -11.5
## herself      -59.4  -38.3 -3.9e+01  -11.3
## be          -347.8 -136.6  5.3e+01  -10.7
## every        -82.9  -20.3 -2.0e+01  -10.5
## whale        -52.4   72.6  1.8e+01  -10.2
## shall        -31.7  -16.4  2.0e+01  -10.1
## them        -139.3  -28.0 -1.5e+01   -9.7
## from        -190.3   15.9 -1.5e+01   -9.3
## miss         -56.4  -33.6 -6.3e+00   -9.0
## thing        -35.0   -7.0  7.2e+00   -8.9
## by          -239.8   -5.3 -2.4e+01   -8.5
## your         -93.5  -57.0  8.1e+01   -8.4
## must         -76.9  -30.3  1.6e+01   -8.2
## sister       -44.2  -34.8 -5.8e+00   -7.7
## mrs          -45.2  -29.6 -2.2e+01   -7.6
## mother       -38.2  -27.0 -1.5e+01   -7.4
## lucy         -16.2  -14.3 -3.2e+00   -7.0
## any         -100.3  -30.3  8.4e+00   -7.0
## myself       -24.1  -16.7  2.1e+01   -6.8
## without      -53.5  -15.4 -2.1e+00   -6.8
## its          -55.0   16.7 -4.5e-01   -5.7
## can          -52.2  -16.7  3.9e+01   -5.6
## willoughby   -15.9  -11.3 -3.1e+00   -5.5
## may          -49.0   -9.7  2.9e+01   -5.5
## dear         -20.5  -17.3  8.5e+00   -5.4
## these        -37.9   22.0  6.0e+00   -5.1
## john         -16.9   -9.0 -6.0e+00   -4.8
## dashwood     -22.1  -16.0 -8.0e+00   -4.7
## all         -236.2   20.3  8.1e+00   -4.7
## same         -32.9    4.1  1.1e+00   -4.2
## again        -45.1   -8.7 -2.6e+00   -4.2
## door          -8.8   -2.5  4.4e-03   -4.2
##              PC1    PC2     PC3   PC4
## his       -446.0   29.5   -6.71 279.7
## he        -375.6  -49.7   15.59 238.7
## him       -229.9  -40.7   34.09 120.3
## had       -304.5  -98.5  -76.71  57.1
## was       -522.8 -136.2 -141.90  55.4
## that      -527.1  -21.6   71.46  34.6
## and      -1226.8  -27.0  -63.65  28.9
## a         -742.2   39.1   38.59  28.8
## to       -1240.7 -323.9  -13.29  26.9
## but       -301.8  -14.7   52.58  25.3
## himself    -39.4   -2.9   -3.90  20.0
## been      -136.4  -48.2    6.55  17.7
## man        -49.2    8.6   11.45  17.0
## with      -352.1  -12.6   -8.42  15.1
## more      -121.1  -18.2    0.66  12.3
## at        -281.6  -19.2   -2.76  12.3
## mr         -47.2  -25.7  -12.36  11.9
## wickham    -12.0  -11.1   -4.67  11.5
## ahab       -24.1   34.3    5.49  11.1
## so        -197.1  -21.6   30.05  10.3
## not       -363.1 -136.9   43.29  10.2
## little     -51.2   -8.8    2.53  10.0
## upon       -59.3   36.6   19.91   9.2
## seemed     -37.3   12.9   -4.83   8.9
## or        -120.3    3.5    6.19   7.8
## bingley    -23.6  -16.1   -1.40   7.7
## up         -52.1   17.8    3.87   7.5
## darcy      -28.5  -19.6    0.80   7.5
## ye         -16.5   20.1   10.40   7.0
## now        -92.0   23.0   11.09   6.9
## much       -84.3  -31.2   -3.00   6.8
## most       -56.4   -2.3    3.19   6.7
## did        -69.1  -25.3   10.89   6.7
## bennet     -25.4  -18.9    0.10   6.4
## stubb       -8.9   12.5    3.46   6.3
## could     -142.3  -73.1  -40.47   6.3
## there     -109.6    3.9   16.93   6.2
## old        -31.5   28.9   12.11   6.2
## very      -123.0  -54.2   13.10   6.1
## an        -120.6   -4.9   -5.87   6.0
## friend     -17.4  -10.2    0.82   6.0
## seen       -22.7    4.7    3.80   5.9
## men        -18.2   15.1    3.79   5.9
## handsome    -5.7   -4.1   -0.20   5.8
## like       -54.4   36.9   12.00   5.6
## said       -92.8  -46.6   33.13   5.6
## manners     -8.0   -5.3   -1.23   5.5
## never      -52.4  -18.6    9.24   5.2
## queequeg    -9.2   10.1    5.99   5.2
## captain    -12.4   16.1    3.89   5.0

PCA: Squaring the math and the code:

The data matrix \(X\) can be written using the singular value decomposition, as \[ X = U \Lambda V^T, \] where \(U^T U = I\) and \(V^T V = I\) and \(\Lambda\) has the singular values \(\lambda_1, \ldots, \lambda_m\) on the diagonal.

The best \(k\)-dimensional approximation to \(X\) is \[ \hat X = \sum_{i=1}^k \lambda_i u_i v_i^T , \] where \(u_i\) and \(v_i\) are the \(i\)th columns of \(U\) and \(V\).

Furthermore, \[ \sum_{ij} X_{ij} = \sum_i \lambda_i^2 . \]

A translation table

  1. \(u_\ell\) is the \(\ell\)th PC, standardized.
  2. \(v_\ell\) gives the loadings of the \(\ell\)th PC.
  3. \(\lambda_\ell^2 / \sum_{ij} X_{ij}^2\) is the percent variation explained by the \(\ell\)th PC.

Furthermore, since \(\Lambda U = X V\),

  1. \(\lambda_\ell u_\ell\) is the vector of values given by the linear combination of the variables with weights given by \(v_\ell\).

Translation to prcomp:

X <- lizards[,-(1:4)]
X.svd <- svd(X)
X.pca <- prcomp(X, center=FALSE, scale=FALSE)

# Singular values:
#  prcomps's sdevs are singular values / sqrt(n-1)
stopifnot(all(X.svd$d / sqrt(nrow(X) - 1) == X.pca$sdev))

# Loadings:
#  prcomp's loadings (returned as "rotation") are V
stopifnot(all(X.pca$rotation == X.svd$v))

# PCs:
#  prcomp's PCs (returned as "x")
#  are Lambda * U
stopifnot(all(abs(X.pca$x - X.svd$u * X.svd$d[col(X.svd$u)]) < 1e-10))

Translation to princomp:

princomp uses eigendecomposition of the covariance matrix, and so necessarily recenters the variables.

X.svd2 <- svd(scale(X, center=TRUE, scale=FALSE))
X.pca2 <- princomp(X, cor=FALSE, scores=TRUE)

# Singular values:
#  princomps's sdevs are singular values / sqrt(n)
stopifnot(all(abs(sqrt(eigen(cov(X))$values * (nrow(X)-1)/nrow(X)) - X.pca2$sdev) < 1e-10))
stopifnot(all(abs(X.svd2$d / sqrt(nrow(X)) - X.pca2$sdev) < 1e-10))

# Loadings:
#  princomp's loadings are V, up to sign
the_signs <- sign(X.pca2$loadings[1,] * X.svd2$v[1,])
stopifnot(all(abs(X.pca2$loadings - X.svd2$v * the_signs[col(X.svd2$v)]) < 1e-10))

# PCs:
#  princomp's PCs (returned as "scores")
#  are Lambda * U
stopifnot(all(abs(X.pca2$scores
                  - X.svd2$u * X.svd2$d[col(X.svd2$u)] * the_signs[col(X.svd2$u)]) < 1e-10))