PCA

In this lab, we perform PCA on the \(\texttt{USArrests}\) data set, which is part of the base \(\texttt{R}\) package. The rows of the data set contain the 50 states, in alphabetical order.

states = row.names(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

The columns of the data set contain the four variables.

names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

We first briefly examine the data. We notice that the variables have vastly different means.

apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

Note that the \(\texttt{apply()}\) function allows us to apply a function, in this case the \(\texttt{mean()}\) function, to each row or column of the data set. The second input denotes whether we wish to compute the mean of the rows (1) or the columns (2). We see that there are on average three times as many rapes as murders, and more than eight times as many assaults as rapes.
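As a brief aside (not part of the original lab), the two margins can be compared directly; the row means are not meaningful here, since they average crime counts with a percentage, but the call illustrates the mechanics:

head(apply(USArrests, 1, mean)) # margin 1: one mean per state (row)
apply(USArrests, 2, mean)       # margin 2: one mean per variable (column)

We can also examine the variances of the four variables using the \(\texttt{apply()}\) function.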

apply(USArrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

Not surprisingly, the variables also have vastly different variances: the \(\texttt{UrbanPop}\) variable measures the percentage of the population in each state living in an urban area, which is not comparable to the number of rapes in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the \(\texttt{Assault}\) variable, since it has by far the largest mean and variance. Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
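We can see this effect directly by comparing the component standard deviations with and without scaling; this quick comparison is an aside, not part of the original lab:

# Without scaling, the first component absorbs almost all of the
# variance, since Assault dominates the total variance.
prcomp(USArrests)$sdev               # unscaled: first component dwarfs the rest
prcomp(USArrests, scale = TRUE)$sdev # scaled: a more balanced decomposition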

We now perform principal components analysis using the \(\texttt{prcomp()}\) function, which is one of several functions in \(\texttt{R}\) that perform PCA.

pr.out = prcomp(USArrests, scale = TRUE)

By default, the \(\texttt{prcomp()}\) function centers the variables to have mean zero. By using the option \(\texttt{scale = TRUE}\), we scale the variables to have standard deviation one. The output of \(\texttt{prcomp()}\) contains a number of useful quantities.

names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The \(\texttt{center}\) and \(\texttt{scale}\) components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA.

pr.out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pr.out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
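As a quick sanity check (an aside, not part of the original lab), these stored values are simply the column means and standard deviations we could compute with \(\texttt{apply()}\):

# center and scale hold the per-variable means and standard deviations
all.equal(pr.out$center, apply(USArrests, 2, mean)) # should be TRUE
all.equal(pr.out$scale, apply(USArrests, 2, sd))    # should be TRUE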

The \(\texttt{rotation}\) matrix provides the principal component loadings; each column of \(\texttt{pr.out\$rotation}\) contains the corresponding principal component loading vector.

pr.out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
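The loading vectors are orthonormal: each has unit length, and distinct vectors are orthogonal. As an aside, we can verify this numerically, since the cross-product of the rotation matrix with itself should equal the identity matrix up to floating-point error:

# Orthonormal loadings: t(rotation) %*% rotation is (numerically) the identity
round(t(pr.out$rotation) %*% pr.out$rotation, 10)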

We see that there are four distinct principal components. This is to be expected because there are in general \(\min(n - 1, p)\) informative principal components in a data set with \(n\) observations and \(p\) variables.
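As a toy illustration (random data, so the exact values will vary), with \(n = 3\) observations and \(p = 4\) variables only \(\min(n - 1, p) = 2\) components carry any variance, because centering the data removes one dimension:

set.seed(1)
toy = matrix(rnorm(12), nrow = 3) # n = 3 observations, p = 4 variables
prcomp(toy)$sdev # the third standard deviation is essentially zero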

Using the \(\texttt{prcomp()}\) function, we do not need to explicitly multiply the data by the principal component loading vectors in order to obtain the principal component score vectors. Rather, the \(50 \times 4\) matrix \(\texttt{x}\) has as its columns the principal component score vectors. That is, the \(k\)th column is the \(k\)th principal component score vector.

dim(pr.out$x)
## [1] 50  4
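Equivalently, the score matrix can be computed by hand as the scaled data matrix multiplied by the rotation matrix. The following check is an aside (the name \(\texttt{scores.byhand}\) is ours), and the difference should vanish up to floating-point error:

scores.byhand = scale(USArrests) %*% pr.out$rotation # scores = scaled data times loadings
max(abs(scores.byhand - pr.out$x)) # should be essentially zero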

We can plot the first two principal components as follows:

biplot(pr.out, scale = 0)

The \(\texttt{scale = 0}\) argument to \(\texttt{biplot()}\) ensures that the arrows are scaled to represent the loadings; other values for \(\texttt{scale}\) give slightly different biplots with different interpretations.

Notice that this figure is a mirror image of Figure 10.1. Recall that the principal components are only unique up to a sign change, so we can reproduce Figure 10.1 by making a few small changes:

pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale = 0)

The \(\texttt{prcomp()}\) function also outputs the standard deviation of each principal component. For instance, on the \(\texttt{USArrests}\) data set, we can access these standard deviations as follows:

pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
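These are simply the standard deviations of the columns of the score matrix \(\texttt{pr.out\$x}\); as a quick aside, we can confirm this with \(\texttt{apply()}\):

apply(pr.out$x, 2, sd) # should match pr.out$sdev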

The variance explained by each principal component is obtained by squaring these:

pr.var = pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301

To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:

pve = pr.var / sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
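The same proportions, along with the cumulative proportions, are also reported directly by applying \(\texttt{summary()}\) to a \(\texttt{prcomp}\) object:

# summary() tabulates the standard deviation, proportion of variance,
# and cumulative proportion for each component.
summary(pr.out)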

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:

plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0,1), type = 'b')

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0,1), type = 'b')

The result is shown in Figure 10.4. Note that the function \(\texttt{cumsum()}\) computes the cumulative sum of the elements of a numeric vector. For instance:

a = c(1,2,8,-3)
cumsum(a)
## [1]  1  3 11  8
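As a final aside before the exercise, base \(\texttt{R}\) also provides the \(\texttt{screeplot()}\) function, which produces a similar plot directly from the \(\texttt{prcomp}\) object:

screeplot(pr.out, type = "lines") # plots the component variances rather than proportions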

Applied Exercise

Question 8

In section 10.2.3, a formula for calculating the Proportion of Variance Explained (PVE) was given in Equation 10.8. We also saw that the PVE can be obtained using the \(\texttt{sdev}\) output of the \(\texttt{prcomp()}\) function.

Equation 10.8

\[
\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^{2}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}
\]

This gives the PVE of the \(m\)th principal component.

On the \(\texttt{USArrests}\) data, calculate PVE in two ways:

  • Method 1: Using the \(\texttt{sdev}\) output of the \(\texttt{prcomp()}\) function, as we did in Section 10.2.3.

  • Method 2: By applying Equation 10.8 directly. That is, use the \(\texttt{prcomp()}\) function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.

Method 1 Solution:

Recall: Proportion of Variance Explained = variance explained / total variance

  1. Calculate the variance from the standard deviation. Note: the \(\texttt{prcomp()}\) function returns the standard deviation of each principal component.
pr.out = prcomp(USArrests, scale = TRUE)

# Variance = SD^2
pr.var = pr.out$sdev^2
pr.var # variances of each principal component
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
  2. Calculate the total variance. Since each standardized variable has variance one, the total variance is \(p = 4\).
# Sum all variances
pr.varSum = sum(pr.var)
pr.varSum # sum of principal component variances
## [1] 4
  3. Divide each variance by the total variance (Step 1 / Step 2).
# Divide each principal component variance by total variance 
pve = pr.var/pr.varSum
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752


Method 2 Solution:
  1. Find the loadings using the \(\texttt{prcomp()}\) function.

Note: the loadings are stored in the \(\texttt{rotation}\) component of the \(\texttt{prcomp()}\) output, accessed here as \(\texttt{pr.out\$rotation}\).

pr.loadings = pr.out$rotation
pr.loadings
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
  2. Scale the data and convert it to a matrix.
scaled.data = scale(USArrests)
scaled.matrix = as.matrix(scaled.data)
head(scaled.matrix) # view first 6 rows of data
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
  3. Calculate the total variance. Because the scaled data are centered, the denominator of Equation 10.8 is simply the sum of the squared entries.
var.sum = sum(apply(scaled.matrix^2, 2, sum)) # 2 indicates applying the function to columns
  4. Multiply the scaled data matrix by the loadings, square the entries, sum over observations, and divide by the total variance. Note: \(\texttt{\%*\%}\) denotes matrix multiplication.
pve = apply((scaled.matrix %*% pr.loadings)^2, 2, sum) / var.sum
pve
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752
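As a final check (an aside, not part of the original exercise), the two methods agree once the names attached to the Method 2 result are dropped:

all.equal(unname(pve), pr.var / sum(pr.var)) # should be TRUE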

To learn more about PCA via an interactive example, visit: https://setosa.io/ev/principal-component-analysis/