11.1 Principal Component Analysis

We use “iris” dataset that contains four measurements of the size of the iris flowers. We want to ask whether these measurements share the same underlying variance of plant size. Here is the overview of the dataset:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We focus on the previous four variables for PCA:

irisN <- subset(iris,select=-Species) # Remove the species designator
str(irisN)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

The correlation matrix of the four variables is:

round(cor(irisN),digits=3) # Show correlation matrix for the iris data
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length        1.000      -0.118        0.872       0.818
## Sepal.Width        -0.118       1.000       -0.428      -0.366
## Petal.Length        0.872      -0.428        1.000       0.963
## Petal.Width         0.818      -0.366        0.963       1.000

From this matrix, we can see that Sepal.Width seems somewhat disconnected from the rest of the variables because the correlation between it and one of the others is weak correlated compare to other pairs. Now let’s look at the PCA.

# install.packages("psych")
library(psych)
irisNout <- principal(irisN)
irisNout
## Principal Components Analysis
## Call: principal(r = irisN)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                PC1   h2    u2 com
## Sepal.Length  0.89 0.79 0.208   1
## Sepal.Width  -0.46 0.21 0.788   1
## Petal.Length  0.99 0.98 0.017   1
## Petal.Width   0.96 0.93 0.069   1
## 
##                 PC1
## SS loadings    2.92
## Proportion Var 0.73
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  28.19  with prob <  7.6e-07 
## 
## Fit based upon off diagonal values = 0.97

Principal component analysis recognize both variance and covariance from the variables. principal() tries to map the common variance among the input variables into the first principal component. The PC1 is a linear combination of the four variables. The loadings are the coefficients of the linear combination. You can see that the Sepal Width has a significantly smaller loading (-0.46) compared to the other three widths, which all load quite heavily onto the first major component. h2 represents communality, which is the proportion of variance in the input variable explained by the component. Because we choose a one-factor solution, the communality is the square of the loading. More generally, the communality is the sum of the squared loadings for each component. The u2 represents uniqueness, which is the proportion of variance in the input variable that is not explained by the component. The com represents complexity, which is the spread out of a variable’s loadings across the factors where a value closer to 1 means more simplicity and a value larger than 1 means more complexity.

The next section focus on the loadings. SSLoadings (eigenvalues) is the sum of squares and Proportion Var is the proportion of variance account for by the principal component. The last part is a test to see whether one principal component is enough to represent the data. In this case a siginificant result means that there is a substantial amount of variance in the data that is not explained by the first principal component. The Proportion Var fails to account for enough of the variance in the data to be considered a good fit.

Therefore, we continue to explore a two-component solution.

irisNout <- principal(irisN, nfactors=2)
irisNout
## Principal Components Analysis
## Call: principal(r = irisN, nfactors = 2)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                RC1   RC2   h2     u2 com
## Sepal.Length  0.96  0.05 0.92 0.0774 1.0
## Sepal.Width  -0.14  0.98 0.99 0.0091 1.0
## Petal.Length  0.94 -0.30 0.98 0.0163 1.2
## Petal.Width   0.93 -0.26 0.94 0.0647 1.2
## 
##                        RC1  RC2
## SS loadings           2.70 1.13
## Proportion Var        0.68 0.28
## Cumulative Var        0.68 0.96
## Proportion Explained  0.71 0.29
## Cumulative Proportion 0.71 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.03 
##  with the empirical chi square  1.72  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

The very small empirical chi-square with prob<NA indicates that two principal components account for virtually all the variance in the original variables. We can see that Sepal.Length, Petal.Length, and Petal.Width load mainly on the first component and Sepal.Width loads mainly on the second component.

It is noted that we need to control the scale of each variable before adding or averaging the measurements. If they are in the same scale (e.g., Likert-scale), we can create a composite by averaging the items.

summary(irisN)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Each item has a different scale range. So we need to standardize the scale of each item before averaging them.

irisNS <- scale(irisN)  # standardize each variable
flowerSize <- (irisNS[,1]+ irisNS[,3]+ irisNS[,4])/3
length(flowerSize)
## [1] 150
mean(flowerSize)
## [1] -1.782016e-16
sd(flowerSize)
## [1] 0.9606202