11.1 Principal Component Analysis
We use the “iris” dataset, which contains four measurements of the size of iris flowers. We want to ask whether these measurements reflect a common underlying dimension of plant size. Here is an overview of the dataset:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We focus on the previous four variables for PCA:
irisN <- subset(iris,select=-Species) # Remove the species designator
str(irisN)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
The correlation matrix of the four variables is:
round(cor(irisN),digits=3) # Show correlation matrix for the iris data
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.000 -0.118 0.872 0.818
## Sepal.Width -0.118 1.000 -0.428 -0.366
## Petal.Length 0.872 -0.428 1.000 0.963
## Petal.Width 0.818 -0.366 0.963 1.000
From this matrix, we can see that Sepal.Width seems somewhat disconnected from the rest of the variables: its correlations with the others are weak compared to those among the other pairs. Now let’s look at the PCA.
# install.packages("psych")
library(psych)
irisNout <- principal(irisN)
irisNout
## Principal Components Analysis
## Call: principal(r = irisN)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 h2 u2 com
## Sepal.Length 0.89 0.79 0.208 1
## Sepal.Width -0.46 0.21 0.788 1
## Petal.Length 0.99 0.98 0.017 1
## Petal.Width 0.96 0.93 0.069 1
##
## PC1
## SS loadings 2.92
## Proportion Var 0.73
##
## Mean item complexity = 1
## Test of the hypothesis that 1 component is sufficient.
##
## The root mean square of the residuals (RMSR) is 0.13
## with the empirical chi square 28.19 with prob < 7.6e-07
##
## Fit based upon off diagonal values = 0.97
Principal component analysis captures both the variance and covariance of the variables. principal() tries to map the common variance among the input variables onto the first principal component. PC1 is a linear combination of the four variables, and the loadings are the coefficients of that combination. You can see that Sepal.Width has a markedly smaller loading (-0.46) than the other three measurements, which all load heavily on the first component. h2 is the communality, the proportion of variance in an input variable explained by the components. Because we chose a one-component solution, the communality is simply the square of the loading; more generally, it is the sum of the squared loadings across the components. u2 is the uniqueness, the proportion of variance in an input variable that is not explained by the components. com is the complexity, which describes how spread out a variable’s loadings are across the components: a value close to 1 means the variable loads mainly on one component (simplicity), while a value larger than 1 means its loadings are more dispersed (complexity).
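The relationship between the loadings, h2, and u2 can be checked with base R alone. This is a sketch of the computation principal() performs on the correlation matrix: the standardized PC1 loading is the first eigenvector scaled by the square root of its eigenvalue.

```r
# Base-R sketch of the PC1 column. Note that eigenvector signs are
# arbitrary, so the loadings may come out with flipped signs.
irisN <- subset(iris, select = -Species)
eig <- eigen(cor(irisN))

load1 <- eig$vectors[, 1] * sqrt(eig$values[1]) # standardized PC1 loadings
h2 <- load1^2  # communality: one component, so just the squared loading
u2 <- 1 - h2   # uniqueness: the variance PC1 leaves unexplained
round(cbind(loading = load1, h2 = h2, u2 = u2), 2)
```

The h2 and u2 columns should match the principal() output above, confirming that for a one-component solution the communality is exactly the squared loading.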
The next section focuses on the loadings. SS loadings (the eigenvalue) is the sum of the squared loadings, and Proportion Var is the proportion of variance accounted for by the principal component. The last part is a test of whether one principal component is enough to represent the data. In this case, a significant result means that a substantial amount of variance in the data is not explained by the first principal component: the Proportion Var of 0.73 fails to account for enough of the variance to be considered a good fit.
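The SS loadings and Proportion Var rows can likewise be reproduced from the eigenvalues of the correlation matrix: each eigenvalue is the sum of squared loadings for its component, and because the eigenvalues of a correlation matrix sum to the number of variables, dividing by that number gives the proportion of variance.

```r
# Eigenvalues of the correlation matrix: the first one (2.92) is the
# SS loading reported above, and eigenvalue / number of variables
# gives the Proportion Var (2.92 / 4 = 0.73).
evals <- eigen(cor(subset(iris, select = -Species)))$values
round(evals, 2)
round(evals / length(evals), 2)
```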
Therefore, we continue to explore a two-component solution.
irisNout <- principal(irisN, nfactors=2)
irisNout
## Principal Components Analysis
## Call: principal(r = irisN, nfactors = 2)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## Sepal.Length 0.96 0.05 0.92 0.0774 1.0
## Sepal.Width -0.14 0.98 0.99 0.0091 1.0
## Petal.Length 0.94 -0.30 0.98 0.0163 1.2
## Petal.Width 0.93 -0.26 0.94 0.0647 1.2
##
## RC1 RC2
## SS loadings 2.70 1.13
## Proportion Var 0.68 0.28
## Cumulative Var 0.68 0.96
## Proportion Explained 0.71 0.29
## Cumulative Proportion 0.71 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.03
## with the empirical chi square 1.72 with prob < NA
##
## Fit based upon off diagonal values = 1
The very small empirical chi-square indicates that two principal components account for virtually all the variance in the original variables. (The components are now labeled RC rather than PC because principal() rotates the solution by default when more than one component is requested.) We can see that Sepal.Length, Petal.Length, and Petal.Width load mainly on the first component, and Sepal.Width loads mainly on the second component.
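As a sketch of where the rotated loadings come from, the two-component varimax rotation can be reproduced with the varimax() function in base R's stats package, applied to the unrotated loadings. Up to sign flips and column order, the result should resemble the RC1/RC2 loadings above.

```r
# Unrotated two-component loadings from the correlation matrix,
# followed by a varimax rotation (stats::varimax). An orthogonal
# rotation leaves the communalities (h2) unchanged.
eig <- eigen(cor(subset(iris, select = -Species)))
raw <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
rot <- varimax(raw)
round(unclass(rot$loadings), 2)
round(rowSums(unclass(rot$loadings)^2), 2) # communalities survive rotation
```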
Note that we need to put each variable on a common scale before adding or averaging the measurements. If the items are already on the same scale (e.g., a Likert scale), we can create a composite simply by averaging them.
summary(irisN)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Each item has a different range, so we need to standardize each variable before averaging. We then average the three standardized variables that loaded on the first component (omitting Sepal.Width) to create a flower-size composite.
irisNS <- scale(irisN) # standardize each variable
flowerSize <- (irisNS[,1] + irisNS[,3] + irisNS[,4])/3
length(flowerSize)
## [1] 150
mean(flowerSize)
## [1] -1.782016e-16
sd(flowerSize)
## [1] 0.9606202
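The same composite can be built with rowMeans, selecting the columns by name rather than by position; this is just an equivalent sketch of the computation above, and may be easier to read.

```r
# Average the three standardized variables that loaded on the first
# component, selecting them by name instead of by column index.
irisNS <- scale(subset(iris, select = -Species))
flowerSize <- rowMeans(irisNS[, c("Sepal.Length", "Petal.Length", "Petal.Width")])
round(c(mean = mean(flowerSize), sd = sd(flowerSize)), 3)
```

As before, the composite has mean zero (each standardized column has mean zero) and a standard deviation slightly below one.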