5.2 R illustration of Frequentist ANOVA
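Let’s use R’s built-in precip data set, which contains precipitation amounts for 70 U.S. cities and is roughly normally distributed, to create three groups. Because all three groups are drawn at random from the same data, any differences among them are due to sampling error alone.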
# Run ANOVA on groups sampled from the same population
set.seed(10)
# Enough for 3 groups of 20
precipAmount <- sample(precip, 60, replace=TRUE)
# Group designators, 3 groups
precipGrp <- as.factor(rep(seq(from=1, to=3, by=1), 20))
precipDF <- data.frame(precipAmount, precipGrp)
# Get a box plot of the distribs
boxplot(precipAmount ~ precipGrp, data=precipDF)
# Run the ANOVA
precipOut <- aov(precipAmount ~ precipGrp, data=precipDF)
summary(precipOut) # Provide an ANOVA table
##             Df Sum Sq Mean Sq F value Pr(>F)
## precipGrp    2    352   176.2   0.995  0.376
## Residuals   57  10096   177.1
Interpretation
- Df: degrees of freedom, a measure of how many elements of a set remain free to vary once a base set of statistics has been computed. With 60 data points, one degree of freedom is lost in calculating the grand mean, leaving 59. Once the grand mean is fixed, only two of the three group means can vary freely, so the between-groups degrees of freedom are 2. That leaves 57 degrees of freedom within groups (the residuals).
- Sum Sq: the first line is the “between-groups” sum of squares; the second line is the “within-groups” sum of squares.
- Mean Sq: a variance, that is, the sum of squares divided by the degrees of freedom. The first line is the variance “between groups,” and the second line is the variance “within groups.”
- F value: the ratio of the between-groups variance to the within-groups variance.
- Pr(>F): the probability of a larger F-ratio. This is the probability of obtaining an F-value at least this large in the random distribution of F-ratios for the degrees of freedom shown in this table, assuming the null hypothesis is true. The smaller the probability, the stronger the evidence against the null hypothesis that the groups come from the same population.
- precipGrp: the independent variable.
- Residuals: accounts for all the within-groups variability; it is what is left over after the systematic variance (precipGrp) is removed.
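The definitions above can be checked by rebuilding the ANOVA table by hand. The following is a minimal sketch, assuming the same seed and sampling steps as the listing; the variable names (grandMean, ssBetween, msWithin, and so on) are illustrative and not part of the original code. Because sample() can return different draws in different R versions, the numbers may not match the printed table exactly, but they should agree with whatever aov() reports in the same session.

```r
# Rebuild the data with the same steps as the listing above
set.seed(10)
precipAmount <- sample(precip, 60, replace = TRUE)
precipGrp <- as.factor(rep(seq(from = 1, to = 3, by = 1), 20))

grandMean  <- mean(precipAmount)
groupMeans <- tapply(precipAmount, precipGrp, mean)    # one mean per group
groupSizes <- tapply(precipAmount, precipGrp, length)  # 20 per group

# Between-groups: spread of the group means around the grand mean
ssBetween <- sum(groupSizes * (groupMeans - grandMean)^2)
dfBetween <- nlevels(precipGrp) - 1

# Within-groups: spread of raw scores around their own group mean
# (ave() returns the group mean for each observation)
ssWithin <- sum((precipAmount - ave(precipAmount, precipGrp))^2)
dfWithin <- length(precipAmount) - nlevels(precipGrp)

# Mean squares are sums of squares divided by degrees of freedom
msBetween <- ssBetween / dfBetween
msWithin  <- ssWithin / dfWithin

# F is the ratio of the two variances; Pr(>F) is the upper-tail area
fValue <- msBetween / msWithin
pValue <- pf(fValue, dfBetween, dfWithin, lower.tail = FALSE)

round(c(SumSqBetween = ssBetween, SumSqWithin = ssWithin,
        MeanSqBetween = msBetween, MeanSqWithin = msWithin,
        F = fValue, p = pValue), 3)
```

Comparing the printed values with summary(aov(precipAmount ~ precipGrp)) in the same session shows that each column of the table is simple arithmetic on the sums of squares and degrees of freedom.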
When the data for all the groups are sampled from the same underlying population, we expect the between-groups variance (mean square) to be about equal to the within-groups variance (mean square). The former is calculated from the spread of the group means, whereas the latter is calculated from the variation of the raw data within each group. For an ANOVA result to be statistically significant, F must be substantially larger than one, and Pr(>F) must be less than the alpha level to reject the null hypothesis that there is no difference among the groups. Here F = 0.995 and Pr(>F) = 0.376, so at a conventional alpha of 0.05 we fail to reject the null hypothesis, as expected for groups drawn from the same data.
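The decision rule can be sketched directly from the values in the ANOVA table. This is an illustration only: the alpha of 0.05 is an assumed conventional threshold that the text does not specify, and the variable names are hypothetical. It uses the base R functions qf() and pf() for the F distribution.

```r
alpha     <- 0.05   # assumed significance level, not stated in the text
fObserved <- 0.995  # F value from the ANOVA table
dfBetween <- 2      # Df for precipGrp
dfWithin  <- 57     # Df for Residuals

# Smallest F that would be significant at this alpha (about 3.16)
fCritical <- qf(1 - alpha, dfBetween, dfWithin)

# Upper-tail probability of the observed F, i.e. Pr(>F)
pValue <- pf(fObserved, dfBetween, dfWithin, lower.tail = FALSE)

# Reject the null hypothesis only if Pr(>F) is below alpha
rejectNull <- pValue < alpha

round(c(fCritical = fCritical, pValue = pValue), 3)
rejectNull
```

The observed F is close to one and far below the critical value, so rejectNull is FALSE, which matches the reasoning in the paragraph above.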