Stat 411/511

Lab 7

Nov 16/17


  • Running anova in R
  • Checking anova assumptions
  • Linear combinations of means in R

One way anova in R

We saw the R code for a one way ANOVA on the Spock trial in lecture:

anova(lm(Percent ~ Judge, data = case0502))

We are actually doing two things in one go, fitting a separate means model:

full_model <- lm(Percent ~ Judge, data = case0502)

then running an ANOVA on the separate means model:


When you give the anova function a single model, it runs what we call a sequential analysis of variance, but for our purposes we just need to know it compares our full model to a model with a single mean, i.e. a one way anova. If you give anova two models it will compare them with an Extra Sums of Squares F-test. So, for example we could fit an equal means model explicitly and compare it to the full model,

null_model <- lm(Percent ~ 1, data = case0502)
anova(full_model, null_model)

And you’ll see we get exactly the same results.

The lm function deserves a little more explanation. lm is short for linear model and is R’s general purpose function for fitting regression models. The first argument specifies the model. For lm the column on the left hand side of the ~ is the mean we want to model, on the right the terms we want to model it with. So Percent ~ Judge says we want to model the mean Percent as a function of the Judge. Since Judge contains categories, this means one mean parameter for each Judge (if Judge was numeric we would end up with a simple linear regression). Percent ~ 1 is interpreted as modelling the mean Percent by a single mean.

Anova Assumptions

To check the anova assumptions it’s generally easiest to examine the residuals from the full model, although sometimes you will pick up gross violations just from a plot of the raw response. We’ve already fit the full model, so we can generate the plots we covered in lecture with:

qplot(.fitted, .resid, data = full_model)
qplot(Judge, .resid, data = full_model)
qplot(sample = .resid, data = full_model) + stat_qqline()

What would evidence of a violation look like in each plot? Your TA should go through these plots with you.

Now to get some practice. InsectSprays contains counts of insects in an experiment comparing different insecticide sprays. We are interested in whether there is evidence any of the sprays have a different mean count of insects. Take a look at the raw data:

qplot(reorder(spray, count), count, data = InsectSprays, geom= "boxplot")

Do you see any evidence of violations of the ANOVA assumptions?

Take a look at the residual plots. Can you see the violations more easily?

spray_full <-  lm(count ~ spray, data = InsectSprays) 
qplot(.fitted, .resid, data = spray_full ) 
qplot(spray, .resid, data = spray_full )
qplot(sample = .resid, data = spray_full) + stat_qqline()

Do the residual plots look better if we instead model the mean of the squareroot of insect count?

spray_full_sqrt <-  lm(sqrt(count) ~ spray, data = InsectSprays)
qplot(.fitted, .resid, data = spray_full_sqrt  )
qplot(spray, .resid, data = spray_full_sqrt  )
qplot(sample = .resid, data = spray_full_sqrt ) + stat_qqline()

What would you conclude from this ANOVA?

## Analysis of Variance Table
## Response: sqrt(count)
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## spray      5 88.438 17.6876  44.799 < 2.2e-16 ***
## Residuals 66 26.058  0.3948                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear combinations of means

In lecture on Friday we talked about inference on linear combinations of means in multiple group settings. Let’s do an example using R and the Spock data. Imagine we want to compare Spock to Judge C. Since this is a two group comparison the natural comparison to make is the difference of their means, i.e. \(\gamma = \mu_{\text{Spock}} - \mu_{\text{C}} \). Let’s work through finding a 95% CI for this difference and testing whether it is zero in R. First, we need to calculate sample averages, standard deviations and sizes, and the pooled standard deviation:

# sample averages
averages <- with(case0502, tapply(Percent, Judge, mean))
# sample sds
sds <- with(case0502, tapply(Percent, Judge, sd))
# sample sizes
ns <- with(case0502, tapply(Percent, Judge, length))
sp <- sqrt(sum(sds^2*(ns-1))/sum(ns - 1))
df <- sum(ns-1)

Take a look at averages


When we write our linear combination we need to make sure it’s in the same order as our averages are in R. Here Judge C is the 3rd entry and Spock’s Judge is the 7th entry. Our constants in our linear combination would be: \( C_1 = 0, C_2 = 0, C_3 = -1, C_4 = 0, C_5 = 0, C_6 = 0, C_7 = 1 \). In R we create a new vector to represent the C’s:

C <- c(0,0,-1,0,0,0,1)

Now finding the estimate of the linear combination and it’s standard error is easy:

g <- sum(C*averages) 
se_g <- sp * sqrt(sum(C^2/ns))

And we can use them to get 95% confidence intervals or the p-value for the t-test that the linear combination is zero

g + c(-1, 1) * qt(0.975, df) *se_g
2*(1 - pt(abs(g / se_g), df))

Repeat the steps to compare Spock’s mean to the average of the means of the other judges, i.e. \( C_1 = -1/6, C_2 = -1/6, C_3 = -1/6, C_4 = -1/6, C_5 = -1/6, C_6 = -1/6, C_7 = 1 \)

Got spare time?

Do the practice problems posted under lecture for Nov 9th.