Stat 411/511

Lab 8

Nov 30/Dec 1

Goals:

  • Fitting a linear regression model in R and extracting parameters
  • Assessing linear regression assumptions
  • Calculating confidence intervals

Fitting a linear regression model in R

For this lab we’ll make use of this data from the OpenIntro book:

The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (in mm) of the husbands and wives.

age <- read.table(url("http://stat511.cwick.co.nz/labs/husbandwife.txt"), 
  sep = "\t", header = TRUE)
head(age)

We’ll be interested in using the husband’s age to predict the wife’s age. It’s always good to start with a scatterplot of the data. By convention, we put the explanatory variable on the x-axis and the response on the y-axis.

library(ggplot2)
qplot(HAge, WAge, data = age)

In general we see a pretty strong linear trend. We can add the least squares simple linear regression line by adding a geom_smooth layer using the lm function as our model for the mean:

qplot(HAge, WAge, data = age) + geom_smooth(method = "lm")

While geom_smooth happily adds a linear regression line, it doesn’t give us any information about the line that was fit. To get that information we need to fit the model ourselves using lm. For a simple linear regression the first argument is always a formula of the form response ~ explanatory (y ~ x; note that this is the opposite order to our plot). Here we’ll fit the model of interest and save it in a variable called fit:

fit <- lm(WAge ~ HAge, data = age)

There are a lot of functions designed specifically to operate on model fits, for example:

summary(fit)

gives a summary of the fitted model. Can you identify the estimates of the slope and intercept? Can you find the estimated subpopulation standard deviation, σ? Sometimes you want to take the output of summary and use it for further calculations. If we use the str function to examine the structure of summary(fit), we see it is a list, and we can access its elements with $. For example, we might pull out the estimate of σ:

str(summary(fit))
summary(fit)$sigma
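
If you want the estimated intercept and slope for further calculations, rather than reading them off the printed summary, the coef function and the coefficients element of summary(fit) are two standard ways to get at them (a quick sketch, not part of the original lab code):

# estimated intercept and slope as a named vector
coef(fit)
# the full coefficient table: estimates, standard errors, t-statistics and p-values
summary(fit)$coefficients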

Assessing assumptions

Before we interpret the model or make any predictions, it would be wise to check the simple linear regression assumptions. We generally look at three plots: residuals against fitted values, residuals against the explanatory variable, and a normal probability plot of the residuals.
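
A note on how the plots below work: ggplot2 can use the model object directly as data, converting (fortifying) it into a data frame whose columns include the fitted values (.fitted) and residuals (.resid). A rough hand-built equivalent, just to show where those columns come from (a sketch, not the lab’s code):

# roughly the data frame ggplot2 builds from the model object
diag_df <- data.frame(.fitted = fitted(fit), .resid = residuals(fit))
head(diag_df)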

Adding a horizontal line at zero and a smooth aids the eye in looking for violations of the assumptions:

qplot(.fitted, .resid, data = fit)  +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  ggtitle("Residuals against fitted values")

We are looking for the residuals to be centered (in the vertical direction) around the zero line, for linearity, and to have a roughly equal spread (again in the vertical direction) as we move from right to left, for the constant variance assumption. In this case, linearity looks good: for all fitted values the residuals seem centered around zero. There is maybe some indication of less spread for smaller fitted values, but it is quite mild.

A similar story is told by the plot of the residuals against the explanatory values, husband’s age:

qplot(HAge, .resid, data = fit) +
  ggtitle("Residuals against HAge")

The normal probability plot of the residuals:

source(url("http://stat511.cwick.co.nz/code/stat_qqline.r"))
qplot(sample = .resid, data = fit)  +
  stat_qqline() +
  ggtitle("Normal probability plot of residuals")

reveals the residuals have longer tails than we would expect from Normal data. We should be careful with our prediction intervals: assuming Normality when in fact we have a distribution with longer tails will lead to prediction intervals that are too narrow.

Inference

OK, the Normality assumption may be suspect, but we can still interpret confidence intervals on the parameters of the model. Our fitted line is: \( \hat{\mu}\{\text{Wife's age} \mid \text{Husband's age}\} = 1.57 + 0.91 \times \text{Husband's age} \). Confidence intervals for the parameters can be found with the confint function:

confint(fit)

With 95% confidence, an increase in the husband’s age of one year is associated with an increase in the mean wife’s age of between 0.86 and 0.96 years. What kind of questions could we answer with these intervals?

  • Is the slope zero? That would imply the average wife’s age doesn’t depend on the husband’s age; I think we already know that isn’t true.
  • Is the slope one? That would imply the mean wife’s age differs from the husband’s age by only the intercept, some constant for all ages of husbands. This could be interesting, and we have evidence against it (our CI for the slope doesn’t contain one); see the short sketch after this list.
  • Is the intercept zero? Not really relevant here: no one gets married to a husband of zero years. It might be interesting in combination with a slope of one. If the intercept is zero and the slope is one, then the mean wife’s age equals the husband’s age.
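
For the slope-equals-one question the confidence interval already gives the answer, but if you wanted the matching t-statistic and p-value you could build them from the coefficient table. A rough sketch using the usual t-based theory (not code from the original lab):

# t-test of H0: slope = 1, built from the estimate and its standard error
est <- summary(fit)$coefficients["HAge", "Estimate"]
se  <- summary(fit)$coefficients["HAge", "Std. Error"]
t_stat <- (est - 1) / se
# two-sided p-value on n - 2 residual degrees of freedom
2 * pt(-abs(t_stat), df = fit$df.residual)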

Imagine we want a confidence interval for the mean of the wives’ ages for husbands who are 25, 40 and 55 years old. We first need to set up a new data.frame that contains just these values, then use the predict function with our fitted model and new data.

newdf <- data.frame(HAge = c(25, 40, 55))
predict(fit, newdf, interval = "confidence")

Finding prediction intervals for the age of a wife whose husband is 25, 40 or 55 is similar, except we substitute "prediction" for "confidence":

predict(fit, newdf, interval = "prediction")

Notice how much wider the prediction intervals are. With 95% confidence, the average age of wives with 25 year old husbands is between 23.3 and 25.4 years, but with 95% confidence we would predict the actual age of a wife with a 25 year old husband to be between 16.5 and 32.2 years. But remember we had some doubts about the Normality assumption, which casts the accuracy of our prediction interval into question. Since we observed longer tails in our residuals than expected, we might argue this prediction interval is probably a little narrow.
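
If you want to see the difference numerically rather than by eye, you can compare the widths of the two kinds of intervals directly (a quick sketch built from the predict output above):

conf_int <- predict(fit, newdf, interval = "confidence")
pred_int <- predict(fit, newdf, interval = "prediction")
# interval widths (upper minus lower) at husband's ages 25, 40 and 55
conf_int[, "upr"] - conf_int[, "lwr"]
pred_int[, "upr"] - pred_int[, "lwr"]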

Your turn

Using the same data predict the wife’s height (WHght) from the husband’s height (HHght):

  • Make a scatterplot of the wife’s height against the husband’s height and include the fitted linear regression line.
  • Fit the simple linear regression model and explore the residual plots for violations of the assumptions.
  • Interpret the fitted coefficients of the fitted model.
  • Produce a confidence interval for the mean height of wives whose husbands are 1600mm tall.
  • Produce a prediction interval for the height of a wife whose husband is 1600mm tall.

Adding predictions to a plot (optional)

To add predictions (or confidence intervals) to a plot of the raw data we need to combine the values needed to plot (i.e. the explanatory variable) with the predictions and interval end points. Here’s a simple example:

PI <- cbind(newdf, predict(fit, newdf, interval = "prediction"))
PI 

qplot(HAge, WAge, data = age) +
  geom_pointrange(aes(y = fit, ymin = lwr, ymax = upr), colour = "blue",
    data = PI)
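
If you would rather show the interval as a band than at a few individual values, one approach is to predict over a fine grid of the explanatory variable and add a ribbon layer. A sketch (the grid size and styling are arbitrary choices, not from the original lab):

# confidence intervals for the mean over a fine grid of husband's ages
grid <- data.frame(HAge = seq(min(age$HAge), max(age$HAge), length.out = 100))
CI <- cbind(grid, predict(fit, grid, interval = "confidence"))

qplot(HAge, WAge, data = age) +
  geom_ribbon(aes(x = HAge, ymin = lwr, ymax = upr), data = CI,
    alpha = 0.2, inherit.aes = FALSE) +
  geom_line(aes(y = fit), data = CI, colour = "blue")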