Goals:
For this lab we’ll make use of this data from the OpenIntro book:
The Great Britain Office of Population Censuses and Surveys once collected data on a random sample of 170 married couples in Britain, recording the ages (in years) and heights (in mm) of the husbands and wives.
We’ll be interested in using the husband’s age to predict the wife’s age. It’s always good to start with a scatterplot of the data. By convention, we put the explanatory variable on the x-axis and the response on the y-axis.
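The scatterplot described above might look like this, assuming the data are in a data frame called `couples` with the husband's age in a column `HAge` and the wife's age in `WAge` (the data frame and column names here are assumptions, not necessarily the lab's actual names):

```r
library(ggplot2)

# Explanatory variable (husband's age) on the x-axis,
# response (wife's age) on the y-axis
ggplot(couples, aes(x = HAge, y = WAge)) +
  geom_point() +
  labs(x = "Husband's age (years)", y = "Wife's age (years)")
```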
In general we see a pretty strong linear trend. We can add the least squares simple linear regression line by adding a `geom_smooth` layer, using the `lm` function as our model for the mean:
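A sketch of this, with the same assumed names as above (`couples`, `HAge`, `WAge`):

```r
library(ggplot2)

# method = "lm" tells geom_smooth to fit a least squares line;
# se = FALSE suppresses the confidence band around the line
ggplot(couples, aes(x = HAge, y = WAge)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Husband's age (years)", y = "Wife's age (years)")
```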
While `geom_smooth` happily adds a linear regression line, it doesn't give us any information about the line that was fit. To get that information we need to fit the model ourselves using `lm`. For a simple linear regression, the first argument is always of the form `response ~ explanatory` (i.e. `y ~ x`; note this is the opposite order to our plot, where `x` comes first). Here we'll fit the model of interest and save it in a variable called `fit`:
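With the assumed data frame and column names, the fit might look like:

```r
# Wife's age modeled as a linear function of husband's age
fit <- lm(WAge ~ HAge, data = couples)
```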
There are a lot of functions designed specifically to operate on model fits. For example, `summary()` gives a summary of the fitted model. Can you identify the estimates of the slope and intercept? Can you find the estimated subpopulation standard deviation, σ?
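In code, reusing the model object `fit` from above:

```r
summary(fit)
```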
Sometimes you want to take the output of `summary()` and use it for further calculations. If we use the `str()` function to examine the structure of `summary(fit)`, we see it is a list, and we can access its elements with `$`. For example, we might pull out the estimate of σ:
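For example (`sigma` is the documented name of this component of a `summary.lm` object):

```r
str(summary(fit))     # summary(fit) is a list; str() shows its components

# The "residual standard error" is our estimate of sigma
sigma_hat <- summary(fit)$sigma
sigma_hat
```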
Before we interpret the model or make any predictions, it would be wise to check the simple linear regression assumptions. We generally look at three plots: residuals against fitted values, residuals against the explanatory values, and a normal probability plot of the residuals.
Adding a horizontal line at zero and a smooth curve aids the eye when looking for violations of the assumptions:
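One way to make this plot, reusing the model object `fit`:

```r
library(ggplot2)

# Residuals against fitted values, with a zero line and a smooth
diag_df <- data.frame(fitted = fitted(fit), resid = residuals(fit))
ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted values", y = "Residuals")
```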
We are looking for our residuals to be centered (in the vertical direction) around the zero line for the linearity assumption, and to have roughly equal spread (again in the vertical direction) as we move from right to left for the constant variance assumption. In this case linearity looks good: for all fitted values the residuals seem centered around zero. There is maybe some indication of less spread for smaller fitted values, but it is quite mild.
A similar story is told by the plot of the residuals against the explanatory values, husband’s age:
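A sketch, again assuming the husband's ages are in `couples$HAge`:

```r
library(ggplot2)

ggplot(data.frame(HAge = couples$HAge, resid = residuals(fit)),
       aes(x = HAge, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(x = "Husband's age (years)", y = "Residuals")
```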
The normal probability plot of the residuals reveals that they have longer tails than we would expect from Normal data. We should be careful with our prediction intervals: assuming Normality when in fact we have a longer-tailed distribution will lead to prediction intervals that are too narrow.
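This plot can be produced with ggplot2's `geom_qq()` (or with base R's `qqnorm()` and `qqline()`):

```r
library(ggplot2)

# Normal probability (Q-Q) plot of the residuals with a reference line
ggplot(data.frame(resid = residuals(fit)), aes(sample = resid)) +
  geom_qq() +
  geom_qq_line()
```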
Ok, the Normality assumption may be suspect, but we can still interpret confidence intervals on the parameters of the model. Our fitted line is:
\( \hat{\mu} \){Wife's age | Husband's age} = 1.57 + 0.91 × Husband's age. Confidence intervals for the parameters can be found with the `confint` function:
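In code:

```r
confint(fit)   # 95% intervals by default; use level = to change
```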
With 95% confidence, an increase of one year in the husband's age is associated with an increase in the mean age of wives of between 0.86 and 0.96 years. What kinds of questions could we answer with these intervals?
Imagine we want a confidence interval for the mean age of wives whose husbands are 25, 40 and 55 years old. We first need to set up a new data.frame that contains just these values, then use the `predict` function with our fitted model and the new data.
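A sketch, using the assumed column name `HAge`:

```r
# The new data.frame's column name must match the explanatory
# variable used in the model formula
new_ages <- data.frame(HAge = c(25, 40, 55))
predict(fit, newdata = new_ages, interval = "confidence")
```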
Finding prediction intervals for the age of a wife whose husband is 25, 40 or 55 is similar, except we substitute `"prediction"` for `"confidence"`:
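That is:

```r
new_ages <- data.frame(HAge = c(25, 40, 55))
predict(fit, newdata = new_ages, interval = "prediction")
```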
Notice how much wider the prediction intervals are. With 95% confidence, the mean age of wives with 25-year-old husbands is between 23.3 and 25.4 years, but we would predict the actual age of a wife with a 25-year-old husband to be between 16.5 and 32.2 years. Remember, though, that we had some doubts about the Normality assumption, and this casts doubt on the accuracy of our prediction intervals. Since we observed longer tails in the residuals than expected, we might argue these prediction intervals are probably a little narrow.
Using the same data, predict the wife's height (`WHght`) from the husband's height (`HHght`):
To add predictions (or confidence intervals) to a plot of the raw data, we need to combine the values needed to plot (i.e. the explanatory variable) with the predictions and interval endpoints. Here's a simple example:
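One possible sketch, with the same assumed names as before (`couples`, `HAge`, `WAge`):

```r
library(ggplot2)

new_ages <- data.frame(HAge = c(25, 40, 55))

# predict() returns a matrix with columns fit, lwr and upr;
# cbind() attaches the explanatory values used to make the predictions
preds <- cbind(new_ages,
               predict(fit, newdata = new_ages, interval = "prediction"))

# Raw data as points, predictions with interval endpoints on top
ggplot(couples, aes(x = HAge, y = WAge)) +
  geom_point() +
  geom_pointrange(data = preds,
                  aes(y = fit, ymin = lwr, ymax = upr),
                  colour = "blue")
```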