This lab is relatively short, so use any spare time to get started on your data analysis.
tapply function to find summaries by group
Log transform practice
For the purposes of this lab, we’ll use the data you are using for Data Analysis #1. You don’t need any of the material here to complete Data Analysis #1, but it may help when formulating your own question about the data.
Quite often we want summary statistics calculated within some groups. We’ve already seen a strategy for finding the average, standard deviation and sample size for certain groups: we used subset to get observations that correspond to one group, then calculated our summary using that subset. For example, if we wanted to know the average number of bedrooms for households that rent, we could do:
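A minimal sketch of that subset approach on a small made-up data frame. Here acs_toy, its values and the level labels of own are hypothetical stand-ins for the real acs data; only the column names own and bedrooms come from the lab.

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Keep only the renting households, then summarise that subset
renters <- subset(acs_toy, own == "Rented")
rent_mean <- mean(renters$bedrooms)
rent_mean
```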
If we also wanted the average number of bedrooms for households that own free and clear, we would repeat the process with some modifications:
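As a sketch, the repeated step might look like the following, again on a made-up data frame (acs_toy and the label "Owned free and clear" are assumptions, not the real level names in acs). Only the condition inside subset changes.

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Same recipe as before, with a different category picked out
owners <- subset(acs_toy, own == "Owned free and clear")
own_mean <- mean(owners$bedrooms)
own_mean
```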
But there are another two categories in own, and doing this soon becomes tiring! Luckily there is an easier way: the tapply function.
tapply (short for table apply) takes three arguments: a numeric vector you want to summarise, a factor vector that describes the categories to summarise within, and a function to do the summarising. For example:
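A short sketch of the three-argument call, run on a hypothetical toy data frame (acs_toy and its level labels are made up; with the real data you would use acs$bedrooms and acs$own):

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Arguments: numeric vector, grouping factor, summary function
bed_means <- tapply(acs_toy$bedrooms, acs_toy$own, mean)
bed_means
```

The result has one entry per level of the grouping factor, so every category is summarised in a single call.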
It’s easier to read what’s happening from right to left. We want to take the mean in each category of acs$own of the acs$bedrooms variable. We can save typing the acs$ part by using the with function:
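One base R way to drop the repeated prefix is with, which evaluates an expression using the columns of a data frame. The sketch below assumes that is the approach intended here, and uses a made-up data frame acs_toy with hypothetical level labels.

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Inside with(), bedrooms and own are looked up in acs_toy
means_with <- with(acs_toy, tapply(bedrooms, own, mean))

# The long-hand version gives the same numbers
means_long <- tapply(acs_toy$bedrooms, acs_toy$own, mean)
means_with
```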
We can use the same idea to find the sample standard deviations for each group.
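As a sketch, swapping mean for sd in the same call gives the sample standard deviation within each group (acs_toy and its level labels are hypothetical):

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# sd() computes the sample standard deviation (divisor n - 1)
bed_sds <- with(acs_toy, tapply(bedrooms, own, sd))
bed_sds
```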
To get the number of observations for each group, we can count the values within each category in the same way.
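A sketch of one way to do this, assuming length is used as the counting function (acs_toy and its level labels are hypothetical):

```r
# Toy stand-in for the lab's acs data; values and level labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# length() counts every value in the group, missing or not
n_by_group <- with(acs_toy, tapply(bedrooms, own, length))
n_by_group
```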
But be aware that if there are missing values this can be dangerous. There aren’t any here, but a safer way to count observations is:
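A sketch of the difference, assuming the safer count means counting only non-missing values with sum(!is.na(x)). The toy data frame acs_toy is made up and deliberately includes one NA so the two counts disagree.

```r
# Toy data with one missing bedrooms value; values and labels are made up
acs_toy <- data.frame(
  own = factor(c("Rented", "Rented", "Owned free and clear",
                 "Owned free and clear", "Rented")),
  bedrooms = c(2, NA, 4, 3, 1)
)

# length() still counts the NA as an observation
n_all <- with(acs_toy, tapply(bedrooms, own, length))

# Counting the non-missing values ignores it
n_seen <- with(acs_toy, tapply(bedrooms, own, function(x) sum(!is.na(x))))

n_all
n_seen
```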
Can you find the mean electricity cost by the decade the house was built?
Can you find the median income of the husband by whether the household has internet or not?
email_sample is a random sample of 100 spam emails and 100 non-spam (i.e. ham) emails from the dataset emails in the openintro package. For each email the number of characters in the email is recorded.
(Actually num_char was the number of characters in thousands, hence the decimals.)