This lab is relatively short; use any spare time to get started on your data analysis.
Using the tapply function to find summaries by group
Log transform practice
For the purposes of this lab, we’ll use the data you are using for Data Analysis #1. You don’t need any of the material here to complete Data Analysis #1, but it may help when formulating your own question about the data.
Quite often we want summary statistics calculated within some groups. We’ve already seen a strategy for finding the average, standard deviation and sample size for certain groups: we used subset to get the observations that correspond to one group, then calculated our summary using that subset. For example, if we wanted to know the average number of bedrooms for households that rent, we could do:
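A minimal sketch of this approach is below. It uses a small made-up data frame, acs_toy, as a stand-in for the real acs data, and the label "Rent" is an assumption; check levels(acs$own) for the labels actually used in your data.

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Keep only the households that rent, then summarise that subset
renters <- subset(acs_toy, own == "Rent")
rent_mean <- mean(renters$bedrooms)
rent_mean
```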
If we also wanted the average number of bedrooms for households that own free and clear, we would repeat the process with some modifications:
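As a sketch, the only things that change are the category label and the name of the subset. The data frame acs_toy and the label "Own free and clear" are illustrative assumptions, not the real acs coding.

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Same two steps as before, with a different category
owners <- subset(acs_toy, own == "Own free and clear")
own_mean <- mean(owners$bedrooms)
own_mean
```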
But there are another two categories in own, and doing this soon becomes tiring! Luckily there is an easier way: the tapply function. tapply (short for table apply) takes three arguments: a numeric vector you want to summarise, a factor vector that describes the categories to summarise within, and a function to do the summarising. For example:
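A sketch of the three-argument call, again on a made-up data frame (acs_toy and its category labels are assumptions standing in for the real acs data):

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Arguments: numeric vector, grouping factor, summary function
bedroom_means <- tapply(acs_toy$bedrooms, acs_toy$own, mean)
bedroom_means  # one mean per level of own, named by level
```

The result has one entry per level of the factor, so a single call replaces a separate subset and mean for every category.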
It’s easier to read what’s happening from right to left: we want to take the mean, in each category of acs$own, of the acs$bedrooms variable.
We can save typing the acs$ part by using the with function:
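For instance, with a made-up data frame (acs_toy is a hypothetical stand-in for acs), the call might look like this. Inside with, column names are looked up in the data frame, so the prefix is no longer needed.

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# bedrooms and own are found inside acs_toy
means_with <- with(acs_toy, tapply(bedrooms, own, mean))

# Same result as spelling out the data frame each time
means_long <- tapply(acs_toy$bedrooms, acs_toy$own, mean)
all.equal(means_with, means_long)
```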
We can use the same idea to find the sample standard deviations for each group.
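A sketch on made-up data (acs_toy is a hypothetical stand-in for acs): only the third argument changes, from mean to sd.

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# Sample standard deviation of bedrooms within each level of own
bedroom_sds <- with(acs_toy, tapply(bedrooms, own, sd))
bedroom_sds
```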
To get the number of observations for each group:
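One way to do this, assuming length is used as the summary function, is sketched below on made-up data (acs_toy is a hypothetical stand-in for acs).

```r
# Toy stand-in for the acs data (values are made up for illustration)
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, 3, 4, 3, 1)
)

# length returns how many values fall in each group
group_n <- with(acs_toy, tapply(bedrooms, own, length))
group_n
```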
But be aware that if there are missing values this can be dangerous, because missing values are counted too. There aren’t any here, but a safer way to count observations is:
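One common option, which may not be exactly what this lab originally showed, is to sum a logical vector that marks the non-missing values. The sketch below uses made-up data with one missing value added on purpose (acs_toy is a hypothetical stand-in for acs).

```r
# Toy stand-in for the acs data, with one missing bedrooms value
acs_toy <- data.frame(
  own = factor(c("Rent", "Rent", "Own free and clear",
                 "Own free and clear", "Rent")),
  bedrooms = c(2, NA, 4, 3, 1)
)

# length counts every row in the group, including the NA
n_all <- with(acs_toy, tapply(bedrooms, own, length))

# !is.na() is TRUE for observed values; summing counts the TRUEs
n_observed <- with(acs_toy, tapply(!is.na(bedrooms), own, sum))

n_all
n_observed
```

In this toy data the two counts differ for the renting group, which is the situation the warning above is about.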
Can you find the mean electricity cost by the decade the house was built?
Can you find the mean and median income of the husband by whether the household has internet or not?
The dataset email_sample is a random sample of 100 spam emails and 100 non-spam (i.e. ham) emails from the dataset emails in the openintro package. For each email the number of characters in the email is recorded.
(Actually num_char was the number of characters in thousands, hence the decimals.)