Stat 411/511

# Goals

This lab is relatively short; use any spare time to get started on your data analysis.

• Using the tapply function to find summaries by group

• Log transform practice

For the purposes of this lab, we’ll use the data you are using for Data Analysis #1. You don’t need any of the material here to complete Data Analysis #1, but it may help when formulating your own question about the data.

## tapply

Quite often we want summary statistics calculated within some groups. We’ve already seen a strategy for finding the average, standard deviation and sample size for certain groups: we used subset to get observations that correspond to one group, then calculated our summary using that subset. For example, if we wanted to know the average number of bedrooms for households that rent, we could do:
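A minimal sketch of the subset approach follows. The small `acs` data frame is a made-up stand-in for the lab data, and the group labels in `own` are assumptions, so check the actual labels with `unique(acs$own)` before adapting it.

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# Keep only the renting households, then summarise that subset
rent <- subset(acs, own == "Rented")
mean(rent$bedrooms)
```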

If we also wanted the average number of bedrooms for households that own free and clear, we would repeat the process with some modifications:
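Repeating the process for a second group might look like this. Again the data frame is a toy stand-in and the label "Owned free and clear" is an assumed spelling of that category.

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# Same two steps as before, with a different group label
free_clear <- subset(acs, own == "Owned free and clear")
mean(free_clear$bedrooms)
```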

But there are two more categories in own, and doing this soon becomes tiring! Luckily there is an easier way: the tapply function. tapply (short for table apply) takes three arguments: a numeric vector you want to summarise, a factor vector that describes the categories to summarise within, and a function to do the summarising. For example:
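A sketch of the three-argument call, using a made-up stand-in for the lab data (the four labels in `own` are assumptions):

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# Numeric vector, grouping vector, summary function
bedroom_means <- tapply(acs$bedrooms, acs$own, mean)
bedroom_means
```

The result has one entry per category of `own`, named by the category, so all four group means come from a single call.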

It’s easier to read what’s happening from right to left: we want to take the mean, within each category of acs$own, of the acs$bedrooms variable. We can save typing the acs$ part by using the with function:
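With `with`, the same call can refer to the columns by name. This sketch uses the same kind of made-up stand-in data, with assumed labels in `own`:

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# with() evaluates the expression inside acs, so no acs$ prefix is needed
means_with <- with(acs, tapply(bedrooms, own, mean))
means_with
```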

We can use the same idea to find the sample standard deviations for each group.
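Only the summary function needs to change. A sketch on made-up stand-in data (labels in `own` are assumptions):

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# Swap mean for sd to get the sample standard deviation in each group
bedroom_sds <- with(acs, tapply(bedrooms, own, sd))
bedroom_sds
```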

To get the number of observations for each group:
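One way to do this is to use `length` as the summary function. The sketch below again uses a made-up stand-in for the lab data:

```r
# Toy stand-in for the lab's acs data (values and labels are made up)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, 3, 3, 4, 3, 5, 2, 2)
)

# length() returns how many values fall in each group
group_sizes <- with(acs, tapply(bedrooms, own, length))
group_sizes
```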

But be aware that if there are missing values this can be dangerous. There aren’t any here, but a safer way to count observations is:
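A sketch of a safer count, which counts only the non-missing values in each group. The toy data below deliberately includes one `NA` (the lab data has none) so that the difference from `length` is visible; the values and labels are made up.

```r
# Toy data with one deliberately missing value (made up for illustration)
acs <- data.frame(
  own = rep(c("Rented", "Owned free and clear",
              "Owned with mortgage", "Occupied without rent"), each = 2),
  bedrooms = c(1, NA, 3, 4, 3, 5, 2, 2)
)

# length() counts the NA as if it were an observation
n_length <- with(acs, tapply(bedrooms, own, length))

# Counting the non-missing values gives the sample size actually available
n_observed <- with(acs, tapply(bedrooms, own, function(x) sum(!is.na(x))))

n_length
n_observed
```

Here the two counts disagree only for the group containing the missing value.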

• Can you find the mean electricity cost by the decade the house was built?

• Can you find the mean and median income of the husband by whether the household has internet or not?

## Log transform practice

The dataset email_sample is a random sample of 100 spam emails and 100 non-spam (i.e. Ham) emails from the dataset emails in the openintro package. For each email the number of characters in the email is recorded.

(Actually num_char was the number of characters in thousands, hence the decimals.)
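A log-scale comparison of the two groups might be set up along the following lines. This is only a sketch: `email_toy` is simulated data standing in for email_sample, the grouping column `type` is a hypothetical name, and the equal-variance t-test is an assumption about which version of the test is wanted.

```r
# Simulated stand-in for email_sample (not the real data)
set.seed(1)
email_toy <- data.frame(
  type = rep(c("ham", "spam"), each = 100),
  num_char = c(rlnorm(100, meanlog = 2, sdlog = 1),
               rlnorm(100, meanlog = 1, sdlog = 1))
)

# Two-sample t-test on the log scale (equal variances assumed)
fit <- t.test(log(num_char) ~ type, data = email_toy, var.equal = TRUE)
fit

# Back-transform the difference in log means (ham minus spam)
# to a multiplicative comparison on the original scale
ratio_est <- unname(exp(fit$estimate[1] - fit$estimate[2]))
ratio_ci <- exp(fit$conf.int)
ratio_est
ratio_ci
```

If the log-scale distributions look roughly symmetric, the back-transformed estimate and interval can be read as a ratio of medians (ham to spam) rather than a difference in means.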

• Examine histograms of the data on the raw and log-transformed scales. Do the assumptions seem better met on the log scale?
• Try writing a statistical summary from the results of the t-test. You can compare your answers to this summary.