Stat 411/511

Lab 4

Oct 26/27

Today’s goals

  • Examining data to verify assumptions
  • Taking logarithms and exponentials (aka antilogarithms) in R.

Examing data for violations of assumptions

There are two graphical tools for examining two samples for violations of the two sample t-test assumptions: histograms and normal probability plots. Histograms are most helpful for diagnosing samples that come from populations with different standard deviations, although they will also reveal problems like skewness or outliers. Normal probability plots are best for checking Normality.

Let’s use the Finch beak depth data to illustrate the code

library(Sleuth3)
library(ggplot2)
head(case0201)
##   Year Depth
## 1 1976   6.2
## 2 1976   6.8
## 3 1976   7.1
## 4 1976   7.1
## 5 1976   7.4
## 6 1976   7.8

Equal population standard deviations

We already know how to plot histograms for two samples:

qplot(Depth, data = case0201) + facet_wrap(~ Year, ncol = 1)

The ncol = 1 makes it easy to compare the spread of the two samples. These two samples seem to have similar spread, so there isn’t evidence they come from populations with different standard deviations. It’s even easier to compare if we shift each sample to have the same center by subtracting off the sample average for each year. (Don’t worry about the first line of this code, we’ll see it when we talk about multiple groups)

case0201$year_avg <- with(case0201, ave(Depth, Year))
qplot(Depth - year_avg, data = case0201) + facet_wrap(~ Year, ncol = 1)

There are a few large beak depth values in 1976 that make it appear that that year is a little more spread out, but we don’t want to be overly influenced by a few values in a sample of 89 values.

Another trick to help decide if these sample have unusually different spreads for samples coming from populations with the same standard deviations, is to compare them to plots of simulated data where we know the population standard deviations are the same. Run this code to get two functions: simulated_hists and simulated_prob_plots, I wrote to simulate data with similar samples sizes and averages as our data but drawn from Normal populations with equal standard deviations.

source(url("http://stat511.cwick.co.nz/code/simulated_plots.r"))

Run:

simulated_hists("Depth", "Year", case0201)

That will create a plot just like the one we have for the real data, but it is based on simulated data the we know comes from populations with the same standard deviation. This helps us understand the sampling variability we would see in the spread of the histograms when the assumption is not violated.

Run the above command a few more times, then look at the real data again. The difference in spread in the real data, certainly doesn’t seem unusual compared to samples from populations with the same standard deviation. There is no evidence that the populations have different standard deviations.

Normal probability plots

Normal probability plots are also easy to do in ggplot2

qplot(sample = Depth, data = case0201) + facet_wrap(~ Year)

The only difference between this and the histogram code is that we named the first argument sample. The first argument is ordinarily x and qplot does a histogram by default. If qplot is given a sample argument instead it does a normal probability plot. To add a line, I wrote some code. Grab my function

source(url("http://stat511.cwick.co.nz/code/stat_qqline.r"))

then you can add lines to your plots:

qplot(sample = Depth, data = case0201) + facet_wrap(~ Year) +
  stat_qqline()

The points from both samples show some systematic deviations from the straight line. In 1976, there are small beak depths that are smaller than we would expect from a Normal distribution (they lie below the line on the left), and there are large beak depths that are also smaller than we would expect from a Normal distribution (they lie below the line on the right). This is a sign the data are skewed. Again we can get a grip on the sampling variability by simulating data with the same properties from Normal distirbutions and repeating the plots.

simulated_prob_plots("Depth", "Year", case0201)

Run the code a few times and compare it to the real data.

Looks like the deviation from a straight line we see in the real data is unlikely to be just due to sampling varibility. We have evidence the samples are not from Normal populations.

So, why could we do a two sample t-test? We go back to the histograms and see that the skew is mild, and similar in both samples. The samples sizes are the same, and quite large (89). We therefore appeal to the Central Limit Theorem and claim robustness of the two sample t-test to this violation. We could further back this up by doing some simulation and estimating the coverage probability of confidence intervals based on the two sample t using data that exhibits a similar amount of skew.

Practice identifying violations with these data sets.

  • Compare Rainfall between the Seeded and Unseeded Treatment in case0301.

  • Compare Lifetime between the Bacilli and Control Group in ex0211.

  • Compare the BP (reduction in blood pressure) between the Fish oil and Regular oil Diet groups in ex0112. Why is this one harder than the others?

Logarithms (and a two-sample t-test)

In lecture last week we talked about the logarithmic transformation. The idea is that if our data looks like it might not satisfy the assumptions of the t-tools but a transformation of the data will, then we transform the data, do a t-test on the transformed data, then back-transform the estimate and confidence interval.

There are many possible transformations. We will only cover the logarithmic transformation. It is useful for positive data, that exhibits right skew and different standard deviations. The natural logarithm is the function log in R. It’s inverse is the exponential (or antilogarithm) function, exp in R.

x <- 1:10
log(x)
exp(log(x))

In class we will see the Cloud Seeding case study. A randomized experiment is performed where rainfall is measured after “Seeding” a cloud and after “Not Seeding” a cloud.

A histogram of the raw data:

library(ggplot2)
library(Sleuth2)
qplot(Rainfall, data = case0301) + facet_wrap(~ Treatment, ncol = 1)

Notice the data is very right skewed, and there may be evidence the Rainfall from Seeded clouds might come from a population with a slightly larger standard deviation than that of Unseeded clouds.

A histogram of the logarithm of the rainfall is created by surrounding Rainfall with the log function

qplot(log(Rainfall), data = case0301) + facet_wrap(~ Treatment, ncol = 1)

The easiest way to work with the transformed data is to add a new column to the data.frame to hold the log transformed data. We already know how to extract a column using the $. To add a new column to a dataframe we use the $ on the left hand side of the assignment (->). For example,

case0301$log_rainfall <- log(case0301$Rainfall)

creates a column in case0301 that contains the log transformed values from the Rainfall column in case0301. Verify this by looking at the first few rows of the data.frame:

head(case0301)

We can then get summary statistics on the log rainfall.

mean(subset(case0301, Treatment == "Seeded")$log_rainfall)
mean(subset(case0301, Treatment == "Unseeded")$log_rainfall)
sd(subset(case0301, Treatment == "Seeded")$log_rainfall)
sd(subset(case0301, Treatment == "Unseeded")$log_rainfall)

Compare the averages of the logarithm of rainfall (above) to the logarithm of the average rainfall (below):

log(mean(subset(case0301, Treatment == "Seeded")$Rainfall))
log(mean(subset(case0301, Treatment == "Unseeded")$Rainfall))

The average of the log transformed data IS NOT the logarithm of the average of the raw data. This is true for our samples but it is also true at the population level too. The logarithm of the means of the population distributions of rainfall are not the same as the means of the population distributions of the logarithm of rainfall.

In lecture we will conduct a two sample t-test on the logarithm of Rainfall. This is as simple as:

t.test(log_rainfall ~ Treatment, data = case0301, var.equal = TRUE)
## 
## 	Two Sample t-test
## 
## data:  log_rainfall by Treatment
## t = -2.5444, df = 50, p-value = 0.01408
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.0466973 -0.2408651
## sample estimates:
## mean in group Unseeded   mean in group Seeded 
##               3.990406               5.134187

The tricky part is then backtransforming the estimates and intervals to the rainfall scale. It’s easiest if we save our t-test results into a new object, then pull out the important information, and backtransform it:

t_test <- t.test(log_rainfall ~ Treatment, data = case0301, var.equal = TRUE)
# the estimated means on the log scale
t_test$estimate
## mean in group Unseeded   mean in group Seeded 
##               3.990406               5.134187
# the difference in means on the log scale 
# the use of as.numeric just gets rid of a misleading label
as.numeric(t_test$estimate[1] - t_test$estimate[2])
## [1] -1.143781
# the backtransformed estimate
exp(as.numeric(t_test$estimate[1] - t_test$estimate[2]))
## [1] 0.318612
# the confidence interval on the log scale
t_test$conf.int
## [1] -2.0466973 -0.2408651
## attr(,"conf.level")
## [1] 0.95
# backtransform the confidence interval onto the raw scale
exp(t_test$conf.int)
## [1] 0.1291608 0.7859477
## attr(,"conf.level")
## [1] 0.95

We saw how to interpret these backtransformed estimates in lecture.