Stat 411/511

Lab 2

Jan 20th

Performing a randomization test in R

In class on Friday will we see a randomization test on the Creativity Study and you will perform one for homework 2. This lab will work through some of the code you will need to do this in R.

In class we’ll see an analogy between the criminal justice system and the statistical justice system. We can think of statistical testing as charging our data with a crime, and reaching a judgement by comparing the evidence provided by the data to that of innocent data.

There are three main steps in a randomization test:

  1. Calculate the test statistic on the data (the evidence for it’s crime)
  2. Calculate the randomization distribution of the test statistic by calculating the test statistic on either: all possible alternative group assignments, or a large random sample of possible alternative group assignments (the evidence on innocent data for the crime).
  3. Calculate the p-value by comparing the test statistic from the data to those from the randomization distribution (what’s the probability an innocent dataset would look as (or more) guilty than our data?)

Some new R things you’ll learn in this lab:

  • How to generate a random group assignment
  • Another way to subset
  • Writing a basic R function
  • Repeating something many times

The test statistic on the data

Following Section 1.3, the test-statistic we calculate is the difference in average score of the Intrinsic questionnaire group and the Extrinsic questionnaire group. Think of this as a summary of the evidence the data provides for the Intrinsic questionnaire increasing creativity scores.

intrinsic_scores <- subset(case0101, Treatment == "Intrinsic")$Score
extrinsic_scores <- subset(case0101, Treatment == "Extrinsic")$Score
test_statistic <- mean(intrinsic_scores) - mean(extrinsic_scores) 
test_statistic      # 4.14

The test-statistic on one random group assignment

Let’s assume the first group has n1 subjects, and the second group n2, and that this has been fixed beforehand. In the creativity study, the extrinsic group has 23 subjects, and the intrinsic group 24. There are a number of ways to randomly assign groups. The way I’ll demonstrate here, randomly reorders the data, then assigns the first n1 outcomes to the first group and the rest to the second group. (Aside: In a randomization test we assign the outcomes randomly to groups. If we were setting up a study we would be assigning subjects randomly to groups. The principle is the same but instead of reordering outcomes we would reorder some subject identifier).

The sample function, randomly draws numbers from a vector of numbers. By default it draws all of the numbers without replacement, so you get a random reordering. (There are other ways to use the function, read ?sample if you are interested).

For example, let x be the numbers from 1 to 10:

x <- 1:10
sample(x) # gives one reordering (check all the numbers are there)
sample(x) # run it again and you get a different reordering

For the creativity score data:

reordered_scores <- sample(case0101$Score)  # I want to keep this around
reordered_scores

To get a new grouping I let the first 23 subjects (in this new ordering) be assigned the Extrinsic questionnaire and the remaining 24 subjects be assigned the Intrinsic questionnaire. I could calculate the test statistic on this random group assignment with

mean(reordered_scores[24:47]) - mean(reordered_scores[1:23]) 

I get 0.678 but you will get something different because our group reassignments will be different. I’m using [ to subset in a new way: by position. Since my test statistic on the data was Intrinsic - Extrinsic, I need to mirror that here. This value is one test-statistic from the randomization distribution. Any difference between the two groups in this randomly assigned data is due to chance alone. Think of this test-statistic as one example of what innocent data might look like.

Getting lots more test-statistics from random group assignments

Now, we need to repeat the above 1000 times. The simplest way to do this is to encapsulate what we just did inside a function, then use the R function replicate to repeat the process many times.

First, let’s write a new function. Copy and paste the following into your R session (or into your editor and run it).

get_innocent_stat <- function(){    
  reordered_scores <- sample(case0101$Score)  
  mean(reordered_scores[24:47]) - mean(reordered_scores[1:23]) 
}

You won’t see any result. What we have done is create new R function called get_innocent_stat. The two central lines are identical to our code above but we’ve wrapped it in some syntax that tells R this is a new function. Run these lines and compare them:

get_innocent_stat
get_innocent_stat()
get_innocent_stat()     Every time we run  `get_innocent_stat()` we get another test-statistic from a random group assignment.  To do that 1000 times and keep the result we use the function `replicate`.  The first argument specifies how many times to run the second argument (and I'll save them in an object called innocents).

innocents <- replicate(1000, get_innocent_stat())

We can now compare the test-statistic from the data, to these test-statistics from innocent data.

qplot(innocents)
qplot(innocents, fill = innocents >= test_statistic) The plotting code looks a bit different to what you have seen because innocents is a vector, not a data.frame.  

If you want to add a line where the test-statistic is you can do:

qplot(innocents) + geom_vline(xintercept = test_statistic)

Calculating a p-value

The are a number of ways to do this, we want to count the number of innocents that look at least as guilty as our data. In terms of our R objects, we want to know how many numbers in innocents are greater than test_statistic

Some options:

sum(innocents >= test_statistic)/length(innocents)
table(innocents >= test_statistic) # you can also calculate it from this 
mean(innocents >= test_statistic)

I get 0.006, you might get something slightly different. This is a one-sided test. How would you do a two-sided test?

This file contains all the code from this lab.