Stat 411/511

Data Analysis 1

Due on Canvas Nov 6th @ midnight

The grading rubric.

Some good examples from a previous year: example 1 example 2 example 3

The American Community Survey (ACS) is a large survey undertaken by the US Census Bureau in the years between decennial censuses. For this analysis project, you are given a subset of the Public Use Microdata Sample for Oregon from 2013 that corresponds to households containing opposite-gender married couples (you may assume this is a simple random sample of such households in Oregon).

To get the data, with one row per household:

acs <- read.csv(url("http://stat511.cwick.co.nz/homeworks/acs_or.csv"))

You may find some steps easier with a reshaped dataset that has one row for each person:

acs2 <- read.csv(url("http://stat511.cwick.co.nz/homeworks/acs_or2.csv"))
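
To see how the two layouts relate, here is a minimal sketch on a made-up three-household data frame. The names `toy`, `husbands`, `wives` and `toy2` and all the values are invented for illustration and are not taken from the ACS files; only the column names follow the Variables section. It stacks the husband and wife columns so that each person gets a row:

```r
# Toy data in the one-row-per-household layout (values are invented).
toy <- data.frame(
  household      = c(1, 2, 3),
  age_husband    = c(45, 62, 33),
  age_wife       = c(43, 60, 35),
  income_husband = c(52000, 18000, 40000),
  income_wife    = c(48000, 21000, 0),
  bedrooms       = c(3, 2, 4)
)

# Build one row per person by stacking the husband and wife columns.
# Household-level columns (here just bedrooms) are repeated for both people.
husbands <- data.frame(
  household = toy$household,
  person    = "husband",
  age       = toy$age_husband,
  income    = toy$income_husband,
  bedrooms  = toy$bedrooms
)
wives <- data.frame(
  household = toy$household,
  person    = "wife",
  age       = toy$age_wife,
  income    = toy$income_wife,
  bedrooms  = toy$bedrooms
)
toy2 <- rbind(husbands, wives)

# Sort so each household's two rows sit together.
toy2 <- toy2[order(toy2$household, toy2$person), ]
toy2
```

The per-household layout tends to suit comparisons within a couple, while the per-person layout tends to suit plots and summaries grouped by `person`.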

The variables I am providing are a subset of those available. You can find a summary of them below.

Your task is to use the tools we have covered so far to answer the following questions:

  • In households with no children, do husbands tend to be older than their wives? By how much?
  • Do households in houses built in the 1960s or earlier spend more on electricity than those built in the 1970s or later? By how much? For this question, you’ll need to add a column to designate which category the house falls in: acs$old_house <- ifelse(acs$decade_built <= 1960, "1960's or earlier", "1970's or later")
  • (ST511 only) One other question of your choosing that you could answer using this data and the tools we have learnt so far. You may subset as appropriate to answer your question.
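
The data preparation for the first two questions might look something like the sketch below. It runs on invented toy values rather than the real file, and it assumes `decade_built` is stored as a number giving the first year of the decade (e.g. 1960), which is what the `<= 1960` comparison above suggests. The names `toy`, `no_kids` and `age_diff` are hypothetical.

```r
# Toy households (values are invented); decade_built is assumed to be
# numeric, coded by the first year of the decade (e.g. 1960).
toy <- data.frame(
  household       = 1:6,
  age_husband     = c(45, 62, 33, 71, 28, 50),
  age_wife        = c(43, 60, 35, 70, 27, 52),
  number_children = c(0, 0, 2, 0, 1, 0),
  electricity     = c(90, 120, 150, 80, 110, 95),
  decade_built    = c(1950, 1980, 1960, 1940, 2000, 1970)
)

# Question 1 setup: keep only households with no children, then compute
# the within-household age difference (husband minus wife).
no_kids <- subset(toy, number_children == 0)
no_kids$age_diff <- no_kids$age_husband - no_kids$age_wife

# Question 2 setup: label each house by age category.
toy$old_house <- ifelse(toy$decade_built <= 1960,
                        "1960's or earlier", "1970's or later")

# Check that the new column splits the decades as intended.
table(toy$decade_built, toy$old_house)
```

Cross-tabulating the new column against `decade_built` is a quick way to confirm that the cut-off landed where you meant it to before doing any analysis.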

Your report should include the following sections:

  • Introduction: Give a brief overview of the data, a little bit of background and the questions of interest. Keep this concise, understandable to someone outside of this class, free of statistical jargon and to the point. You should provide a summary graphic of the data involved and some basic summary statistics.

  • Methods: Describe your reasoning for the procedures you have chosen to answer the questions. State the assumptions of the procedures, and show or describe why you think they are reasonable assumptions in this case (or why the test might be robust to violations of the assumptions). Explain any changes, transformations or other modifications you make to the data.

  • Summary: Provide a brief non-technical summary of your findings that answers the questions of interest (like the statistical summaries we have been writing). Make sure you include some indication of the scope of inference (Can population inference be made? To what population? Can causal inference be made?).
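
For the summary statistics and graphic requested in the Introduction, one possible starting point in base R is sketched below. The data frame `toy` and its values are invented, and the choice of group sizes, means, standard deviations and side-by-side boxplots is only an assumption about what counts as "basic"; other summaries or plots may suit your question better.

```r
# Toy data (values are invented for illustration).
toy <- data.frame(
  electricity = c(90, 120, 150, 80, 110, 95),
  old_house   = factor(c("1960's or earlier", "1970's or later",
                         "1960's or earlier", "1960's or earlier",
                         "1970's or later", "1970's or later"))
)

# Group sizes, means and standard deviations of monthly electricity cost.
group_n    <- tapply(toy$electricity, toy$old_house, length)
group_mean <- tapply(toy$electricity, toy$old_house, mean)
group_sd   <- tapply(toy$electricity, toy$old_house, sd)
cbind(n = group_n, mean = group_mean, sd = group_sd)

# Side-by-side boxplots as a simple summary graphic.
boxplot(electricity ~ old_house, data = toy,
        xlab = "House age category",
        ylab = "Monthly cost of electricity")
```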

Variables

acs_or

column name Variable
household A unique ID number for each household
age_husband Age in years of the husband
age_wife Age in years of the wife
income_husband Total annual income of the husband, can include wages, retirement, interest, social security, self-employment income.
income_wife Total annual income of the wife, as above
bedrooms Number of bedrooms in the home
electricity Monthly cost of electricity
gas Monthly cost of gas
number_children The number of children in the home
internet Does the home have internet access?
mode The way the household took the survey
own Do the residents own (with or without a mortgage) or rent?
language The primary language spoken in the home
decade_built The decade the home was built

acs_or2

column name Variable
household A unique ID number for each household
person wife or husband in this household
age Age in years of person
income Total annual income of person, can include wages, retirement, interest, social security, self-employment income.
bedrooms Number of bedrooms in the home
electricity Monthly cost of electricity
gas Monthly cost of gas
number_children The number of children in the home
internet Does the home have internet access?
mode The way the household took the survey
own Do the residents own (with or without a mortgage) or rent?
language The primary language spoken in the home
decade_built The decade the home was built