Econ B2000, Statistics and Introduction to Econometrics

Kevin R Foster, the City College of New York, CUNY

For this class we’ll be using the statistical analysis program, R. The computer labs here on campus have the necessary software. If you have your own computer, download and install the program from the R Project along with R-Studio. The details will differ depending on your particular device, so be patient and work through it – there are plenty of online sources to help with any problem that arises.

But the best way to learn it is to just do it. How many times have you gotten a new game, skipped reading the instructions, and learned by crashing it a few times? Below I give you some pointers about how to just do it, but they’re meant to be read and then immediately tried out: I’ll give you some commands to just copy-and-paste into the program.

The guide, An Introduction to R, suggests these commands to give a basic flavor of what’s done and hint at the power. Don’t worry too much about each step for now; this is just a glimpse to show that with a few commands you can do some relatively sophisticated estimation – i.e. for a small cost you can get a large benefit. Copy them and paste into the “Console”.

x <- 1:20
w <- 1 + sqrt(x)/2
example1 <- data.frame(x=x, y= x + rnorm(x)*w)
attach(example1)

This creates x and y variables (where the rnorm command creates random numbers from a normal distribution), puts them into a data frame, then attaches that data frame so that R can use it in later calculations. (What’s a data frame? A way to collect together related data items.) Next some stats – create a linear model (that’s the “lm”) then a “lowess” nonparametric local regression, and plot the two estimations to compare. (Just copy and paste these, don’t worry about understanding them for now!)

fm <- lm(y ~ x)
summary(fm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3452 -1.0085 -0.0382  0.9678  3.7738 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01907    0.84519  -0.023    0.982    
## x            1.00359    0.07055  14.224 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.819 on 18 degrees of freedom
## Multiple R-squared:  0.9183, Adjusted R-squared:  0.9138 
## F-statistic: 202.3 on 1 and 18 DF,  p-value: 3.124e-11
lrf <- lowess(x, y)
plot(x, y)
lines(x, lrf$y)
abline(0, 1, lty=3)
abline(coef(fm))

detach()

Your own graph should look similar, although because of the random numbers it won’t match exactly.

The final “detach” command just cleans up; it is the opposite of “attach”. For all of these commands, you can use the help box in R-Studio. But as I said, don’t worry much about those commands for now, I’m not expecting you to become an R-ninja overnight.
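For example, a minimal sketch of pulling up the built-in documentation for the commands used above:

?lm            # opens the help page for linear models
help(lowess)   # same idea, different syntax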

With Some Data

This will go through some basic stats with R using the Census Bureau’s PUMS (Public Use Microdata Sample, from the American Community Survey, accessed from IPUMS). I have restricted the sample to contain only people living in the state of New York. The data should be in the same subdirectory as this file, so you should have already downloaded it from the class web page.

When you installed R and R-Studio, it probably created a directory/folder called R, which is likely where the program points by default. You can find this out with the command getwd() (on my computer it shows C:/Users/Kevin/Documents/CCNY/data for classes/R_lecture1). At first you might just toss all of your R files in there without further organization. Later you might want to create subdirectories, in which case you’d change the directory with a command of the form setwd("C:\\Users\\Kevin\\Documents\\R"), obviously with a different user name! (Windows uses the doubled backslash; Mac uses a slightly different notation.) For now, put this file and the data and other R stuff downloaded for the class into the default R directory.
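As a quick sketch (the paths here are just placeholders – yours will differ):

getwd()                                      # shows the current working directory
# setwd("C:\\Users\\YourName\\Documents\\R")   # Windows: note the doubled backslashes
# setwd("~/Documents/R")                       # Mac/Linux equivalent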

rm(list = ls(all = TRUE)) # clear workspace

load("pums_NY.RData")
str(dat_pums_NY)
## 'data.frame':    196314 obs. of  56 variables:
##  $ Age              : num  43 45 33 57 52 26 83 87 21 45 ...
##  $ female           : num  1 0 0 0 1 0 1 0 0 0 ...
##  $ PERNUM           : num  1 2 1 1 2 3 1 2 1 1 ...
##  $ educ_nohs        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ educ_hs          : num  0 0 1 1 0 0 1 1 0 0 ...
##  $ educ_smcoll      : num  0 1 0 0 0 0 0 0 1 0 ...
##  $ educ_as          : num  0 0 0 0 0 1 0 0 0 1 ...
##  $ educ_bach        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ educ_adv         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ ANCESTR1D        : num  2610 511 880 7060 7060 7060 1440 1420 9020 321 ...
##  $ ANCESTR2D        : num  9990 9990 9990 9990 9990 9990 9990 9990 9990 9990 ...
##  $ immig            : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Hispanic         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Hisp_Mex         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hisp_PR          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Hisp_Cuban       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hisp_DomR        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ white            : num  1 1 1 0 0 0 1 1 0 1 ...
##  $ AfAm             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Amindian         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Asian            : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ race_oth         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Married          : num  1 1 0 1 1 0 1 1 0 1 ...
##  $ divwidsep        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ unmarried        : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ veteran          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_AnyHealthIns : num  1 1 1 0 0 0 1 1 1 1 ...
##  $ has_PvtHealthIns : num  1 1 1 0 0 0 1 1 1 1 ...
##  $ Commute_car      : num  1 1 1 0 0 0 0 0 1 1 ...
##  $ Commute_bus      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Commute_subway   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Commute_rail     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Commute_other    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ below_povertyline: num  0 0 0 1 1 1 0 0 0 0 ...
##  $ below_150poverty : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ below_200poverty : num  0 0 0 1 1 1 1 1 0 0 ...
##  $ work_fullyr      : num  1 1 1 0 0 0 0 0 1 1 ...
##  $ income_total     : num  110000 39000 72000 0 7000 0 5800 19200 5200 61000 ...
##  $ income_wagesal   : num  110000 39000 72000 0 7000 0 0 0 5200 61000 ...
##  $ HH_income        : num  0 0 72000 7000 7000 7000 25000 25000 0 81000 ...
##  $ owner_cost       : num  2850 2850 0 0 0 ...
##  $ rent_cost        : num  0 0 430 900 900 900 0 0 0 0 ...
##  $ occ_dum          : num  1820 1550 4210 0 4520 0 0 0 4760 6440 ...
##  $ ind_dum          : num  7860 3390 770 0 8980 0 0 0 5170 770 ...
##  $ in_NYC           : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ PUMA             : num  3106 3106 100 4103 4103 ...
##  $ in_Bronx         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ in_Manhattan     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ in_StatenI       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ in_Brooklyn      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ in_Queens        : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ in_Westchester   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ in_Nassau        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ROOMS            : num  8 8 2 3 3 3 7 7 0 5 ...
##  $ BUILTYR2         : num  10 10 9 5 5 5 4 4 0 6 ...
##  $ UNITSSTR         : num  4 4 3 10 10 10 3 3 0 3 ...
attach(dat_pums_NY)

The command, str, is a good way to check out the data. In the next section I’ll explain more about the data and what the lines mean, but for now Age is the person’s age in years and female is a logical 0/1 variable. So if you look at the output from str (excerpted below), you can see that the first person is a 43-year-old female, the next is a 45-year-old male, etc.

##  num [1:196314] 43 45 33 57 52 26 83 87 21 45 ...
##  num [1:196314] 1 0 0 0 1 0 1 0 0 0 ...
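If you’d rather see a few complete rows at once, one way (just a sketch) is:

head(dat_pums_NY[ , c("Age", "female")])   # first six rows of just these two columns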

You can also use the command summary to find out about the data.

summary(dat_pums_NY)
##       Age            female           PERNUM         educ_nohs     
##  Min.   : 0.00   Min.   :0.0000   Min.   : 1.000   Min.   :0.0000  
##  1st Qu.:20.00   1st Qu.:0.0000   1st Qu.: 1.000   1st Qu.:0.0000  
##  Median :41.00   Median :1.0000   Median : 2.000   Median :0.0000  
##  Mean   :40.58   Mean   :0.5191   Mean   : 2.122   Mean   :0.3166  
##  3rd Qu.:59.00   3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
##  Max.   :94.00   Max.   :1.0000   Max.   :16.000   Max.   :1.0000  
##     educ_hs        educ_smcoll        educ_as          educ_bach     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.2196   Mean   :0.1576   Mean   :0.06641   Mean   :0.1374  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##     educ_adv        ANCESTR1D      ANCESTR2D        immig        
##  Min.   :0.0000   Min.   :  10   Min.   :  10   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.: 511   1st Qu.:6801   1st Qu.:0.00000  
##  Median :0.0000   Median :2251   Median :9990   Median :0.00000  
##  Mean   :0.1025   Mean   :3952   Mean   :7668   Mean   :0.08103  
##  3rd Qu.:0.0000   3rd Qu.:9020   3rd Qu.:9990   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :9990   Max.   :9990   Max.   :1.00000  
##     Hispanic         Hisp_Mex          Hisp_PR         Hisp_Cuban      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.000000  
##  Mean   :0.1353   Mean   :0.01672   Mean   :0.0454   Mean   :0.003107  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.000000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000000  
##    Hisp_DomR           white             AfAm           Amindian       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000000  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :0.000000  
##  Mean   :0.02642   Mean   :0.7093   Mean   :0.1347   Mean   :0.003994  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000000  
##      Asian            race_oth    Married         divwidsep     
##  Min.   :0.00000   Min.   :0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0   Median :0.0000   Median :0.0000  
##  Mean   :0.07088   Mean   :0   Mean   :0.3937   Mean   :0.1515  
##  3rd Qu.:0.00000   3rd Qu.:0   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :0   Max.   :1.0000   Max.   :1.0000  
##    unmarried         veteran        has_AnyHealthIns has_PvtHealthIns
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :1.0000   Median :1.0000  
##  Mean   :0.4548   Mean   :0.05515   Mean   :0.9082   Mean   :0.6706  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   Commute_car      Commute_bus      Commute_subway     Commute_rail    
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.2892   Mean   :0.02343   Mean   :0.06291   Mean   :0.01165  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##  Commute_other     below_povertyline below_150poverty below_200poverty
##  Min.   :0.00000   Min.   :0.0000    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000    1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000    Median :0.0000   Median :0.0000  
##  Mean   :0.05029   Mean   :0.1434    Mean   :0.2238   Mean   :0.3012  
##  3rd Qu.:0.00000   3rd Qu.:0.0000    3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000    Max.   :1.0000   Max.   :1.0000  
##   work_fullyr     income_total     income_wagesal     HH_income     
##  Min.   :0.000   Min.   : -11300   Min.   :     0   Min.   :-12800  
##  1st Qu.:0.000   1st Qu.:      0   1st Qu.:     0   1st Qu.:     0  
##  Median :0.000   Median :  14000   Median :     0   Median : 21400  
##  Mean   :0.376   Mean   :  31033   Mean   : 22646   Mean   : 30061  
##  3rd Qu.:1.000   3rd Qu.:  40000   3rd Qu.: 30000   3rd Qu.: 55000  
##  Max.   :1.000   Max.   :1125000   Max.   :504000   Max.   : 99990  
##    owner_cost      rent_cost         occ_dum        ind_dum    
##  Min.   :    0   Min.   :   0.0   Min.   :   0   Min.   :   0  
##  1st Qu.:    0   1st Qu.:   0.0   1st Qu.:   0   1st Qu.:   0  
##  Median :  702   Median :   0.0   Median :1030   Median :3980  
##  Mean   : 1148   Mean   : 336.2   Mean   :2474   Mean   :3893  
##  3rd Qu.: 1825   3rd Qu.: 550.0   3rd Qu.:4700   3rd Qu.:7860  
##  Max.   :12248   Max.   :3700.0   Max.   :9920   Max.   :9920  
##      in_NYC            PUMA         in_Bronx        in_Manhattan    
##  Min.   :0.0000   Min.   : 100   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:1500   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :3107   Median :0.00000   Median :0.00000  
##  Mean   :0.3571   Mean   :2701   Mean   :0.05582   Mean   :0.04695  
##  3rd Qu.:1.0000   3rd Qu.:3901   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :4114   Max.   :1.00000   Max.   :1.00000  
##    in_StatenI       in_Brooklyn       in_Queens      in_Westchester   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.02036   Mean   :0.1254   Mean   :0.1086   Mean   :0.04385  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    in_Nassau          ROOMS           BUILTYR2         UNITSSTR    
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.: 4.000   1st Qu.: 1.000   1st Qu.: 3.00  
##  Median :0.0000   Median : 6.000   Median : 3.000   Median : 3.00  
##  Mean   :0.0691   Mean   : 5.825   Mean   : 3.442   Mean   : 4.34  
##  3rd Qu.:0.0000   3rd Qu.: 7.000   3rd Qu.: 5.000   3rd Qu.: 6.00  
##  Max.   :1.0000   Max.   :19.000   Max.   :17.000   Max.   :10.00
print(NN_obs <- length(Age))
## [1] 196314

So this shows that there are 196,314 people in this dataset.

Next we compare the average age of the men and the women in the data,

summary(Age[female == 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   43.00   41.88   60.00   94.00
summary(Age[!female])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.00   39.00   39.19   57.00   94.00

The female dummy variable is a logical term, with a zero or one for false or true. The comparison is between those who have the variable female=1 (i.e. women) and those not female=1 (so logical not, denoted with the “!” symbol, i.e. men). I know, you can – and people do! – worry that this binary classification for gender misses some people; government statistics are just not there yet.

So women are, on average, a bit older, with an average age of 41.9 compared with 39.2 for men. You might wonder (if you were to begin to think like a statistician) whether that is a big difference – hold onto that thought! Alternatively you can use

mean(Age[female == 1])
## [1] 41.87525
sd(Age[female == 1])
## [1] 23.87617
mean(Age[!female])
## [1] 39.18672
sd(Age[!female])
## [1] 22.90436

to get the mean and standard deviation of each. Later you might encounter cases where you want more complicated dummy variables and want to use the logical relations “and”, “or”, and “not” (the symbols “&”, “|”, “!”), comparisons like “>=”, or multiplication and division.
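For example, here is a minimal sketch combining conditions (these particular combinations are chosen just for illustration):

# average total income of women aged 65 or older
mean(income_total[ (female == 1) & (Age >= 65) ])
# fraction of people who are below the poverty line or lack any health insurance
mean( (below_povertyline == 1) | (has_AnyHealthIns != 1) )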

As you’re going along, if you’re copying-and-pasting then you might not have had trouble, but if you’re typing then you’ve probably realized that R is persnickety – the variable Age is not equivalent to a variable AGE nor age nor aGe… You might be wondering if there’s an easier way than copy-paste; there is. If you highlight text in the Source pane and then hit CTRL-Enter, that will run the block of code. Alternatively, if you click the “Knit HTML” button then that will recreate the webpage as well as run all of the commands.

At the beginning you’ll be just copying commands that I give you but as you start making changes, you have to understand what’s going on.

Remember to save your work. The computers in the lab might wipe the memory clean when you log off, so back up your files. Either back up online (email it to yourself or upload to Google Drive or iCloud or whatever) and/or use a USB drive. But be careful – you have the option of saving your code and/or your workspace; it is usually better to save the code. For example, your code might load some data (at this stage, usually data that I’ve provided). If you make changes to this data during your session, you might not want to restart with those changes. If you just save your code and re-run the program, you will start fresh. If instead you save the workspace, then you end up saving all of your scraps and changes – which eventually start to build up. So only save your workspace if you’ve run a large chunk of code that you don’t want to have to re-run later. (This is even more true if you “attach” data frames without later “detach”-ing them!)
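If you do decide to save a workspace, the commands look something like this (the file name is just an example):

save.image("my_workspace.RData")   # saves every object currently in memory
# later, in a fresh session, to get it all back:
load("my_workspace.RData")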

To create files like the one you’re reading now, use “File > New File > R Markdown…”. Give it a title and save the file with some name that you’ll remember and the extension “.Rmd”. Keep everything in the same subdirectory.

There is a bigger philosophy behind this: the idea of “reproducible research,” that if you share your .Rmd file then anybody else can run the exact same program and get the exact same results. It’s useful in class since it’s an easy way to submit homework or share work with your study group.

Variable Coding

Some of the PUMS variables here have a natural interpretation, for instance Age is measured in years. Actually even this has a bit of a twist.

hist(Age)

A simple histogram doesn’t show it, but a fancier version (a nonparametric kernel density) will show issues:

plot(density(Age[runif(NN_obs) < 0.1], bw = "sj"))

Note, for you geeks, that if you tried this with the whole sample then the 200,000 observations would slow the computer too much, so I’ve randomly selected 10% (that’s the runif(NN_obs)<0.1 portion).

The kernel density is like a histogram with much smaller bins; in this case it shows a bit of weirdness, particularly at the right, where it looks like there are suddenly a bunch of people who are 94 but nobody in between. This is due to a coding choice by the Census, where really old people are just labeled as “94” (top-coding). So if you were to get finicky (and every good statistician is!) you might go back to the earlier calculations of averages and modify them all like this, to select just those who are female and who are coded as having age less than 90. And recall that top-coding wouldn’t change the median values calculated before, which is a point in favor of that statistic.

mean(Age[ (female == 1) & (Age<90) ]) 
## [1] 41.13141

You go make those other changes, figure out how top-coding changes the calculations of average age by gender – I’ll wait right here…

Variable Coding Again

So we were saying that some variables, like Age – ahem! – have a natural interpretation as a number.

Others are logical variables (called dummies) like female, Hispanic, or married – there is a yes/no answer that is coded 1/0. Note that if you’re creating these on your own, it’s good to give names that have that sort of yes/no answer, so a variable named ‘female’ is better than one named ‘gender’, where you’d have to remember which group is coded as true and which as false.
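If you wanted to build such a dummy yourself, a minimal sketch (the name ‘senior’ is made up for this example):

senior <- as.numeric(Age >= 65)   # 1 if the person is 65 or older, 0 otherwise
mean(senior)                      # the mean of a dummy is the fraction of the sample aged 65+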

Many variables, like PUMA, have no natural numerical interpretation at all. Here are the first few codes,

str(PUMA)
##  num [1:196314] 3106 3106 100 4103 4103 ...

You have to go to the codebook (or, in this case, the file pums_initial_recoding.r) to find out that PUMA is “Public Use Microdata Area”, where 3801 codes for Washington Heights/Inwood, 3802 is Hamilton Heights/Manhattanville/West Harlem, etc. The program will happily calculate the average value for PUMA (type in mean(PUMA) and see for yourself!) but this is a meaningless value – wtf is the average neighborhood code value!? If you want to select just people living in a particular neighborhood then you’d have to look up its code in the list below (there’s a short sketch of this after the list).

PUMA Neighborhood
3701 NYC-Bronx CD 8–Riverdale, Fieldston & Kingsbridge
3702 NYC-Bronx CD 12–Wakefield, Williamsbridge & Woodlawn
3703 NYC-Bronx CD 10–Co-op City, Pelham Bay & Schuylerville
3704 NYC-Bronx CD 11–Pelham Parkway, Morris Park & Laconia
3705 NYC-Bronx CD 3 & 6–Belmont, Crotona Park East & East Tremont
3706 NYC-Bronx CD 7–Bedford Park, Fordham North & Norwood
3707 NYC-Bronx CD 5–Morris Heights, Fordham South & Mount Hope
3708 NYC-Bronx CD 4–Concourse, Highbridge & Mount Eden
3709 NYC-Bronx CD 9–Castle Hill, Clason Point & Parkchester
3710 NYC-Bronx CD 1 & 2–Hunts Point, Longwood & Melrose
3801 NYC-Manhattan CD 12–Washington Heights, Inwood & Marble Hill
3802 NYC-Manhattan CD 9–Hamilton Heights, Manhattanville & West Harlem
3803 NYC-Manhattan CD 10–Central Harlem
3804 NYC-Manhattan CD 11–East Harlem
3805 NYC-Manhattan CD 8–Upper East Side
3806 NYC-Manhattan CD 7–Upper West Side & West Side
3807 NYC-Manhattan CD 4 & 5–Chelsea, Clinton & Midtown Business District
3808 NYC-Manhattan CD 6–Murray Hill, Gramercy & Stuyvesant Town
3809 NYC-Manhattan CD 3–Chinatown & Lower East Side
3810 NYC-Manhattan CD 1 & 2–Battery Park City, Greenwich Village & Soho
3901 NYC-Staten Island CD 3–Tottenville, Great Kills & Annadale
3902 NYC-Staten Island CD 2–New Springville & South Beach
3903 NYC-Staten Island CD 1–Port Richmond, Stapleton & Mariner’s Harbor
4001 NYC-Brooklyn CD 1–Greenpoint & Williamsburg
4002 NYC-Brooklyn CD 4—Bushwick
4003 NYC-Brooklyn CD 3–Bedford-Stuyvesant
4004 NYC-Brooklyn CD 2–Brooklyn Heights & Fort Greene
4005 NYC-Brooklyn CD 6–Park Slope, Carroll Gardens & Red Hook
4006 NYC-Brooklyn CD 8–Crown Heights North & Prospect Heights
4007 NYC-Brooklyn CD 16–Brownsville & Ocean Hill
4008 NYC-Brooklyn CD 5–East New York & Starrett City
4009 NYC-Brooklyn CD 18–Canarsie & Flatlands
4010 NYC-Brooklyn CD 17–East Flatbush, Farragut & Rugby
4011 NYC-Brooklyn CD 9–Crown Heights South, Prospect Lefferts & Wingate
4012 NYC-Brooklyn CD 7–Sunset Park & Windsor Terrace
4013 NYC-Brooklyn CD 10–Bay Ridge & Dyker Heights
4014 NYC-Brooklyn CD 12–Borough Park, Kensington & Ocean Parkway
4015 NYC-Brooklyn CD 14–Flatbush & Midwood
4016 NYC-Brooklyn CD 15–Sheepshead Bay, Gerritsen Beach & Homecrest
4017 NYC-Brooklyn CD 11–Bensonhurst & Bath Beach
4018 NYC-Brooklyn CD 13–Brighton Beach & Coney Island
4101 NYC-Queens CD 1–Astoria & Long Island City
4102 NYC-Queens CD 3–Jackson Heights & North Corona
4103 NYC-Queens CD 7–Flushing, Murray Hill & Whitestone
4104 NYC-Queens CD 11–Bayside, Douglaston & Little Neck
4105 NYC-Queens CD 13–Queens Village, Cambria Heights & Rosedale
4106 NYC-Queens CD 8–Briarwood, Fresh Meadows & Hillcrest
4107 NYC-Queens CD 4–Elmhurst & South Corona
4108 NYC-Queens CD 6–Forest Hills & Rego Park
4109 NYC-Queens CD 2–Sunnyside & Woodside
4110 NYC-Queens CD 5–Ridgewood, Glendale & Middle Village
4111 NYC-Queens CD 9–Richmond Hill & Woodhaven
4112 NYC-Queens CD 12–Jamaica, Hollis & St. Albans
4113 NYC-Queens CD 10–Howard Beach & Ozone Park
4114 NYC-Queens CD 14–Far Rockaway, Breezy Point & Broad Channel
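For example, to look just at people coded as living in Washington Heights/Inwood (PUMA 3801), a quick sketch:

mean(Age[PUMA == 3801])            # average age within that one neighborhood
summary(HH_income[PUMA == 3801])   # household incomes there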

Now you’re probably thinking, isn’t there some easier way? Yes there is. R has variables called “factors” that join together the long list of codes with a separate file telling what those codes mean. Later when we do further statistics, R will know how to appropriately treat these factors.

PUMA <- as.factor(PUMA)
female <- as.factor(female)

I will leave you to worry over the recoding of the other variables, because it’s good for the soul. I will show you two ways – the quick-and-dirty way and the fancy correct way.

First the quick and dirty way.

print(levels(female))
## [1] "0" "1"
levels(female) <- c("male","female")

Well, quick-and-dirty ways, plural – here’s another,

educ_indx <- factor((educ_nohs + 2*educ_hs + 3*educ_smcoll + 4*educ_as + 5*educ_bach + 6*educ_adv), levels=c(1,2,3,4,5,6),labels = c("No HS","HS","SmColl","AS","Bach","Adv"))

These just type in the levels by hand. But for something like PUMA the list could be long, and hand-typed labels might not even match every code.

But first a bit of a detour, to mention how to use packages. R depends crucially on “packages” – that’s the whole reason the open-source model works. Some statistician invents a cool new technique, then writes up the code in R and makes it available. If you used a commercial program, you’d have to wait a decade for them to update it; in R it’s here now. Also, if somebody hacks a nicer or easier way to do stuff, they write it up.

From R-Studio, click “Tools” then “Install Packages…” and tell it to install the package, “plyr”. Then the next piece of code, library, tells the program that you want to use commands from this package.
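You can also do this from the Console instead of the menus; a minimal sketch:

install.packages("plyr")   # only needs to be done once per computer
library(plyr)              # needs to be done once per R session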

levels_n <- read.csv("PUMA_levels.csv")
levels_orig <- levels(PUMA) 
library(plyr)
levels_new <- join(data.frame(levels_orig),data.frame(levels_n))
## Joining by: levels_orig
levels(PUMA) <- levels_new$New_Level

Those commands read in a little csv file that I made, with all of the PUMA codes, then match the old codes with the full text descriptions.

Now R will do the summary differently when it knows the variable is a factor,

summary(female)
##   male female 
##  94400 101914
summary(PUMA)
##                   NYC-Bronx CD 8--Riverdale, Fieldston & Kingsbridge 
##                                                                  915 
##                NYC-Bronx CD 12--Wakefield, Williamsbridge & Woodlawn 
##                                                                 1251 
##              NYC-Bronx CD 10--Co-op City, Pelham Bay & Schuylerville 
##                                                                  780 
##               NYC-Bronx CD 11--Pelham Parkway, Morris Park & Laconia 
##                                                                 1230 
##        NYC-Bronx CD 3 & 6--Belmont, Crotona Park East & East Tremont 
##                                                                 1358 
##                NYC-Bronx CD 7--Bedford Park, Fordham North & Norwood 
##                                                                  804 
##           NYC-Bronx CD 5--Morris Heights, Fordham South & Mount Hope 
##                                                                 1047 
##                   NYC-Bronx CD 4--Concourse, Highbridge & Mount Eden 
##                                                                  908 
##              NYC-Bronx CD 9--Castle Hill, Clason Point & Parkchester 
##                                                                 1346 
##                  NYC-Bronx CD 1 & 2--Hunts Point, Longwood & Melrose 
##                                                                 1320 
##        NYC-Manhattan CD 12--Washington Heights, Inwood & Marble Hill 
##                                                                 1133 
##   NYC-Manhattan CD 9--Hamilton Heights, Manhattanville & West Harlem 
##                                                                  813 
##                                  NYC-Manhattan CD 10--Central Harlem 
##                                                                  882 
##                                     NYC-Manhattan CD 11--East Harlem 
##                                                                  839 
##                                  NYC-Manhattan CD 8--Upper East Side 
##                                                                 1074 
##                      NYC-Manhattan CD 7--Upper West Side & West Side 
##                                                                  872 
## NYC-Manhattan CD 4 & 5--Chelsea, Clinton & Midtown Business District 
##                                                                  912 
##          NYC-Manhattan CD 6--Murray Hill, Gramercy & Stuyvesant Town 
##                                                                  683 
##                      NYC-Manhattan CD 3--Chinatown & Lower East Side 
##                                                                 1048 
##  NYC-Manhattan CD 1 & 2--Battery Park City, Greenwich Village & Soho 
##                                                                  960 
##          NYC-Staten Island CD 3--Tottenville, Great Kills & Annadale 
##                                                                 1287 
##                NYC-Staten Island CD 2--New Springville & South Beach 
##                                                                 1146 
##  NYC-Staten Island CD 1--Port Richmond, Stapleton & Mariner's Harbor 
##                                                                 1564 
##                         NYC-Brooklyn CD 1--Greenpoint & Williamsburg 
##                                                                 1390 
##                                           NYC-Brooklyn CD 4—Bushwick 
##                                                                 1073 
##                                NYC-Brooklyn CD 3--Bedford-Stuyvesant 
##                                                                 1181 
##                    NYC-Brooklyn CD 2--Brooklyn Heights & Fort Greene 
##                                                                 1284 
##            NYC-Brooklyn CD 6--Park Slope, Carroll Gardens & Red Hook 
##                                                                 1053 
##            NYC-Brooklyn CD 8--Crown Heights North & Prospect Heights 
##                                                                  989 
##                         NYC-Brooklyn CD 16--Brownsville & Ocean Hill 
##                                                                 1041 
##                     NYC-Brooklyn CD 5--East New York & Starrett City 
##                                                                 1439 
##                             NYC-Brooklyn CD 18--Canarsie & Flatlands 
##                                                                 2419 
##                  NYC-Brooklyn CD 17--East Flatbush, Farragut & Rugby 
##                                                                 1335 
##  NYC-Brooklyn CD 9--Crown Heights South, Prospect Lefferts & Wingate 
##                                                                  849 
##                     NYC-Brooklyn CD 7--Sunset Park & Windsor Terrace 
##                                                                 1384 
##                        NYC-Brooklyn CD 10--Bay Ridge & Dyker Heights 
##                                                                 1473 
##         NYC-Brooklyn CD 12--Borough Park, Kensington & Ocean Parkway 
##                                                                 1847 
##                               NYC-Brooklyn CD 14--Flatbush & Midwood 
##                                                                 1400 
##      NYC-Brooklyn CD 15--Sheepshead Bay, Gerritsen Beach & Homecrest 
##                                                                 1625 
##                         NYC-Brooklyn CD 11--Bensonhurst & Bath Beach 
##                                                                 1982 
##                    NYC-Brooklyn CD 13--Brighton Beach & Coney Island 
##                                                                  850 
##                          NYC-Queens CD 1--Astoria & Long Island City 
##                                                                 1712 
##                      NYC-Queens CD 3--Jackson Heights & North Corona 
##                                                                 1321 
##                  NYC-Queens CD 7--Flushing, Murray Hill & Whitestone 
##                                                                 2162 
##                  NYC-Queens CD 11--Bayside, Douglaston & Little Neck 
##                                                                 1315 
##         NYC-Queens CD 13--Queens Village, Cambria Heights & Rosedale 
##                                                                 2132 
##                NYC-Queens CD 8--Briarwood, Fresh Meadows & Hillcrest 
##                                                                 1293 
##                             NYC-Queens CD 4--Elmhurst & South Corona 
##                                                                  955 
##                            NYC-Queens CD 6--Forest Hills & Rego Park 
##                                                                  973 
##                                NYC-Queens CD 2--Sunnyside & Woodside 
##                                                                 1101 
##                NYC-Queens CD 5--Ridgewood, Glendale & Middle Village 
##                                                                 1821 
##                           NYC-Queens CD 9--Richmond Hill & Woodhaven 
##                                                                 1803 
##                       NYC-Queens CD 12--Jamaica, Hollis & St. Albans 
##                                                                 2427 
##                          NYC-Queens CD 10--Howard Beach & Ozone Park 
##                                                                 1344 
##         NYC-Queens CD 14--Far Rockaway, Breezy Point & Broad Channel 
##                                                                  966 
##                                                                 NA's 
##                                                               126203
summary(educ_indx)
##  No HS     HS SmColl     AS   Bach    Adv 
##  62156  43109  30930  13038  26968  20113

To find mean and standard deviation by neighborhood, you could use something like this,

ddply(dat_pums_NY, .(PUMA), summarize, mean = round(mean(Age), 2), sd = round(sd(Age), 2))
##     PUMA  mean    sd
## 1    100 40.55 23.95
## 2    200 42.77 23.21
## 3    300 42.71 23.13
## 4    401 41.54 23.67
## 5    402 42.63 24.53
## 6    403 42.17 23.71
## 7    500 39.32 23.30
## 8    600 40.62 23.45
## 9    701 37.00 23.40
## 10   702 40.23 22.60
## 11   703 44.47 23.67
## 12   704 43.62 23.39
## 13   800 42.03 23.66
## 14   901 43.29 24.46
## 15   902 36.54 22.64
## 16   903 35.69 22.53
## 17   904 44.86 24.06
## 18   905 41.34 23.07
## 19   906 41.02 23.93
## 20  1000 42.58 23.02
## 21  1101 42.90 23.62
## 22  1102 42.48 23.48
## 23  1201 43.26 23.14
## 24  1202 41.09 24.60
## 25  1203 42.27 23.20
## 26  1204 44.42 23.81
## 27  1205 38.78 23.95
## 28  1206 37.67 22.35
## 29  1207 42.84 23.36
## 30  1300 41.62 22.93
## 31  1400 42.27 24.46
## 32  1500 40.68 23.39
## 33  1600 43.47 24.79
## 34  1700 41.97 23.73
## 35  1801 40.84 22.36
## 36  1802 41.45 23.19
## 37  1900 41.20 22.65
## 38  2001 39.51 23.19
## 39  2002 42.19 23.50
## 40  2100 45.62 23.89
## 41  2201 41.26 24.11
## 42  2202 43.50 23.29
## 43  2203 44.76 23.90
## 44  2300 35.66 21.84
## 45  2401 43.27 24.18
## 46  2402 43.47 23.49
## 47  2500 40.64 23.90
## 48  2600 42.86 23.88
## 49  2701 43.25 23.15
## 50  2702 41.83 23.07
## 51  2801 41.38 23.08
## 52  2802 40.01 23.22
## 53  2901 40.07 23.48
## 54  2902 40.37 23.01
## 55  2903 35.10 23.54
## 56  3001 41.72 23.46
## 57  3002 41.43 24.10
## 58  3003 30.50 23.62
## 59  3101 42.62 23.13
## 60  3102 40.34 23.79
## 61  3103 41.53 24.05
## 62  3104 38.20 24.19
## 63  3105 42.67 24.42
## 64  3106 41.20 23.97
## 65  3107 41.17 24.04
## 66  3201 42.61 25.23
## 67  3202 43.93 24.26
## 68  3203 42.75 23.44
## 69  3204 42.18 23.72
## 70  3205 42.30 23.95
## 71  3206 38.88 22.93
## 72  3207 43.33 23.17
## 73  3208 42.13 23.88
## 74  3209 42.17 23.14
## 75  3210 43.27 23.62
## 76  3211 41.91 23.63
## 77  3212 44.48 24.68
## 78  3301 40.79 23.67
## 79  3302 43.54 23.73
## 80  3303 42.63 24.01
## 81  3304 39.64 23.16
## 82  3305 47.21 24.05
## 83  3306 40.03 23.10
## 84  3307 40.95 23.46
## 85  3308 39.40 22.60
## 86  3309 41.91 23.47
## 87  3310 36.43 22.30
## 88  3311 41.30 23.97
## 89  3312 42.70 23.79
## 90  3313 39.83 23.89
## 91  3701 43.95 24.62
## 92  3702 37.54 23.02
## 93  3703 43.98 24.55
## 94  3704 40.02 23.51
## 95  3705 32.33 21.78
## 96  3706 35.52 22.54
## 97  3707 33.56 21.53
## 98  3708 34.17 21.92
## 99  3709 37.12 23.35
## 100 3710 33.19 21.30
## 101 3801 41.55 22.54
## 102 3802 34.82 21.01
## 103 3803 37.39 21.14
## 104 3804 38.27 22.19
## 105 3805 43.68 23.42
## 106 3806 44.59 23.39
## 107 3807 39.93 19.40
## 108 3808 41.10 20.83
## 109 3809 40.42 21.90
## 110 3810 37.48 20.07
## 111 3901 41.78 23.31
## 112 3902 42.60 24.31
## 113 3903 38.90 23.35
## 114 4001 33.22 20.73
## 115 4002 33.79 20.74
## 116 4003 33.95 21.87
## 117 4004 37.61 20.69
## 118 4005 37.34 20.79
## 119 4006 38.39 21.91
## 120 4007 34.20 22.56
## 121 4008 35.74 22.37
## 122 4009 39.06 22.87
## 123 4010 40.11 23.41
## 124 4011 39.05 23.90
## 125 4012 35.33 21.28
## 126 4013 41.22 23.82
## 127 4014 33.95 24.41
## 128 4015 39.55 23.84
## 129 4016 41.79 24.36
## 130 4017 41.23 23.46
## 131 4018 46.05 24.51
## 132 4101 39.62 20.60
## 133 4102 37.86 22.47
## 134 4103 43.54 23.85
## 135 4104 45.08 23.46
## 136 4105 41.86 23.53
## 137 4106 40.34 23.14
## 138 4107 38.24 21.92
## 139 4108 44.41 23.32
## 140 4109 39.38 21.51
## 141 4110 40.86 22.91
## 142 4111 38.23 21.78
## 143 4112 39.17 23.33
## 144 4113 37.64 22.01
## 145 4114 38.85 24.77

Although tapply would also work fine.
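For instance, a rough sketch of the tapply version, which should give the same sort of breakdown:

tapply(dat_pums_NY$Age, dat_pums_NY$PUMA, mean)   # average Age within each PUMA code
tapply(dat_pums_NY$Age, dat_pums_NY$PUMA, sd)     # standard deviation within each PUMA code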

You could also use table,

table(educ_indx,female)
##          female
## educ_indx  male female
##    No HS  31376  30780
##    HS     20783  22326
##    SmColl 15063  15867
##    AS      5471   7567
##    Bach   12812  14156
##    Adv     8895  11218
xtabs(~educ_indx + female)
##          female
## educ_indx  male female
##    No HS  31376  30780
##    HS     20783  22326
##    SmColl 15063  15867
##    AS      5471   7567
##    Bach   12812  14156
##    Adv     8895  11218

Want proportions instead of counts?

prop.table(table(educ_indx,female))
##          female
## educ_indx       male     female
##    No HS  0.15982559 0.15678963
##    HS     0.10586611 0.11372597
##    SmColl 0.07672912 0.08082460
##    AS     0.02786862 0.03854539
##    Bach   0.06526279 0.07210897
##    Adv    0.04531006 0.05714315

Remember prop.table later when we do marginals.
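As a quick preview of what that will look like (just a sketch), the margin argument controls which way the proportions sum to one:

prop.table(table(educ_indx, female), margin = 1)   # shares within each education group (rows sum to 1)
prop.table(table(educ_indx, female), margin = 2)   # shares within each gender (columns sum to 1)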

In general, R is very flexible so there are often many different ways to get the same answer. There are some people who love to debate which is best. For now just worry about learning at least one way. Later on you can go back and refine your techniques.

For example, sometimes attaching a dataset makes things easier. But as you get more advanced you might find it better to include the dataset name inside the function. There are advantages and disadvantages each way and some of the intro texts suggest one or the other.
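For instance, these should all give the same answer – one relies on the attached data frame, the others name it explicitly (just a sketch):

mean(Age)                      # works because dat_pums_NY is attached
mean(dat_pums_NY$Age)          # works whether or not anything is attached
with(dat_pums_NY, mean(Age))   # a third option, scoped to this one line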

If you do a lot of analysis on a particular subgroup, it might be worthwhile to create a subset of that group, so that you don’t have to always add on logical conditions. These two sets of expressions, looking at “prime-age” people, get the same results:

mean(educ_nohs[(Age >= 25)&(Age <= 55)])
## [1] 0.1176959
mean(educ_hs[(Age >= 25)&(Age <= 55)])
## [1] 0.2455135
mean(educ_smcoll[(Age >= 25)&(Age <= 55)])
## [1] 0.1727465
mean(educ_as[(Age >= 25)&(Age <= 55)])
## [1] 0.1012049
mean(educ_bach[(Age >= 25)&(Age <= 55)])
## [1] 0.2126975
mean(educ_adv[(Age >= 25)&(Age <= 55)])
## [1] 0.1501417
# alternatively
restrict1 <- as.logical((Age >= 25)&(Age <= 55))
dat_age_primeage <- subset(dat_pums_NY, restrict1)

detach()
attach(dat_age_primeage)
## The following objects are masked _by_ .GlobalEnv:
## 
##     female, PUMA
## 
## The following objects are masked from dat_pums_NY:
## 
##     AfAm, Age, Amindian, ANCESTR1D, ANCESTR2D, Asian,
##     below_150poverty, below_200poverty, below_povertyline,
##     BUILTYR2, Commute_bus, Commute_car, Commute_other,
##     Commute_rail, Commute_subway, divwidsep, educ_adv, educ_as,
##     educ_bach, educ_hs, educ_nohs, educ_smcoll, female,
##     has_AnyHealthIns, has_PvtHealthIns, HH_income, Hisp_Cuban,
##     Hisp_DomR, Hisp_Mex, Hisp_PR, Hispanic, immig, in_Bronx,
##     in_Brooklyn, in_Manhattan, in_Nassau, in_NYC, in_Queens,
##     in_StatenI, in_Westchester, income_total, income_wagesal,
##     ind_dum, Married, occ_dum, owner_cost, PERNUM, PUMA, race_oth,
##     rent_cost, ROOMS, UNITSSTR, unmarried, veteran, white,
##     work_fullyr
mean(educ_nohs)
## [1] 0.1176959
mean(educ_hs)
## [1] 0.2455135
mean(educ_smcoll)
## [1] 0.1727465
mean(educ_as)
## [1] 0.1012049
mean(educ_bach)
## [1] 0.2126975
mean(educ_adv)
## [1] 0.1501417
detach()

So you detach the original data frame and instead attach the restricted version. Then any subsequent analysis is done just on that subset. Just remember that you’ve done this (again, this is a good reason to save the commands in a program so you can look back); otherwise you’ll wonder why you suddenly don’t have any kids in the sample!

Why All These Details?

You might be tired and bored by these details, but note that there are actually important choices to be made here, even in simply defining variables. Take the fraught American category of “race”. This data has a variable, RACED, showing how people chose to classify themselves: ‘White,’ ‘Black,’ ‘American Indian or Alaska Native,’ ‘Asian,’ various combinations, and many more codes.

Suppose you wanted to find out how many Asians are in a particular population. You could count how many people identify themselves as Asian only; you could count how many people identify as Asian in any combination. Sometimes the choice is irrelevant; sometimes it can skew the final results (e.g. the question in some areas, are there more African-Americans or more Hispanics?).

Again, there’s no “right” way to do it because there’s no science in this peculiar-but-popular concept of “race”. People’s conceptions of themselves are fuzzy and complicated; these measures are approximations.

Basics of government race/ethnicity classification

The US government asks questions about people’s race and ethnicity. These categories are social constructs, which is a fancy way of pointing out that they are based on people’s own views of themselves (influenced by how we think that other people think of us…). Currently the standard classification asks people separately about their “race” and “ethnicity” where people can pick labels from each category in any combination.

The “race” categories include: “White alone,” “Black or African-American alone,” “American Indian alone,” “Alaska Native alone,” “American Indian and Alaska Native tribes specified; or American Indian or Alaska native, not specified and no other race,” “Asian alone,” “Native Hawaiian and other Pacific Islander alone,” “Some other race alone,” or “Two or more major race groups.” Then the supplemental race categories offer more detail.

These are a peculiar combination of the very general (well over 40% of the world’s population is “Asian”) and the very specific (“Alaska Native alone”), representing a peculiar history of popular attitudes in the US. Only in the 2000 Census did they start to classify people as belonging to more than one race. If you were to go back to historical US Censuses from more than a century ago, you would find that the category “race” included separate entries for Irish and French and various other nationalities. Stephen Jay Gould has a fascinating book, The Mismeasure of Man, discussing how early scientific classifications of humans tried to “prove” which nationalities/races/groups were the smartest. Ta-Nehisi Coates notes, “racism invented race in America.”

Note that “Hispanic” is not “race” but rather ethnicity (includes various other labels such as Spanish, Latino, etc.). So a respondent could choose “Hispanic” and any race category – some choose “White,” some choose “Black,” some might be combined with any other of those complicated racial categories.

If you wanted to create a variable for those who report themselves as African-American and Hispanic, you’d use the expression (AfAm == 1) & (Hispanic == 1); sometimes stats report for non-Hispanic whites so (white == 1) & (Hispanic != 1). You can create your own classifications depending on what questions you’re investigating.
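A minimal sketch of that, using the variable names from the data above (nothing is attached at this point, so the data frame is named explicitly):

AfAm_Hisp <- (dat_pums_NY$AfAm == 1) & (dat_pums_NY$Hispanic == 1)
mean(AfAm_Hisp)        # fraction reporting both African-American and Hispanic
white_nonHisp <- (dat_pums_NY$white == 1) & (dat_pums_NY$Hispanic != 1)
mean(white_nonHisp)    # fraction non-Hispanic white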

The Census Bureau gives more information here.

All of these racial categories make some people uneasy: is the government encouraging racism by recognizing these classifications? Some other governments choose not to collect race data. But that doesn’t mean that there are no differences, only that the government doesn’t choose to measure any of these differences. In the US, government agencies such as the Census and BLS don’t generally collect data on religion.

Re-Coding complicated variables from initial data

If we want more combinations of variables then we create those. Usually a statistical analysis spends a lot of time doing this sort of housekeeping – dull but necessary. It has a variety of names: data carpentry, data munging…

Educational attainment is also classified with complicated codes: the original data has code 63 to mean high school diploma, 64 for a GED, 65 for less than a year of college, etc. I have transformed them into a series of dummy variables, zero/one variables for whether a person has no high school diploma, just a high school diploma, some college (but no degree), an associate’s degree, a bachelor’s degree, or an advanced degree. As with race, there is no rule that you must always do it thus. Your own analysis might be different. (Is it valid to lump together everybody who lacks a high school diploma, no matter whether they completed just first grade or up to 12th?)
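A sketch of what that sort of recoding looks like – here educ_code is a made-up name standing in for the raw detailed-education variable, and the codes are the ones mentioned above (63 for a regular diploma, 64 for a GED):

# hypothetical recode: treat a regular diploma and a GED both as "high school"
educ_hs_example <- as.numeric( (educ_code == 63) | (educ_code == 64) )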

That’s the whole point of learning to do the data work for yourself: you can see all of the little decisions that go into creating a conclusion. Some conclusions might be fragile so a tiny decision about coding could change everything; other conclusions are robust to deviations. You must find out.

Time Series in R

In this course we will do very little time series analysis but for now I will demonstrate some of the stuff that R can easily do.

First install some packages (you must click “Tools” then “Install Packages…” for each of these):

library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lattice)
library(latticeExtra)
## Loading required package: RColorBrewer
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
## 
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
## 
## Attaching package: 'gdata'
## 
## The following object is masked from 'package:stats':
## 
##     nobs
## 
## The following object is masked from 'package:utils':
## 
##     object.size
rm(list = ls(all = TRUE))

Then get data from online – also a cool easy thing to do with R.

oilspot_url <- "http://www.eia.gov/dnav/pet/xls/PET_PRI_SPT_S1_D.xls"
oilspot_dat <- read.xls(oilspot_url, sheet = 2, pattern = "Cushing")

oilfut_url <- "http://www.eia.gov/dnav/pet/xls/PET_PRI_FUT_S1_D.xls"
oilfut_dat <- read.xls(oilfut_url, sheet = 2, pattern = "Cushing, OK Crude Oil Future Contract 1")

Use R’s built-in system for converting jumbles of letters and numbers into dates,

date_spot <- as.Date(oilspot_dat$Date, format='%b %d,%Y')
date_fut <- as.Date(oilfut_dat$Date, format='%b %d,%Y')

Then use R’s built-in “ts” format for time series along with the “zoo” package.

wti_spot <- ts(oilspot_dat$Cushing..OK.WTI.Spot.Price.FOB..Dollars.per.Barrel., start = c(1986,2), frequency = 365)
wti_fut1 <- ts(oilfut_dat$Cushing..OK.Crude.Oil.Future.Contract.1..Dollars.per.Barrel., start = c(1983,89), frequency = 365)

wti_sp_dat <- zoo(wti_spot,date_spot)
wti_ft_dat <- zoo(wti_fut1,date_fut)

wti_spotfut <- merge(wti_sp_dat,wti_ft_dat, all=FALSE)

And plot the results,

plot(wti_spotfut, plot.type = "single", col = c("black", "blue"))

It’s tough to see any difference, so try this

wti_2013 <- window(wti_spotfut, start = as.Date("2013-01-01"), end = as.Date("2013-12-31"))
plot(wti_2013, plot.type = "single", col = c("black", "blue"))

If you like the look of The Economist, you can get fancier and style the chart that way,

asTheEconomist(xyplot(wti_2013, xlab="Cushing WTI Spot Future Price"))

De-bugging

Without a doubt, programming is tough. In R or in any other language, it is frustrating and complicated and difficult the first few times. Some days it feels like a continuous battle just to do the simplest thing! Keep going despite that; keep working on it.

Your study group will be very helpful of course.

Often a Google search of the error message helps. If you’ve isolated the error and read the help documentation on that command, then you’re well on your way to solving the problem on your own.

If you have troubles that you can’t solve, email me for help. But try to narrow down your question: if you run 20 lines of code that produce an error, is there a way to reproduce the error after just 5 lines? What if you did the same command on much simpler data, would it still cause an error? Sending emails like “I have a problem with errors” might be cathartic but is not actually useful to anyone. If you email me with the minimal code that recreates the error, along with the text of the error and/or a screenshot, then that will help more.