Econ B2000, Statistics and Introduction to Econometrics

Kevin R Foster, Colin Powell School, the City College of New York, CUNY

For this class we’ll be using the statistical analysis program R. The computer labs here on campus have the necessary software. If you have your own computer, download and install the program from the R Project, along with R-Studio. The details will differ depending on your particular device, so be patient and work through it - there are plenty of online resources to help with any problem that arises.

The best way to learn it is to just do it. How many times have you gotten a new game, skipped reading the instructions, and learned by crashing it a few times? This isn’t quite so simple, but the basic gist remains - try it. Below I give you some pointers about how to just do it, but they’re meant to be read and then immediately done in real life, so I’ll give you some commands to copy-and-paste straight into the program.

The program, R, is the machine underneath that does the work, while R-Studio is a skin on top that makes it easier to use. Install both, then run R-Studio, and you’ll get something that looks like this: [screenshot of R-Studio]

The screen has 4 parts: the Console at lower left (where I drew the green arrow) is most important since that’s where you type in the commands to R.

You can start by just copying and pasting commands from this help into the “Console” and seeing the output from the program.

The guide, An Introduction to R, suggests these commands to give a basic flavor of what’s done and hint at the power. Don’t worry too much about each step for now, this is just a glimpse to show that with a few commands you can do some relatively sophisticated estimation – i.e. for a small cost you can get a large benefit. Copy them and paste into the “Console”.

x <- 1:50
w <- 1 + sqrt(x)/2
example1 <- data.frame(x=x, y= x + rnorm(x)*w)
attach(example1)

This creates x and y variables (where the rnorm command creates random numbers from a normal distribution), puts them into a data frame, then attaches that data frame so that R can use it in later calculations. (What’s a data frame? A way to collect together related data items.) Next some stats - create a linear model (that’s the “lm”) then a “lowess” nonparametric local regression, and plot the two estimations to compare. (Just copy and paste these, don’t worry about understanding them for now!)

fm <- lm(y ~ x)
summary(fm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1182 -2.3420 -0.4238  2.8271  7.9599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.11519    1.11950  -0.103    0.918    
## x            0.99271    0.03821  25.982   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.899 on 48 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9322 
## F-statistic:   675 on 1 and 48 DF,  p-value: < 2.2e-16
lrf <- lowess(x, y)
plot(x, y)
lines(x, lrf$y)
abline(0, 1, lty=3)
abline(coef(fm))

detach()

Your own graph should look similar, although not exactly the same because of the random numbers.

The final “detach” command just cleans up; it is the opposite of “attach”.

Later you will find that while it is possible to go back and fix up code that you previously ran in R (the up arrow brings back recent commands), it is a bit of a pain. It is easier to write out the code in the upper-left panel (the Source pane), and then R-Studio can obediently run all or part of it in R for you. Then if you make a mistake or just want to do it again, it’s a bit easier, and you can save your code for next time as well.

One big thing to learn is that while pushing buttons is easier at first, eventually you want to be able to write code. I won’t insist that you do that from week #1, but keep it in mind as a goal to work towards. In R, you can usually see the code generated by the button-pushes so you can learn.

For all of these commands, you can use the help box on R Studio. But as I said, don’t worry much about those commands for now, I’m not expecting you to become an R-ninja overnight.

With Some Data

Next we will go through some basic stats with R using the Census Bureau’s PUMS (Public Use Microdata Sample, from the American Community Survey, accessed from IPUMS). I have restricted the sample to contain only people living in the state of New York, a sample of 196,585. The full sample, with 3,190,040 observations, is also on the class website if you’d like – although you might find that it slows your computer quite a bit!

You have to learn a bit about your computer’s filing system. When you download stuff, it probably goes into a Downloads folder. When you installed R and R-Studio, those went into their own folders (probably within some Applications folder), and then the program might have created an R folder for your work. When you do homework or other projects, those might (well, really should) get their own folders. But joining those up is a bit of housekeeping. If you read how the pros organize their work, you’ll pick up more tips on organization.

Go and download the PUMS data from the class page; that will likely put the zip file into your Downloads folder. Within that zip file is one particular file, acs2017_ny_data.RData - move that into some appropriately-titled folder for this project. (Personally I might use the default “R Projects” folder and then create “PUMS Data” as a folder within that, but you can choose. Young people with supple minds can get by even with zero organization, like a hippie, but later you might appreciate being tidier.)

R is looking for files in a particular directory (folder). Type the command getwd() to see where it’s currently looking. Then use the command setwd() to tell it where it ought to be looking (the R Projects/PUMS Data folder that you created just now, or wherever you put that data file). Alternatively, you can click “Session” then “Set Working Directory” then “Choose Directory”, click the folder, and that will insert the setwd command onto the Console line. Either way, tell R to look for the data in the folder where you put the data.
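For instance, a minimal check might look like this (the path in the comment is just a made-up example; substitute wherever you actually put the file):

getwd()  # shows the folder where R is currently looking
# setwd("C:/Users/yourname/R Projects/PUMS Data")  # example path only - use your own folder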

If you have a Dropbox (or Git or whatever) account then that can work the same way, as long as you set it up as a folder on your computer.

Then run these commands (output from those commands is below),

rm(list = ls(all = TRUE)) # clear workspace

# change this command to your own directory where you put the data file 
# setwd("C:\\Users\\kevin\\Documents\\R")
# ^^^^^^

load("acs2017_ny_data.RData")
#glimpse(acs2017_ny) try this later
acs2017_ny[1:10,1:7]
##    AGE female educ_nohs educ_hs educ_somecoll educ_college educ_advdeg
## 1   72      1         0       0             0            0           1
## 2   72      0         0       0             0            0           1
## 3   31      0         0       0             0            1           0
## 4   28      1         0       0             0            1           0
## 5   54      0         0       0             0            0           1
## 6   45      1         0       1             0            0           0
## 7   84      1         0       0             1            0           0
## 8   71      0         0       0             0            1           0
## 9   68      1         0       0             1            0           0
## 10  37      1         1       0             0            0           0
attach(acs2017_ny)

In the next section I’ll explain more about the data and what the lines mean, but for now, AGE is the person’s age in years and female is a 0/1 variable (taking the value 1 for true and 0 for false). So if you look at the output, you can see that the first person, on line 1, is a 72-year-old female; next is a 72-year-old male, then a 31-year-old male, etc. The set of variables beginning “educ_” are also 0/1, so the first 2 people have advanced degrees and the next 2 have college degrees.
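If you want to double-check a single person, you can pull out one row and a few columns by name. This is just a small sketch, with the row and columns chosen arbitrarily:

acs2017_ny[2, c("AGE", "female", "educ_advdeg")]  # person 2: the 72-year-old male with an advanced degree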

You can also use the command, summary, to find out about data.

summary(acs2017_ny)
##       AGE            female         educ_nohs        educ_hs      
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:22.00   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :42.00   Median :1.0000   Median :0.000   Median :0.0000  
##  Mean   :41.57   Mean   :0.5156   Mean   :0.271   Mean   :0.2804  
##  3rd Qu.:60.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##                                                                   
##  educ_somecoll    educ_college     educ_advdeg   
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.000   Median :0.0000   Median :0.000  
##  Mean   :0.173   Mean   :0.1567   Mean   :0.119  
##  3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.000  
##                                                  
##                SCHOOL                              EDUC      
##  N/A              :  5569   Grade 12                 :55119  
##  No, not in school:144968   4 years of college       :30802  
##  Yes, in school   : 46048   5+ years of college      :23385  
##  Missing          :     0   1 year of college        :19947  
##                             Nursery school to grade 4:14240  
##                             2 years of college       :14065  
##                             (Other)                  :39027  
##                                           EDUCD      
##  Regular high school diploma                 :35689  
##  Bachelor's degree                           :30802  
##  1 or more years of college credit, no degree:19947  
##  Master's degree                             :17010  
##  Associate's degree, type not specified      :14065  
##  Some college, but less than 1 year          : 9086  
##  (Other)                                     :69986  
##                                      DEGFIELD     
##  N/A                                     :142398  
##  Business                                :  9802  
##  Education Administration and Teaching   :  6708  
##  Social Sciences                         :  4836  
##  Medical and Health Sciences and Services:  3919  
##  Fine Arts                               :  3491  
##  (Other)                                 : 25431  
##                                   DEGFIELDD     
##  N/A                                   :142398  
##  Psychology                            :  2926  
##  Business Management and Administration:  2501  
##  Accounting                            :  2284  
##  General Education                     :  2238  
##  English Language and Literature       :  2202  
##  (Other)                               : 42036  
##                                  DEGFIELD2     
##  N/A                                  :190425  
##  Business                             :   972  
##  Social Sciences                      :   853  
##  Education Administration and Teaching:   611  
##  Fine Arts                            :   465  
##  Communications                       :   352  
##  (Other)                              :  2907  
##                                                            DEGFIELD2D    
##  N/A                                                            :190425  
##  Psychology                                                     :   284  
##  Economics                                                      :   260  
##  Political Science and Government                               :   243  
##  Business Management and Administration                         :   217  
##  French, German, Latin and Other Common Foreign Language Studies:   205  
##  (Other)                                                        :  4951  
##       PUMA            GQ           OWNERSHP       OWNERSHPD    
##  Min.   : 100   Min.   :1.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:1500   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:12.00  
##  Median :3201   Median :1.000   Median :1.000   Median :13.00  
##  Mean   :2713   Mean   :1.148   Mean   :1.266   Mean   :14.95  
##  3rd Qu.:3902   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:22.00  
##  Max.   :4114   Max.   :5.000   Max.   :2.000   Max.   :22.00  
##                                                                
##     MORTGAGE        OWNCOST           RENT         COSTELEC   
##  Min.   :0.000   Min.   :    0   Min.   :   0   Min.   :   0  
##  1st Qu.:0.000   1st Qu.: 1208   1st Qu.:   0   1st Qu.: 960  
##  Median :1.000   Median : 2891   Median :   0   Median :1560  
##  Mean   :1.453   Mean   :38582   Mean   : 393   Mean   :2311  
##  3rd Qu.:3.000   3rd Qu.:99999   3rd Qu.: 630   3rd Qu.:2520  
##  Max.   :4.000   Max.   :99999   Max.   :3800   Max.   :9997  
##                                                               
##     COSTGAS        COSTWATR       COSTFUEL       HHINCOME      
##  Min.   :   0   Min.   :   0   Min.   :   0   Min.   : -11800  
##  1st Qu.: 840   1st Qu.: 320   1st Qu.:9993   1st Qu.:  41600  
##  Median :2400   Median :1400   Median :9993   Median :  81700  
##  Mean   :5032   Mean   :4836   Mean   :7935   Mean   : 114902  
##  3rd Qu.:9993   3rd Qu.:9993   3rd Qu.:9993   3rd Qu.: 140900  
##  Max.   :9997   Max.   :9997   Max.   :9997   Max.   :2030000  
##                                               NA's   :10630    
##     FOODSTMP        LINGISOL         ROOMS           BUILTYR2     
##  Min.   :1.000   Min.   :0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 4.000   1st Qu.: 1.000  
##  Median :1.000   Median :1.000   Median : 6.000   Median : 3.000  
##  Mean   :1.147   Mean   :1.002   Mean   : 5.887   Mean   : 3.711  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.: 8.000   3rd Qu.: 5.000  
##  Max.   :2.000   Max.   :2.000   Max.   :16.000   Max.   :22.000  
##                                                                   
##     UNITSSTR        FUELHEAT          SSMC            FAMSIZE      
##  Min.   : 0.00   Min.   :0.000   Min.   :0.00000   Min.   : 1.000  
##  1st Qu.: 3.00   1st Qu.:2.000   1st Qu.:0.00000   1st Qu.: 2.000  
##  Median : 3.00   Median :2.000   Median :0.00000   Median : 3.000  
##  Mean   : 4.39   Mean   :2.959   Mean   :0.01102   Mean   : 3.087  
##  3rd Qu.: 6.00   3rd Qu.:4.000   3rd Qu.:0.00000   3rd Qu.: 4.000  
##  Max.   :10.00   Max.   :9.000   Max.   :2.00000   Max.   :19.000  
##                                                                    
##      NCHILD           NCHLT5            RELATE          RELATED      
##  Min.   :0.0000   Min.   :0.00000   Min.   : 1.000   Min.   : 101.0  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 1.000   1st Qu.: 101.0  
##  Median :0.0000   Median :0.00000   Median : 2.000   Median : 201.0  
##  Mean   :0.5009   Mean   :0.08441   Mean   : 3.307   Mean   : 335.6  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 3.000   3rd Qu.: 301.0  
##  Max.   :9.0000   Max.   :5.00000   Max.   :13.000   Max.   :1301.0  
##                                                                      
##      MARST            RACE          RACED         HISPAN      
##  Min.   :1.000   Min.   :1.00   Min.   :100   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:100   1st Qu.:0.0000  
##  Median :5.000   Median :1.00   Median :100   Median :0.0000  
##  Mean   :3.742   Mean   :2.03   Mean   :205   Mean   :0.4153  
##  3rd Qu.:6.000   3rd Qu.:2.00   3rd Qu.:200   3rd Qu.:0.0000  
##  Max.   :6.000   Max.   :9.00   Max.   :990   Max.   :4.0000  
##                                                               
##     HISPAND                  BPL                         BPLD       
##  Min.   :  0.00   New York     :128517   New York          :128517  
##  1st Qu.:  0.00   West Indies  :  8481   China             :  4116  
##  Median :  0.00   China        :  4964   Dominican Republic:  3517  
##  Mean   : 44.75   SOUTH AMERICA:  4957   Pennsylvania      :  3303  
##  3rd Qu.:  0.00   India        :  3476   New Jersey        :  3127  
##  Max.   :498.00   Pennsylvania :  3303   Puerto Rico       :  2272  
##                   (Other)      : 42887   (Other)           : 51733  
##                      ANCESTR1    
##  Not Reported            :32021  
##  Italian                 :20577  
##  Irish, various subheads,:16388  
##  German                  :12781  
##  African-American        : 9559  
##  United States           : 8209  
##  (Other)                 :97050  
##                                    ANCESTR1D             ANCESTR2     
##  Not Reported                           :32021   Not Reported:141487  
##  Italian (1990-2000, ACS, PRCS)         :20577   German      :  9476  
##  Irish                                  :15651   Irish       :  9238  
##  German (1990-2000, ACS/PRCS)           :12605   English     :  4895  
##  African-American (1990-2000, ACS, PRCS): 9559   Italian     :  4531  
##  United States                          : 8209   Polish      :  3113  
##  (Other)                                :97963   (Other)     : 23845  
##                           ANCESTR2D         CITIZEN          YRSUSA1      
##  Not Reported                  :141487   Min.   :0.0000   Min.   : 0.000  
##  German (1990-2000, ACS, PRCS) :  9441   1st Qu.:0.0000   1st Qu.: 0.000  
##  Irish                         :  8809   Median :0.0000   Median : 0.000  
##  English                       :  4895   Mean   :0.4793   Mean   : 5.377  
##  Italian (1990-2000, ACS, PRCS):  4531   3rd Qu.:0.0000   3rd Qu.: 0.000  
##  Polish                        :  3113   Max.   :3.0000   Max.   :92.000  
##  (Other)                       : 24309                                    
##     HCOVANY         HCOVPRIV         SEX            EMPSTAT     
##  Min.   :1.000   Min.   :1.000   Male  : 95222   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:1.000   Female:101363   1st Qu.:1.000  
##  Median :2.000   Median :2.000                   Median :1.000  
##  Mean   :1.951   Mean   :1.691                   Mean   :1.514  
##  3rd Qu.:2.000   3rd Qu.:2.000                   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :2.000                   Max.   :3.000  
##                                                                 
##     EMPSTATD        LABFORCE          OCC              IND       
##  Min.   : 0.00   Min.   :0.000   0      : 79987   0      :79987  
##  1st Qu.:10.00   1st Qu.:1.000   2310   :  3494   7860   : 9025  
##  Median :10.00   Median :2.000   5700   :  3235   8680   : 6354  
##  Mean   :15.16   Mean   :1.331   430    :  3025   770    : 6279  
##  3rd Qu.:30.00   3rd Qu.:2.000   4720   :  2666   8190   : 5873  
##  Max.   :30.00   Max.   :2.000   4760   :  2563   7870   : 4041  
##                                  (Other):101615   (Other):85026  
##     CLASSWKR       CLASSWKRD        WKSWORK2        UHRSWORK    
##  Min.   :0.000   Min.   : 0.00   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:0.000   1st Qu.: 0.00   1st Qu.:0.000   1st Qu.: 0.00  
##  Median :2.000   Median :22.00   Median :1.000   Median :12.00  
##  Mean   :1.116   Mean   :13.03   Mean   :2.701   Mean   :19.77  
##  3rd Qu.:2.000   3rd Qu.:22.00   3rd Qu.:6.000   3rd Qu.:40.00  
##  Max.   :2.000   Max.   :29.00   Max.   :6.000   Max.   :99.00  
##                                                                 
##      INCTOT           FTOTINC           INCWAGE          POVERTY     
##  Min.   :  -7300   Min.   : -11800   Min.   :     0   Min.   :  0.0  
##  1st Qu.:   8000   1st Qu.:  35550   1st Qu.:     0   1st Qu.:159.0  
##  Median :  25000   Median :  74000   Median : 10000   Median :351.0  
##  Mean   :  45245   Mean   : 107111   Mean   : 33796   Mean   :318.7  
##  3rd Qu.:  56500   3rd Qu.: 132438   3rd Qu.: 47000   3rd Qu.:501.0  
##  Max.   :1563000   Max.   :2030000   Max.   :638000   Max.   :501.0  
##  NA's   :31129     NA's   :10817     NA's   :33427                   
##     MIGRATE1       MIGRATE1D        MIGPLAC1         MIGCOUNTY1     
##  Min.   :0.000   Min.   : 0.00   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.:1.000   1st Qu.:10.00   1st Qu.:  0.000   1st Qu.:  0.000  
##  Median :1.000   Median :10.00   Median :  0.000   Median :  0.000  
##  Mean   :1.122   Mean   :11.51   Mean   :  6.184   Mean   :  4.117  
##  3rd Qu.:1.000   3rd Qu.:10.00   3rd Qu.:  0.000   3rd Qu.:  0.000  
##  Max.   :4.000   Max.   :40.00   Max.   :900.000   Max.   :810.000  
##                                                                     
##     MIGPUMA1        VETSTAT          VETSTATD         PWPUMA00    
##  Min.   :    0   Min.   :0.0000   Min.   : 0.000   Min.   :    0  
##  1st Qu.:    0   1st Qu.:1.0000   1st Qu.:11.000   1st Qu.:    0  
##  Median :    0   Median :1.0000   Median :11.000   Median :    0  
##  Mean   :  277   Mean   :0.8621   Mean   : 9.412   Mean   : 1255  
##  3rd Qu.:    0   3rd Qu.:1.0000   3rd Qu.:11.000   3rd Qu.: 3100  
##  Max.   :70100   Max.   :2.0000   Max.   :20.000   Max.   :59300  
##                                                                   
##     TRANWORK         TRANTIME         DEPARTS           in_NYC      
##  Min.   : 0.000   Min.   :  0.00   Min.   :   0.0   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:   0.0   1st Qu.:0.0000  
##  Median : 0.000   Median :  0.00   Median :   0.0   Median :0.0000  
##  Mean   : 9.725   Mean   : 14.75   Mean   : 373.3   Mean   :0.3615  
##  3rd Qu.:10.000   3rd Qu.: 20.00   3rd Qu.: 732.0   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :138.00   Max.   :2345.0   Max.   :1.0000  
##                                                                     
##     in_Bronx       in_Manhattan       in_StatenI       in_Brooklyn   
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.000  
##  Mean   :0.0538   Mean   :0.04981   Mean   :0.02084   Mean   :0.126  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000  
##                                                                      
##    in_Queens      in_Westchester      in_Nassau          Hispanic     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.0000  
##  Mean   :0.1111   Mean   :0.04413   Mean   :0.07032   Mean   :0.1387  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##                                                                       
##     Hisp_Mex          Hisp_PR         Hisp_Cuban         Hisp_DomR      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.000000   Median :0.00000  
##  Mean   :0.01626   Mean   :0.0436   Mean   :0.003403   Mean   :0.02827  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000000   Max.   :1.00000  
##                                                                         
##      white             AfAm          Amindian            Asian        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.000   Median :0.000000   Median :0.00000  
##  Mean   :0.6997   Mean   :0.125   Mean   :0.003779   Mean   :0.08656  
##  3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.000000   Max.   :1.00000  
##                                                                       
##     race_oth        unmarried       veteran        has_AnyHealthIns
##  Min.   :0.0000   Min.   :0.00   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.00000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.00   Median :0.00000   Median :1.0000  
##  Mean   :0.1324   Mean   :0.45   Mean   :0.04443   Mean   :0.9513  
##  3rd Qu.:0.0000   3rd Qu.:1.00   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00   Max.   :1.00000   Max.   :1.0000  
##                                                                    
##  has_PvtHealthIns  Commute_car      Commute_bus      Commute_subway   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.6906   Mean   :0.2997   Mean   :0.02162   Mean   :0.07468  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                       
##   Commute_rail     Commute_other     below_povertyline below_150poverty
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000     1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.000     Median :0.0000  
##  Mean   :0.01332   Mean   :0.05506   Mean   :0.122     Mean   :0.1965  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000     3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000     Max.   :1.0000  
##                                                                        
##  below_200poverty   foodstamps    
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.2676   Mean   :0.1465  
##  3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000  
## 
print(NN_obs <- length(AGE))
## [1] 196585

So this shows that there are 196,585 people in this dataset.

Simple Stats

We compare the average age of the men and the women in the data,

summary(AGE[female == 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   23.00   44.00   42.72   61.00   95.00
summary(AGE[!female])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   40.00   40.35   59.00   95.00

This uses the female dummy variable. The comparison is between those for whom female=1 (i.e. women) and those for whom it is not (logical not, denoted with the “!” symbol, i.e. men). I know, you can – and people do! – worry that this binary classification for gender misses some people; government statistics are just not there yet. Progress is slower than we’d like.

Women in this dataset are, on average, a bit older, with an average age of 42.7 compared with 40.4 for men. You might wonder (if you were to begin to think like a statistician) whether that is a big difference - hold onto that thought! Alternatively, you can use the commands for the average, mean(), and the standard deviation, sd(), to get those statistics:

# here i want to find average ages of men and women
mean(AGE[female == 1])
## [1] 42.71629
sd(AGE[female == 1])
## [1] 23.72012
mean(AGE[!female])
## [1] 40.35398
sd(AGE[!female])
## [1] 23.1098

Later you might encounter cases where you want more complicated dummy variables, built with the logical relations “and”, “or”, and “not” (the symbols “&”, “|”, and “!”), with comparisons such as “>=”, or with multiplication and division.
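For example (just a sketch, with the combinations chosen arbitrarily):

mean(AGE[(female == 1) & (educ_college == 1)])  # average age of college-educated women
mean(AGE[(AGE < 18) | (AGE >= 65)])             # average age of people outside the 18-64 range
mean(AGE[!(female == 1)])                       # another way of picking out the men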

As you’re going along, if you’re copying-and-pasting then you might not have had trouble, but if you’re typing then you’ve probably realized that R is persnickety - the variable AGE is not equivalent to Age nor age nor aGe… You might be wondering if there’s an easier way than copy-paste; there is. If you highlight text in the Source pane and then hit CTRL-Enter, that will run the block of code. Alternatively, if you click the “Knit HTML” button with the Rmd file open, that will recreate the webpage as well as run all of the commands.

More Details

At the beginning you’ll be just copying commands that I give you but as you start making changes, you have to understand what’s going on.

For better or worse, there is rarely just a single way of doing things; when people talk about the R language, it does indeed have many features of a language. Just as a spoken language has many ways of saying “hi”, so too there are lots of different ways to ask R to produce means for different groups. I am showing you a few particular ways, but if you look around online you’ll find others.

Remember to save your work. The computers in the lab wipe the memory clean when you log off, so back up your files - either online (email it to yourself or upload to Google Drive or iCloud or Dropbox or whatever) and/or on a USB drive.

But be careful - you have the option of saving your code and/or your workspace; it is usually better to save the code. For example, your code might load some data (at this stage, usually data that I’ve provided). If you make changes to this data during your session, you might not want to restart with those changes. If you just save your code and re-run the program, you will start fresh. If instead you save the workspace, then you end up saving all of your scraps and changes - which eventually start to build up. So only save your workspace if you’ve run a large chunk of code that you don’t want to have to later re-run. (This is even more true if you “attach” data frames without later “detach”-ing them!)

There are 2 main file types: R script (lines of instructions to R on how to calculate some statistics) and R Markdown (which combines the instructions to R along with text around it, which is how I created this file). There are more types of course but those are the ones for now. To create files like the one you’re reading now, use “File \ New File \ R Markdown…” Give it a title and save the file with some name that you’ll remember and the extension, “.Rmd”. Remember the subdirectory or folder.

There is a bigger philosophy behind this: the idea of “reproducible research,” that if you share your .Rmd file then anybody else can run the exact same program and get the exact same results. It’s a way of convincing an audience that there was no skulduggery in how the data relates to the conclusion. It’s useful in class since it’s an easy way to submit homework or share work with your study group.

Variable Coding

Some of the PUMS variables here have a natural interpretation: for instance, AGE is measured in years. Actually even this has a bit of a twist - look at the histogram.

hist(AGE[(AGE > 90)])

There is a bit of weirdness at the right, where it looks like there are suddenly a bunch of people who are 95 but nobody is 94 or 96. This is due to a coding choice by the Census, where really old people are just labeled as “95” (top-coding), so that value actually should be interpreted as meaning “92 or older”. So if you were to get finicky (and every good statistician is!), you might go back to the earlier calculations of averages and modify them all like this, to select just those who are female and who are coded as having an age less than 90. Many variables are topcoded! And recall that topcoding wouldn’t change the median values calculated before, which is a point in favor of that statistic.

mean(AGE[ (female == 1) & (AGE<90) ]) 
## [1] 41.98866

Go make those other changes and figure out how top-coding changes the calculations of average age by gender – I’ll wait right here…

Variable Coding Again

So we were saying that some variables, like AGE - ahem! - have a natural interpretation as a number.

Others are logical variables (called dummies) like female, Hispanic, or married - there is a yes/no answer that is coded 1/0. Note that if you’re creating these on your own, it’s good to give names that suggest that sort of yes/no answer, so a variable named ‘female’ is better than one named ‘gender’, where you’d have to remember who is coded as true and who as false.

Many variables, like PUMA, have no natural interpretation as numbers at all. Here are the first few codes,

str(as.numeric(PUMA))
##  num [1:196585] 902 902 4002 4002 3803 ...

You have to go to the codebook (or, in this case, the file PUMA_levels.csv or acs2017_codebook.txt from the zip file) to find out that 3801 codes for Washington Heights/Inwood, 3802 is Hamilton Heights/Manhattanville/West Harlem, etc. The program will happily calculate the average value for PUMA (type in mean(PUMA) and see for yourself!) but this is a meaningless value – wtf is the average neighborhood code value!? If you want to select just the people living in a particular neighborhood then you’d have to look up its code in the list below (there’s a short sketch of doing that after the list).

PUMA Neighborhood
3701 NYC-Bronx CD 8–Riverdale, Fieldston & Kingsbridge
3702 NYC-Bronx CD 12–Wakefield, Williamsbridge & Woodlawn
3703 NYC-Bronx CD 10–Co-op City, Pelham Bay & Schuylerville
3704 NYC-Bronx CD 11–Pelham Parkway, Morris Park & Laconia
3705 NYC-Bronx CD 3 & 6–Belmont, Crotona Park East & East Tremont
3706 NYC-Bronx CD 7–Bedford Park, Fordham North & Norwood
3707 NYC-Bronx CD 5–Morris Heights, Fordham South & Mount Hope
3708 NYC-Bronx CD 4–Concourse, Highbridge & Mount Eden
3709 NYC-Bronx CD 9–Castle Hill, Clason Point & Parkchester
3710 NYC-Bronx CD 1 & 2–Hunts Point, Longwood & Melrose
3801 NYC-Manhattan CD 12–Washington Heights, Inwood & Marble Hill
3802 NYC-Manhattan CD 9–Hamilton Heights, Manhattanville & West Harlem
3803 NYC-Manhattan CD 10–Central Harlem
3804 NYC-Manhattan CD 11–East Harlem
3805 NYC-Manhattan CD 8–Upper East Side
3806 NYC-Manhattan CD 7–Upper West Side & West Side
3807 NYC-Manhattan CD 4 & 5–Chelsea, Clinton & Midtown Business District
3808 NYC-Manhattan CD 6–Murray Hill, Gramercy & Stuyvesant Town
3809 NYC-Manhattan CD 3–Chinatown & Lower East Side
3810 NYC-Manhattan CD 1 & 2–Battery Park City, Greenwich Village & Soho
3901 NYC-Staten Island CD 3–Tottenville, Great Kills & Annadale
3902 NYC-Staten Island CD 2–New Springville & South Beach
3903 NYC-Staten Island CD 1–Port Richmond, Stapleton & Mariner’s Harbor
4001 NYC-Brooklyn CD 1–Greenpoint & Williamsburg
4002 NYC-Brooklyn CD 4—Bushwick
4003 NYC-Brooklyn CD 3–Bedford-Stuyvesant
4004 NYC-Brooklyn CD 2–Brooklyn Heights & Fort Greene
4005 NYC-Brooklyn CD 6–Park Slope, Carroll Gardens & Red Hook
4006 NYC-Brooklyn CD 8–Crown Heights North & Prospect Heights
4007 NYC-Brooklyn CD 16–Brownsville & Ocean Hill
4008 NYC-Brooklyn CD 5–East New York & Starrett City
4009 NYC-Brooklyn CD 18–Canarsie & Flatlands
4010 NYC-Brooklyn CD 17–East Flatbush, Farragut & Rugby
4011 NYC-Brooklyn CD 9–Crown Heights South, Prospect Lefferts & Wingate
4012 NYC-Brooklyn CD 7–Sunset Park & Windsor Terrace
4013 NYC-Brooklyn CD 10–Bay Ridge & Dyker Heights
4014 NYC-Brooklyn CD 12–Borough Park, Kensington & Ocean Parkway
4015 NYC-Brooklyn CD 14–Flatbush & Midwood
4016 NYC-Brooklyn CD 15–Sheepshead Bay, Gerritsen Beach & Homecrest
4017 NYC-Brooklyn CD 11–Bensonhurst & Bath Beach
4018 NYC-Brooklyn CD 13–Brighton Beach & Coney Island
4101 NYC-Queens CD 1–Astoria & Long Island City
4102 NYC-Queens CD 3–Jackson Heights & North Corona
4103 NYC-Queens CD 7–Flushing, Murray Hill & Whitestone
4104 NYC-Queens CD 11–Bayside, Douglaston & Little Neck
4105 NYC-Queens CD 13–Queens Village, Cambria Heights & Rosedale
4106 NYC-Queens CD 8–Briarwood, Fresh Meadows & Hillcrest
4107 NYC-Queens CD 4–Elmhurst & South Corona
4108 NYC-Queens CD 6–Forest Hills & Rego Park
4109 NYC-Queens CD 2–Sunnyside & Woodside
4110 NYC-Queens CD 5–Ridgewood, Glendale & Middle Village
4111 NYC-Queens CD 9–Richmond Hill & Woodhaven
4112 NYC-Queens CD 12–Jamaica, Hollis & St. Albans
4113 NYC-Queens CD 10–Howard Beach & Ozone Park
4114 NYC-Queens CD 14–Far Rockaway, Breezy Point & Broad Channel
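For instance, using one code from the list above (a small sketch; 3801 is Washington Heights, Inwood & Marble Hill):

summary(AGE[PUMA == 3801])  # ages of people in just that neighborhood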

Now you’re probably thinking, isn’t there some easier way? Yes there is. R has variables called “factors” that join together the long list of codes with a separate file telling what those codes mean. Later when we do further statistics, R will know how to appropriately treat these factors. (Also it will then give an error if you calculate mean(PUMA), which is proper.)

PUMA <- as.factor(PUMA)
female <- as.factor(female)

I will leave you to worry over the recoding of the other variables, because it’s good for the soul. I will show you 2 ways – the quick and dirty way, and the fancy correct way.

First the quick and dirty way.

print(levels(female))
## [1] "0" "1"
levels(female) <- c("male","female")

Well, quick and dirty ways, plural - here’s another:

educ_indx <- factor((educ_nohs + 2*educ_hs + 3*educ_somecoll + 4*educ_college + 5*educ_advdeg),
                    levels = c(1,2,3,4,5),
                    labels = c("No HS","HS","SmColl","Bach","Adv"))

(If you can figure out how that bit of code works, that would be good.)

These just type in the levels by hand. But for something like PUMA, that would be a long list, and the typed levels might not even match every code. To do it better, we need help from an R package.

Detour on Packages

But first a bit of a detour, to mention how to use packages. R depends crucially on “packages” - that’s the whole reason the open-source model works. Some statistician invents a cool new technique, then writes up the code in R and makes it available. If you used a commercial program, you’d have to wait a decade for them to update it; in R, it’s here now. Also, if somebody hacks together a nicer or easier way to do stuff, they write that up too.

So enter this into the Console,

install.packages("tidyverse")
install.packages("plyr")

then

library(tidyverse)
library(plyr)
levels_n <- read.csv("PUMA_levels.csv")
levels_orig <- levels(PUMA) 
levels_new <- join(data.frame(levels_orig),data.frame(levels_n))
levels(PUMA) <- levels_new$New_Level

Alternatively, from R-Studio, click “Tools” then “Install Packages…” and tell it to install the packages “plyr” and “tidyverse”. That route is nice if you want to browse some of the available packages or if you don’t quite remember a name. Then the next piece of code, library, tells the program that you want to use commands from those packages.

Those commands read in a little csv file that I made with the PUMA codes, then match the old codes with the new, complete text. Note that I’m lazy, so codes in NY state outside of NYC are coded NA.

Back from Detour

R will do the summary differently when it knows the variable is a factor,

summary(female)
##   male female 
##  95222 101363
summary(PUMA)
##                   NYC-Bronx CD 8--Riverdale, Fieldston & Kingsbridge 
##                                                                  936 
##                NYC-Bronx CD 12--Wakefield, Williamsbridge & Woodlawn 
##                                                                 1092 
##              NYC-Bronx CD 10--Co-op City, Pelham Bay & Schuylerville 
##                                                                  767 
##               NYC-Bronx CD 11--Pelham Parkway, Morris Park & Laconia 
##                                                                 1115 
##        NYC-Bronx CD 3 & 6--Belmont, Crotona Park East & East Tremont 
##                                                                 1311 
##                NYC-Bronx CD 7--Bedford Park, Fordham North & Norwood 
##                                                                  854 
##           NYC-Bronx CD 5--Morris Heights, Fordham South & Mount Hope 
##                                                                 1112 
##                   NYC-Bronx CD 4--Concourse, Highbridge & Mount Eden 
##                                                                  917 
##              NYC-Bronx CD 9--Castle Hill, Clason Point & Parkchester 
##                                                                 1307 
##                  NYC-Bronx CD 1 & 2--Hunts Point, Longwood & Melrose 
##                                                                 1166 
##        NYC-Manhattan CD 12--Washington Heights, Inwood & Marble Hill 
##                                                                 1238 
##   NYC-Manhattan CD 9--Hamilton Heights, Manhattanville & West Harlem 
##                                                                  872 
##                                  NYC-Manhattan CD 10--Central Harlem 
##                                                                  897 
##                                     NYC-Manhattan CD 11--East Harlem 
##                                                                  769 
##                                  NYC-Manhattan CD 8--Upper East Side 
##                                                                 1167 
##                      NYC-Manhattan CD 7--Upper West Side & West Side 
##                                                                  949 
## NYC-Manhattan CD 4 & 5--Chelsea, Clinton & Midtown Business District 
##                                                                  944 
##          NYC-Manhattan CD 6--Murray Hill, Gramercy & Stuyvesant Town 
##                                                                  758 
##                      NYC-Manhattan CD 3--Chinatown & Lower East Side 
##                                                                 1062 
##  NYC-Manhattan CD 1 & 2--Battery Park City, Greenwich Village & Soho 
##                                                                 1136 
##          NYC-Staten Island CD 3--Tottenville, Great Kills & Annadale 
##                                                                 1303 
##                NYC-Staten Island CD 2--New Springville & South Beach 
##                                                                 1173 
##  NYC-Staten Island CD 1--Port Richmond, Stapleton & Mariner's Harbor 
##                                                                 1621 
##                         NYC-Brooklyn CD 1--Greenpoint & Williamsburg 
##                                                                 1293 
##                                          NYC-Brooklyn CD 4--Bushwick 
##                                                                 1060 
##                                NYC-Brooklyn CD 3--Bedford-Stuyvesant 
##                                                                 1082 
##                    NYC-Brooklyn CD 2--Brooklyn Heights & Fort Greene 
##                                                                 1320 
##            NYC-Brooklyn CD 6--Park Slope, Carroll Gardens & Red Hook 
##                                                                 1168 
##            NYC-Brooklyn CD 8--Crown Heights North & Prospect Heights 
##                                                                 1077 
##                         NYC-Brooklyn CD 16--Brownsville & Ocean Hill 
##                                                                  904 
##                     NYC-Brooklyn CD 5--East New York & Starrett City 
##                                                                 1321 
##                             NYC-Brooklyn CD 18--Canarsie & Flatlands 
##                                                                 2422 
##                  NYC-Brooklyn CD 17--East Flatbush, Farragut & Rugby 
##                                                                 1250 
##  NYC-Brooklyn CD 9--Crown Heights South, Prospect Lefferts & Wingate 
##                                                                  818 
##                     NYC-Brooklyn CD 7--Sunset Park & Windsor Terrace 
##                                                                 1291 
##                        NYC-Brooklyn CD 10--Bay Ridge & Dyker Heights 
##                                                                 1519 
##         NYC-Brooklyn CD 12--Borough Park, Kensington & Ocean Parkway 
##                                                                 1698 
##                               NYC-Brooklyn CD 14--Flatbush & Midwood 
##                                                                 1479 
##      NYC-Brooklyn CD 15--Sheepshead Bay, Gerritsen Beach & Homecrest 
##                                                                 1903 
##                         NYC-Brooklyn CD 11--Bensonhurst & Bath Beach 
##                                                                 2234 
##                    NYC-Brooklyn CD 13--Brighton Beach & Coney Island 
##                                                                  925 
##                          NYC-Queens CD 1--Astoria & Long Island City 
##                                                                 1748 
##                      NYC-Queens CD 3--Jackson Heights & North Corona 
##                                                                 1316 
##                  NYC-Queens CD 7--Flushing, Murray Hill & Whitestone 
##                                                                 2290 
##                  NYC-Queens CD 11--Bayside, Douglaston & Little Neck 
##                                                                 1344 
##         NYC-Queens CD 13--Queens Village, Cambria Heights & Rosedale 
##                                                                 2148 
##                NYC-Queens CD 8--Briarwood, Fresh Meadows & Hillcrest 
##                                                                 1393 
##                             NYC-Queens CD 4--Elmhurst & South Corona 
##                                                                  973 
##                            NYC-Queens CD 6--Forest Hills & Rego Park 
##                                                                 1041 
##                                NYC-Queens CD 2--Sunnyside & Woodside 
##                                                                 1158 
##                NYC-Queens CD 5--Ridgewood, Glendale & Middle Village 
##                                                                 2040 
##                           NYC-Queens CD 9--Richmond Hill & Woodhaven 
##                                                                 1694 
##                       NYC-Queens CD 12--Jamaica, Hollis & St. Albans 
##                                                                 2438 
##                          NYC-Queens CD 10--Howard Beach & Ozone Park 
##                                                                 1304 
##         NYC-Queens CD 14--Far Rockaway, Breezy Point & Broad Channel 
##                                                                  954 
##                                                                 NA's 
##                                                               125514
summary(educ_indx)
##  No HS     HS SmColl   Bach    Adv 
##  53267  55119  34012  30802  23385

To find mean and standard deviation by neighborhood, you could use something like this,

ddply(acs2017_ny, .(PUMA), summarize, mean = round(mean(AGE), 2), sd = round(sd(AGE), 2))
##     PUMA  mean    sd
## 1    100 41.70 23.85
## 2    200 43.47 23.45
## 3    300 44.88 23.60
## 4    401 45.12 24.28
## 5    402 42.23 24.18
## 6    403 43.22 23.66
## 7    500 40.01 23.73
## 8    600 39.77 23.29
## 9    701 36.64 22.44
## 10   702 42.13 23.02
## 11   703 43.70 24.11
## 12   704 44.71 23.83
## 13   800 43.93 23.62
## 14   901 45.35 24.04
## 15   902 40.74 22.29
## 16   903 38.70 22.42
## 17   904 43.70 23.84
## 18   905 40.84 22.75
## 19   906 41.48 24.26
## 20  1000 42.77 23.43
## 21  1101 44.33 24.22
## 22  1102 42.49 23.23
## 23  1201 44.28 24.74
## 24  1202 42.72 24.68
## 25  1203 43.80 23.84
## 26  1204 43.57 23.84
## 27  1205 40.35 23.50
## 28  1206 37.94 22.96
## 29  1207 45.42 23.92
## 30  1300 43.73 23.35
## 31  1400 43.80 24.01
## 32  1500 40.94 23.49
## 33  1600 43.30 24.68
## 34  1700 43.52 23.86
## 35  1801 42.63 23.60
## 36  1802 43.08 23.23
## 37  1900 42.81 23.57
## 38  2001 37.81 22.86
## 39  2002 43.45 23.90
## 40  2100 46.88 23.54
## 41  2201 41.39 23.92
## 42  2202 44.64 23.67
## 43  2203 44.31 24.08
## 44  2300 37.97 22.08
## 45  2401 45.15 23.39
## 46  2402 42.78 24.42
## 47  2500 42.73 24.33
## 48  2600 43.71 24.47
## 49  2701 44.76 23.71
## 50  2702 42.69 23.16
## 51  2801 43.81 23.74
## 52  2802 43.19 23.91
## 53  2901 41.72 23.03
## 54  2902 42.99 22.82
## 55  2903 34.24 23.39
## 56  3001 44.04 23.76
## 57  3002 42.34 24.24
## 58  3003 32.23 24.23
## 59  3101 43.50 22.72
## 60  3102 43.58 23.51
## 61  3103 43.50 23.89
## 62  3104 39.12 23.54
## 63  3105 42.69 24.40
## 64  3106 42.68 23.85
## 65  3107 42.05 23.57
## 66  3201 43.25 25.02
## 67  3202 44.08 24.19
## 68  3203 43.40 24.06
## 69  3204 43.77 23.56
## 70  3205 42.64 23.83
## 71  3206 40.57 23.55
## 72  3207 42.79 24.08
## 73  3208 43.68 23.96
## 74  3209 43.11 23.37
## 75  3210 42.34 23.58
## 76  3211 41.49 22.93
## 77  3212 44.51 25.12
## 78  3301 45.37 24.21
## 79  3302 43.40 24.24
## 80  3303 43.01 24.03
## 81  3304 41.35 23.85
## 82  3305 48.60 24.31
## 83  3306 43.79 22.92
## 84  3307 41.86 22.68
## 85  3308 41.31 23.04
## 86  3309 43.69 23.07
## 87  3310 39.30 22.38
## 88  3311 40.25 23.23
## 89  3312 40.43 22.31
## 90  3313 40.90 23.82
## 91  3701 43.12 25.58
## 92  3702 40.22 22.96
## 93  3703 43.63 24.07
## 94  3704 42.05 24.57
## 95  3705 34.78 22.47
## 96  3706 35.15 22.40
## 97  3707 33.70 22.15
## 98  3708 35.25 22.01
## 99  3709 38.88 23.99
## 100 3710 35.34 22.09
## 101 3801 40.55 22.58
## 102 3802 35.62 20.80
## 103 3803 39.45 21.16
## 104 3804 38.39 21.37
## 105 3805 43.53 23.63
## 106 3806 42.44 22.85
## 107 3807 40.20 19.30
## 108 3808 40.66 21.37
## 109 3809 40.98 22.18
## 110 3810 39.03 20.90
## 111 3901 42.89 23.76
## 112 3902 41.08 23.88
## 113 3903 40.75 22.72
## 114 4001 35.39 20.42
## 115 4002 34.12 19.37
## 116 4003 36.03 20.78
## 117 4004 36.75 20.23
## 118 4005 36.84 20.31
## 119 4006 38.50 20.69
## 120 4007 39.63 21.48
## 121 4008 36.65 22.17
## 122 4009 41.51 23.14
## 123 4010 42.14 23.08
## 124 4011 39.77 22.98
## 125 4012 36.75 21.46
## 126 4013 42.91 22.97
## 127 4014 35.35 24.37
## 128 4015 38.65 23.33
## 129 4016 42.18 24.11
## 130 4017 39.67 22.74
## 131 4018 45.54 24.77
## 132 4101 38.65 20.21
## 133 4102 38.72 22.55
## 134 4103 44.60 23.11
## 135 4104 45.76 23.24
## 136 4105 41.99 23.29
## 137 4106 40.17 23.88
## 138 4107 40.09 21.87
## 139 4108 42.64 23.42
## 140 4109 41.20 20.83
## 141 4110 40.21 22.13
## 142 4111 40.69 21.64
## 143 4112 40.37 22.80
## 144 4113 39.78 22.77
## 145 4114 39.47 24.25

Although tapply would also work fine.
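For instance, a quick sketch of the tapply version (without the rounding):

tapply(acs2017_ny$AGE, acs2017_ny$PUMA, mean)  # mean age by neighborhood code
tapply(acs2017_ny$AGE, acs2017_ny$PUMA, sd)    # standard deviation by neighborhood code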

Here’s the 90th and 10th percentiles of wages by neighborhood,

dat_use1 <- subset(acs2017_ny,((INCWAGE > 0) & in_NYC))
ddply(dat_use1, .(PUMA), summarize, inc90 = quantile(INCWAGE,probs = 0.9), inc10 = quantile(INCWAGE,probs = 0.1), n_obs = length(INCWAGE))
##    PUMA  inc90 inc10 n_obs
## 1  3701 120000  3220   424
## 2  3702  85000  6700   541
## 3  3703 100500  3750   366
## 4  3704  90000  6980   510
## 5  3705  52000  3000   537
## 6  3706  63200  5940   359
## 7  3707  60000  4000   439
## 8  3708  62000  6000   376
## 9  3709  78800  5220   503
## 10 3710  55000  3580   420
## 11 3801 100000  5000   670
## 12 3802 120000  3000   399
## 13 3803 130000  6000   478
## 14 3804 120000  7000   368
## 15 3805 300000 17900   636
## 16 3806 326000  7860   509
## 17 3807 268000 10000   635
## 18 3808 300000 20560   460
## 19 3809 140000  5000   515
## 20 3810 300000  6000   695
## 21 3901 127000  6220   617
## 22 3902 125000  8000   524
## 23 3903 100000  7100   771
## 24 4001 149500 10000   736
## 25 4002  82000  9000   581
## 26 4003 110000  7200   557
## 27 4004 166000  7000   786
## 28 4005 200000 12000   681
## 29 4006 114000  8740   585
## 30 4007  79000  4800   361
## 31 4008  73000  6000   549
## 32 4009 100000  9600  1178
## 33 4010  80200  8360   610
## 34 4011  95400  7000   407
## 35 4012 102200  6880   625
## 36 4013 124000  7440   773
## 37 4014  90000  5590   654
## 38 4015 100000  7450   710
## 39 4016 101200  6000   899
## 40 4017  97000  7200  1070
## 41 4018 100000  5000   368
## 42 4101 104000  9600  1041
## 43 4102  82400  8000   624
## 44 4103 100000  7180  1107
## 45 4104 110000  8000   661
## 46 4105 100000  7000  1080
## 47 4106 102000  8000   641
## 48 4107  70000  8000   499
## 49 4108 140000 10000   563
## 50 4109 129600 11600   655
## 51 4110 100000 10000  1049
## 52 4111  90000  8680   865
## 53 4112  84800  7260  1213
## 54 4113  93000  6000   625
## 55 4114 108300  6700   378

You could also use table (or crosstabs) for factors with fewer items,

table(educ_indx,female)
##          female
## educ_indx  male female
##    No HS  27180  26087
##    HS     27309  27810
##    SmColl 15847  18165
##    Bach   14632  16170
##    Adv    10254  13131
xtabs(~educ_indx + female)
##          female
## educ_indx  male female
##    No HS  27180  26087
##    HS     27309  27810
##    SmColl 15847  18165
##    Bach   14632  16170
##    Adv    10254  13131

Want proportions instead of counts?

prop.table(table(educ_indx,female))
##          female
## educ_indx       male     female
##    No HS  0.13826080 0.13270087
##    HS     0.13891701 0.14146552
##    SmColl 0.08061144 0.09240278
##    Bach   0.07443091 0.08225450
##    Adv    0.05216064 0.06679553

Remember prop.table later when we do marginals.
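As a preview (a small sketch), the margin argument gives proportions within each column - here, within each gender:

prop.table(table(educ_indx, female), margin = 2)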

Try it and see what happens if you use table with PUMA…

This data includes not just whether a person has a college degree but also what field the degree was in: Economics or Psychology, for instance. Look over the codebook entries for DEGFIELD and DEGFIELDD (that second D means more detail) to see the codes. Maybe look at the 10th and 90th percentiles by degree field?
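If you want a head start, here is a sketch of one way to begin, mirroring the ddply call above (you might prefer different restrictions):

dat_deg <- subset(acs2017_ny, (INCWAGE > 0))
ddply(dat_deg, .(DEGFIELD), summarize, inc90 = quantile(INCWAGE,probs = 0.9), inc10 = quantile(INCWAGE,probs = 0.1), n_obs = length(INCWAGE))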

In general, R is very flexible, so there are often many different ways to get the same answer. There are some people who love to debate which is best. (Often there’s a tradeoff between speed and intelligibility.) For now just worry about learning at least one way. Later on you can go back and refine your techniques.

Sometimes attaching a dataset makes things easier. But as you get more advanced you might find it better to include the dataset name inside the function. There are advantages and disadvantages either way, and some of the intro texts suggest one or the other.
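For example, these should give the same number (a small sketch):

mean(acs2017_ny$AGE)         # name the data frame explicitly
with(acs2017_ny, mean(AGE))  # or use with() instead of attach()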

If you do a lot of analysis on a particular subgroup, it might be worthwhile to create a subset of that group, so that you don’t have to always add on logical conditions. These two sets of expressions, looking at “prime-age” people, get the same results:

mean(educ_nohs[(AGE >= 25)&(AGE <= 55)])
## [1] 0.08354656
mean(educ_hs[(AGE >= 25)&(AGE <= 55)])
## [1] 0.2974594
mean(educ_somecoll[(AGE >= 25)&(AGE <= 55)])
## [1] 0.2057843
mean(educ_college[(AGE >= 25)&(AGE <= 55)])
## [1] 0.2383112
mean(educ_advdeg[(AGE >= 25)&(AGE <= 55)])
## [1] 0.1748986
# alternatively
restrict1 <- as.logical((AGE >= 25)&(AGE <= 55))
dat_age_primeage <- subset(acs2017_ny, restrict1)

detach()
attach(dat_age_primeage)

mean(educ_nohs)
## [1] 0.08354656
mean(educ_hs)
## [1] 0.2974594
mean(educ_somecoll)
## [1] 0.2057843
mean(educ_college)
## [1] 0.2383112
mean(educ_advdeg)
## [1] 0.1748986
detach()

So you detach the original data frame and instead attach the restricted version. Then any subsequent analysis would be done just on that subset. Just remember that you’ve done this (again, this is a good reason to save the commands in a program so you can look back), otherwise you’ll wonder why you suddenly don’t have any kids in the sample!

Why All These Details?

You might be tired and bored by these details, but note that there are actually important choices to be made here, even in simply defining variables. Take the fraught American category of “race”. This data has a variable, RACED, showing how people chose to classify themselves, as ‘White,’ ‘Black,’ ‘American Indian or Alaska Native,’ ‘Asian,’ various combinations, and many more codes.

Suppose you wanted to find out how many Asians are in a particular population. You could count how many people identify themselves as Asian only; you could count how many people identify as Asian in any combination. Sometimes the choice is irrelevant; sometimes it can skew the final results (e.g. the question, in some areas, of whether there are more African-Americans or more Hispanics).

Again, there’s no “right” way to do it because there’s no science in this peculiar-but-popular concept of “race”. People’s conceptions of themselves are fuzzy and complicated; these measures are approximations.

Basics of government race/ethnicity classification

The US government asks questions about people’s race and ethnicity. These categories are social constructs, which is a fancy way of pointing out that they are based on people’s own views of themselves (influenced by how we think that other people think of us…). Currently the standard classification asks people separately about their “race” and “ethnicity” where people can pick labels from each category in any combination.

The “race” categories include: “White,” “African-American,” “American Indian,” “Asian,” and others. Then the supplemental race categories offer more detail.

These are a peculiar combination of very general (well over 40% of the world’s population is “Asian”) and very specific (“American Indian”), reflecting a particular history of popular attitudes in the US. Only in the 2000 Census did they start to allow people to classify themselves as more than one race. If you were to go back to historical US Censuses from more than a century ago, you would find that the category “race” included separate entries for Irish and French. Stephen Jay Gould has a fascinating book, The Mismeasure of Man, discussing how early scientific classifications of humans tried to “prove” which nationalities/races/groups were the smartest. Ta-Nehisi Coates notes, “racism invented race in America.”

Note that “Hispanic” is not a “race” but rather an ethnicity (which includes various other labels such as Spanish, Latino, etc.). So a respondent could choose “Hispanic” along with any race category – some choose “White,” some choose “African-American,” and some combine it with any of the other complicated racial categories.

If you wanted to create a variable for those who report themselves as African-American and Hispanic, you’d use the expression (AfAm == 1) & (Hispanic == 1); sometimes stats report for non-Hispanic whites so (white == 1) & (Hispanic != 1). You can create your own classifications depending on what questions you’re investigating. This data includes items on birthplace and ancestry (more detail on relatives!).
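A sketch of how that might look in code (the object names here are made up just for illustration):

AfAm_Hisp_dummy <- (acs2017_ny$AfAm == 1) & (acs2017_ny$Hispanic == 1)  # African-American and Hispanic
white_nonHisp <- (acs2017_ny$white == 1) & (acs2017_ny$Hispanic != 1)   # non-Hispanic white
summary(white_nonHisp)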

The Census Bureau gives more information on its website.

All of these racial categories make some people uneasy: is the government encouraging racism by recognizing these classifications? Some other governments choose not to collect race data. But that doesn’t mean that there are no differences, only that the government doesn’t choose to measure any of these differences. In the US, government agencies such as the Census and BLS don’t generally collect data on religion.

Re-Coding complicated variables from initial data

If we want more combinations of variables then we create those. Usually a statistical analysis spends a lot of time doing this sort of housekeeping - dull but necessary. It has a variety of names: data carpentry, data munging…

Educational attainment is also classified with complicated codes in this data: the original data has code 63 to mean high school diploma, 64 for a GED, 65 for less than a year of college, etc. I have transformed them into a series of dummy variables, zero/one variables for whether a person has no high school diploma, just a high school diploma, some college, a bachelor’s degree, or an advanced degree. As with race, there is no rule that you must always do it thus. Your own analysis might be different. (Is it valid to lump together everybody who lacks a high school diploma, no matter whether they completed just first grade or up to 12th?)
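A tiny, self-contained sketch of how a recode like that might work (this is not the exact code used to build the provided dummies, and the little vector of codes here is made up for illustration):

educ_code_example <- c(63, 64, 65, 81, 101)                    # hypothetical detailed codes
educ_hs_sketch <- as.numeric(educ_code_example %in% c(63, 64)) # 1 if HS diploma or GED, 0 otherwise
educ_hs_sketch                                                 # gives 1 1 0 0 0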

That’s the whole point of learning to do the data work for yourself: you can see all of the little decisions that go into creating a conclusion. Some conclusions might be fragile so a tiny decision about coding could change everything; other conclusions are robust to deviations. You must find out.

De-bugging

Without a doubt, programming is tough. In R or with any other program, it is frustrating and complicated and difficult to do it the first few times. Some days it feels like a continuous battle just to do the simplest thing! Keep going despite that; keep working on it.

Your study group will be very helpful of course.

Often a Google search of the error message helps. If you’ve isolated the error and read the help documentation on that command, then you’re on your way to solving the problem on your own.

If you have troubles that you can’t solve, email me for help. But try to narrow down your question: if you run 20 lines of code that produce an error, is there a way to reproduce the error after just 5 lines? What if you did the same command on much simpler data, would it still cause an error? Sending emails like “I have a problem with errors” might be cathartic but is not actually useful to anyone. If you email me with the minimal code that recreates the error, along with the text of the error and/or a screenshot, then that will help more.

Do it

The first homework assignment asks you to start working on these questions. Begin by running the code that I give here, just to see if you can replicate my results. Then start asking more questions about the data - there is so much info there! Immigration/ancestry status, how they commute, health insurance, poverty, income, owner or renter (and how much they pay!), all sorts of info. Have some fun.