Econ B2000, MA Econometrics

Kevin R Foster, the Colin Powell School at the City College of New York, CUNY

Fall 2019

For this lab, we will shake things up – a different dataset! We’ll estimate a logit model to try to predict if a person has health insurance.

Form a group of 3. Groups should prepare a 4-min presentation by one of the group members about their experiment process and results. You get 45 min to prepare.

Download the NHIS (National Health Interview Survey) data and load it into R. If you do a summary of the data, you will see that it includes people of all ages. We want to understand what factors make an adult more likely to have health insurance. Some of the variable names are a bit mystifying: disabl_limit codes 0/1 for whether the person has any limitations from a disability; RRP codes relationship to the person answering the question; HHX, FMX, and FPX are just ID numbers; SCHIP refers to the State Children’s Health Insurance Program; sptn_medical is a factor telling how much the person spent on medical bills. This is only a tiny fraction of the information in that survey; there are more than 1000 different variables.

The person’s earnings need a bit of recoding:

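# Treat the numeric earnings codes as categories, then attach readable labels in code order.
data_use1$earn_lastyr <- as.factor(data_use1$ERNYR_P)
# The last three codes are labeled NA, so those observations become missing.
levels(data_use1$earn_lastyr) <- c("0","$01-$4999","$5000-$9999","$10000-$14999","$15000-$19999","$20000-$24999","$25000-$34999","$35000-$44999","$45000-$54999","$55000-$64999","$65000-$74999","$75000 and over",NA,NA,NA)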
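
It is worth checking that a recode like this did what you expected. Here is a minimal sketch on a made-up vector (toy_codes and its values are invented for illustration, not taken from the NHIS data) showing how assigning NA as a level label turns those observations into missing values:

```r
# Toy stand-in for an earnings code: 1 and 2 are brackets, 97 plays the role of a non-response code
toy_codes <- c(1, 2, 97, 1, 2, 2)
toy_earn <- as.factor(toy_codes)

# One label per existing level, in sorted order; a label of NA drops that level
levels(toy_earn) <- c("$01-$4999", "$5000-$9999", NA)

# useNA = "always" adds a column counting the observations that became missing
table(toy_earn, useNA = "always")
```

Running the same table() call on the real recoded variable is a quick way to see whether the labels line up with the codes.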

First decide on how you’re defining your subgroup (all adults? Within a certain age range? Other?) then find some basic statistics – what fraction are not covered? (Later go back to look at simple stats for subgroups to see if there are sharp differences.)
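
As a rough sketch of that first step, the code below builds a small simulated dataset (the data frame toy, its size, and the 15% uncovered rate are all made up) and picks ages 25 to 64 as one possible subgroup. That age range is only an assumption for illustration; your group should choose and defend its own.

```r
set.seed(1)
# Simulated stand-in for the survey data; values are invented
toy <- data.frame(AGE_P = sample(0:85, 500, replace = TRUE),
                  female = rbinom(500, 1, 0.5),
                  NOTCOV = rbinom(500, 1, 0.15))

# One possible subgroup: working-age adults (an assumed choice, not a requirement)
toy_adults <- subset(toy, AGE_P >= 25 & AGE_P <= 64)

# NOTCOV is 0/1, so its mean is the fraction not covered
frac_notcov <- mean(toy_adults$NOTCOV, na.rm = TRUE)

# Same statistic split by a second variable, to look for sharp differences
frac_by_sex <- tapply(toy_adults$NOTCOV, toy_adults$female, mean, na.rm = TRUE)

frac_notcov
frac_by_sex
```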

Run a logit regression. An example is below. What are the other variables you are using in your regression? Do they have the expected signs and patterns of significance? Explain if there is a plausible causal link from X variables to Y and not the reverse. Explain your results, giving details about the estimation, some predicted values, and providing any relevant graphics. Impress.

model_logit1 <- glm(NOTCOV ~ AGE_P + I(AGE_P^2) + female + AfAm + Asian + RaceOther  
                    + Hispanic + educ_hs + educ_smcoll + educ_as + educ_bach + educ_adv 
                    + married + widowed + divorc_sep + veteran_stat + REGION + region_born,
                    family = binomial, data = data_use1)
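
To get predicted values out of a logit, one common approach is predict() with type = "response". The sketch below uses simulated data with a made-up data-generating process (the coefficients in xb are invented), and new_people is a hypothetical pair of individuals who differ only in educ_bach:

```r
set.seed(2)
n <- 2000
# Simulated regressors; names mirror the lab's variables but the values are invented
toy <- data.frame(AGE_P = sample(25:64, n, replace = TRUE),
                  female = rbinom(n, 1, 0.5),
                  educ_bach = rbinom(n, 1, 0.3))

# Made-up index: older, female, and college-educated people are less likely to be uncovered
xb <- -0.5 - 0.02 * toy$AGE_P - 0.2 * toy$female - 1.0 * toy$educ_bach
toy$NOTCOV <- rbinom(n, 1, plogis(xb))

toy_logit <- glm(NOTCOV ~ AGE_P + I(AGE_P^2) + female + educ_bach,
                 family = binomial, data = toy)
summary(toy_logit)

# Two hypothetical 30-year-old women, without and with a bachelor's degree
new_people <- data.frame(AGE_P = c(30, 30), female = c(1, 1), educ_bach = c(0, 1))

# type = "response" returns probabilities; the default returns the linear index (log-odds)
pred_prob <- predict(toy_logit, newdata = new_people, type = "response")
pred_prob
```

Comparing predicted probabilities for people who differ in one characteristic is often easier to explain in a presentation than the raw logit coefficients.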

Check with a probit estimation of the same model and then OLS. Compare the estimates from those models. Are there big differences? Change up the specification to see how much estimates change. Does a cubic in age (or quartic) change other signs or significance?
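
A minimal sketch of that comparison, again on simulated data (the data frame toy and the coefficients used to generate NOTCOV are made up). Raw logit, probit, and OLS coefficients are on different scales, so the sketch also compares fitted probabilities, which share a common scale:

```r
set.seed(3)
n <- 2000
# Simulated data with an invented relationship between age, sex, and coverage
toy <- data.frame(AGE_P = sample(25:64, n, replace = TRUE),
                  female = rbinom(n, 1, 0.5))
toy$NOTCOV <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$AGE_P - 0.6 * toy$female))

m_logit  <- glm(NOTCOV ~ AGE_P + female, family = binomial, data = toy)
m_probit <- glm(NOTCOV ~ AGE_P + female, family = binomial(link = "probit"), data = toy)
m_ols    <- lm(NOTCOV ~ AGE_P + female, data = toy)

# Side-by-side coefficients: compare signs, not magnitudes, across columns
coef_table <- cbind(logit = coef(m_logit), probit = coef(m_probit), ols = coef(m_ols))
round(coef_table, 4)

# Fitted probabilities from the three models, and how closely they agree
fitted_all <- cbind(logit = fitted(m_logit), probit = fitted(m_probit), ols = fitted(m_ols))
cor(fitted_all)

# Richer specification: add squared and cubed age, then test against the simpler logit
m_logit3 <- glm(NOTCOV ~ AGE_P + I(AGE_P^2) + I(AGE_P^3) + female,
                family = binomial, data = toy)
anova(m_logit, m_logit3, test = "Chisq")
```

Logit coefficients are typically larger in magnitude than probit ones because of how the two distributions are scaled, so differing magnitudes alone are not evidence that the models disagree.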

Compare with simple summary stats (example below; the tidyverse packages offer even better options). What has changed with the more complicated analysis?

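# Collapse the education dummies into one factor: 1 = educ_nohs up to 6 = educ_adv
data_use1$educ_factor <- as.factor(data_use1$educ_nohs + 2*data_use1$educ_hs + 3*data_use1$educ_smcoll + 4*data_use1$educ_as + 5*data_use1$educ_bach + 6*data_use1$educ_adv)
library(plyr)
# Mean of NOTCOV and number of observations within each education category
# (n_categ counts every row, including any where NOTCOV is missing)
notcov_by_educ <- ddply(data_use1, .(educ_factor), summarize,
                        avg_notcov = mean(NOTCOV, na.rm = TRUE),
                        n_categ = length(NOTCOV))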
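
For comparison, here is a sketch of the same kind of grouped summary using dplyr from the tidyverse. It assumes dplyr is installed and runs on a simulated data frame (toy, with made-up values) so that it stands on its own:

```r
library(dplyr)

set.seed(4)
# Simulated stand-in: six education categories and a 0/1 coverage indicator (values invented)
toy <- data.frame(educ_factor = factor(sample(1:6, 600, replace = TRUE)),
                  NOTCOV = rbinom(600, 1, 0.15))

# dplyr::summarize is written out in full to avoid a name clash if plyr is also attached
notcov_by_educ2 <- toy %>%
  group_by(educ_factor) %>%
  dplyr::summarize(avg_notcov = mean(NOTCOV, na.rm = TRUE),
                   n_categ = n())

notcov_by_educ2
```

The result has one row per education category, which makes it easy to set next to the regression coefficients on the education dummies.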