Econ B2000, MA Econometrics
Kevin R Foster, the Colin Powell School at the City College of New York, CUNY
Fall 2019
Using PUMS NY data, consider how outcomes of interest vary with college major choice.
Form a group of 3 (again, invite someone new). Groups should prepare a 4-min presentation by one of the group members about their experiment process and results. You get 45 min to prepare.
This time, we reverse the order of steps from the previous exercise: first look at some of the data, then step back and think about it. (I am not necessarily stating which order is best, just having you do meta-experiments on the best way to do experiments.)
Load the data into RStudio and find some means for your choice of subgroups. For now, don’t get lost in looking at every major; instead pick a couple of majors to compare (for example, Econ & Psych, but try to pick a different pair).
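As a starting point, group means can be computed along these lines. This is only a sketch on a small synthetic data frame: the column names (major, wage, female) are placeholders, not the actual PUMS variable names, so substitute the names from the data you loaded.

```r
# Synthetic stand-in for the PUMS extract; all column names are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(
  major  = sample(c("Economics", "Psychology"), n, replace = TRUE),
  wage   = round(runif(n, 20000, 120000)),
  female = rbinom(n, 1, 0.5)
)

# Mean of one outcome by major
mean_wage_by_major <- tapply(dat$wage, dat$major, mean, na.rm = TRUE)
print(mean_wage_by_major)

# Several means at once: average wage and share female, by major
summary_by_major <- aggregate(cbind(wage, female) ~ major, data = dat, FUN = mean)
print(summary_by_major)
```

The mean of a 0/1 variable such as female is the share of that group, which is a quick way to see who chooses each major.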
What outcome might you consider? Wages? Of the individual or the family? Likelihood of working/unemployment? Of working fulltime and full year? Having health insurance? Owning a home? Housing costs as a certain fraction of income? Commute time? Note that some of those apply to the individual and some to the whole family.
How do those two groups differ in who chooses each major? Are there differences in gender, race/ethnicity, ancestry, immigration status, or other characteristics?
After browsing around to look at various differences, pick a few that seem particularly interesting/relevant to you and focus there.
Now consider the experiment protocol after peeking at some of the data. You’re implicitly comparing a particular sample, of those people who finished a 4-year degree and reported information. What additional restrictions might you impose? Should you include retired people? People not in the labor force?
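One way to impose such restrictions is to build a logical vector and use it to subset the data. The sketch below uses a tiny made-up data frame; the column names (age, in_labor_force, wage) and the age cutoffs are illustrative assumptions, not a recommendation.

```r
# Made-up data; column names and cutoffs are hypothetical.
dat <- data.frame(
  age            = c(24, 35, 47, 52, 66, 71),
  in_labor_force = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  wage           = c(30000, 62000, 0, 85000, 0, 0)
)

# Keep people aged 25 to 55 who are in the labor force
restrict <- (dat$age >= 25) & (dat$age <= 55) & dat$in_labor_force
dat_use <- dat[restrict, ]

# Always check how many observations each restriction costs you
cat("Rows before:", nrow(dat), " Rows after:", nrow(dat_use), "\n")
```

Keeping the restriction as a named vector makes it easy to report, and to change, exactly which sample you analyzed.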
Have you dealt with all of the variable coding details such as top-coding?
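A quick check for top-coding is to see how many observations sit exactly at the maximum value. The sketch below simulates this with an invented cap of 200000; that cap is purely an assumption for illustration, so consult the PUMS documentation for how each variable is actually coded.

```r
# Simulated wages with an artificial cap; the cap value is hypothetical.
set.seed(2)
true_wage <- rlnorm(500, meanlog = 11, sdlog = 0.8)
topcode   <- 200000
wage      <- pmin(true_wage, topcode)   # what a top-coded variable looks like

# A pile-up of observations exactly at the maximum suggests top-coding
n_at_max     <- sum(wage == max(wage))
share_at_max <- mean(wage == max(wage))
cat("Observations at the maximum:", n_at_max, "\n")
cat("Share at the maximum:", share_at_max, "\n")

# Compare how the mean and the median respond to the cap
cat("Mean, capped vs uncapped:", mean(wage), mean(true_wage), "\n")
cat("Median, capped vs uncapped:", median(wage), median(true_wage), "\n")
```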
Looking at subgroups, what is the size of the difference in outcome? What is the standard error of that difference measure? Using your stats knowledge, how confident are you that the difference is actually there and not an artefact of sampling?
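The difference in means and its standard error can be computed by hand and compared with R's built-in t.test. This sketch uses simulated wages for two hypothetical groups; it assumes independent samples and uses the unequal-variance (Welch) formula, which is the t.test default.

```r
# Simulated outcomes for two hypothetical majors
set.seed(3)
wage_a <- rnorm(120, mean = 70000, sd = 25000)
wage_b <- rnorm(150, mean = 62000, sd = 22000)

# Difference in means and its standard error (unequal variances)
diff_means <- mean(wage_a) - mean(wage_b)
se_diff    <- sqrt(var(wage_a) / length(wage_a) + var(wage_b) / length(wage_b))
t_stat     <- diff_means / se_diff

# Built-in Welch two-sample t-test for comparison
tt <- t.test(wage_a, wage_b)

# 95% confidence interval by hand, using the Welch degrees of freedom
ci_hand <- diff_means + c(-1, 1) * as.numeric(qt(0.975, tt$parameter)) * se_diff

cat("Difference:", diff_means, " SE:", se_diff, " t:", t_stat, "\n")
cat("95% CI by hand:", ci_hand, "\n")
print(tt)
```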
Look at the crosstabs and compute the marginal probabilities. What do they tell you? To check if you did it right, compute some of the probabilities using Bayes’ Theorem. Are the categories in your crosstab mutually exclusive and exhaustive?
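The crosstab, its marginal and conditional probabilities, and a Bayes' Theorem check can be sketched as follows. The ten observations and the variable names (major, insured) are invented for illustration.

```r
# Ten made-up observations; variable names are hypothetical.
major   <- c("Econ", "Econ", "Econ", "Psych", "Psych",
             "Psych", "Psych", "Econ", "Psych", "Econ")
insured <- c("yes", "yes", "no", "yes", "no",
             "yes", "yes", "yes", "no", "yes")

tab   <- table(major, insured)   # counts
joint <- prop.table(tab)         # joint probabilities, sum to 1

p_major   <- margin.table(joint, 1)   # marginal P(major)
p_insured <- margin.table(joint, 2)   # marginal P(insured)

p_ins_given_major <- prop.table(tab, 1)   # each row sums to 1
p_major_given_ins <- prop.table(tab, 2)   # each column sums to 1

# Bayes: P(Econ | yes) = P(yes | Econ) * P(Econ) / P(yes)
bayes <- p_ins_given_major["Econ", "yes"] * p_major["Econ"] / p_insured["yes"]

print(tab)
cat("P(Econ | yes) from the table:", p_major_given_ins["Econ", "yes"], "\n")
cat("P(Econ | yes) from Bayes:    ", bayes, "\n")

# Every observation should land in exactly one cell
cat("All observations counted:", sum(tab) == length(major), "\n")
```

If the last line prints FALSE on real data, some observations have missing values and were dropped from the table, so the categories are not exhaustive.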
As an exercise, normalize the data to the [0,1] interval, as with this function:
norm_varb <- function(X_in) { (X_in-min(X_in,na.rm = TRUE))/abs(max(X_in,na.rm = TRUE)-min(X_in, na.rm =TRUE)) }
How do the p-value and other statistical tests change (if at all) once the data are normalized?
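One way to investigate is to run the same test on the raw and the normalized outcome and compare the output side by side. The sketch below does this on simulated data with a hypothetical group label; it assumes the normalization is applied to the whole outcome vector before splitting into groups.

```r
# Normalization function from the exercise
norm_varb <- function(X_in) {
  (X_in - min(X_in, na.rm = TRUE)) /
    abs(max(X_in, na.rm = TRUE) - min(X_in, na.rm = TRUE))
}

# Simulated outcome for two hypothetical groups
set.seed(4)
grp <- rep(c("A", "B"), each = 100)
y   <- c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 54, sd = 12))

# Normalize the pooled outcome, then test raw and normalized versions
y_norm <- norm_varb(y)
t_raw  <- t.test(y ~ grp)
t_norm <- t.test(y_norm ~ grp)

cat("Range after normalizing:", range(y_norm), "\n")
cat("t-statistics (raw, normalized):",
    unname(t_raw$statistic), unname(t_norm$statistic), "\n")
cat("p-values (raw, normalized):", t_raw$p.value, t_norm$p.value, "\n")
cat("Differences in means (raw, normalized):",
    unname(diff(t_raw$estimate)), unname(diff(t_norm$estimate)), "\n")
```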
What other factors could explain the difference in outcome? Among your list of differences in who chooses the major, are there some potential confounders?
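One common way to probe a potential confounder is to compare the raw difference with the difference after controlling for that factor in a regression. The sketch below is entirely simulated: the variables (econ, female, wage) and the numbers generating them are invented, and it illustrates the mechanics only, not any finding about real majors.

```r
# Simulated data in which major choice is correlated with another trait.
# All variables and magnitudes here are hypothetical.
set.seed(5)
n      <- 500
female <- rbinom(n, 1, 0.5)
econ   <- rbinom(n, 1, ifelse(female == 1, 0.3, 0.6))
wage   <- 50000 + 5000 * econ - 8000 * female + rnorm(n, mean = 0, sd = 15000)
dat    <- data.frame(wage, econ, female)

# Raw comparison: coefficient on econ equals the simple difference in means
m_raw <- lm(wage ~ econ, data = dat)

# Same comparison, holding the potential confounder fixed
m_ctl <- lm(wage ~ econ + female, data = dat)

cat("Raw difference:       ", coef(m_raw)["econ"], "\n")
cat("Controlled difference:", coef(m_ctl)["econ"], "\n")
```

If the two numbers differ noticeably, part of the raw gap reflects who chooses the major rather than the major itself.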
How can you best present this evidence: tables or graphs of different types? (Hint: probably not raw R output, blech.) McCloskey reminds us that virtuosity can substitute for virtue in presenting an argument.
What additional evidence would you look at? What conclusions could you draw from that? How confident would you be in those conclusions? What other conclusions could be drawn from that same evidence? If you were trying to persuade someone with the opposite view, what evidence would be required?