Homework #3 Solutions

Kevin R Foster, CCNY

1. What are the names of the people in your study group?

2. Using the PUMS data, consider some statistical tests:

a. What is the average age of people in Brooklyn? In Queens? Is there a statistically significant difference?

b. Create confidence intervals for each, as well as the difference. Explain.

for all people

mean(Age[in_Queens == 1])

mean(Age[in_Brooklyn == 1])

sd(Age[in_Queens == 1])

sd(Age[in_Brooklyn == 1])

length(Age[in_Queens == 1])

length(Age[in_Brooklyn == 1])

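(40.43508 - 37.89583) / sqrt((22.91347^2/21325) + (22.98986^2/24614)) = 11.82734

pnorm(11.82734) = 1, so the area in the tails is about zero: the p-value is essentially 0 and the difference is statistically significant

95% CI for the difference: use 1.96 * sqrt((22.91347^2/21325) + (22.98986^2/24614)) = .42

so the difference is 40.43508 - 37.89583 = 2.54 and the CI is 2.54 +/- .42 = (2.12, 2.96)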
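The same arithmetic can be scripted from the summary statistics, which also gives the separate confidence intervals asked for in part b. This is a minimal sketch: the variable names are made up here, and assigning the first set of numbers to Queens and the second to Brooklyn is an assumption based on the order of the commands above.

```r
# Summary statistics copied from the output above.
# Assumption: the first set of numbers is Queens, the second is Brooklyn.
m_q <- 40.43508; s_q <- 22.91347; n_q <- 21325
m_b <- 37.89583; s_b <- 22.98986; n_b <- 24614

# Standard error of each mean and of the difference in means
se_q <- s_q / sqrt(n_q)
se_b <- s_b / sqrt(n_b)
se_diff <- sqrt(s_q^2 / n_q + s_b^2 / n_b)

m_diff <- m_q - m_b
t_stat <- m_diff / se_diff
p_val <- 2 * (1 - pnorm(abs(t_stat)))  # two-sided, normal approximation

# 95% confidence intervals, each centered on its own estimate
ci_q <- m_q + c(-1, 1) * 1.96 * se_q
ci_b <- m_b + c(-1, 1) * 1.96 * se_b
ci_diff <- m_diff + c(-1, 1) * 1.96 * se_diff

print(round(c(t_stat = t_stat, p_val = p_val), 4))
print(round(rbind(ci_q, ci_b, ci_diff), 2))
```

The interval for the difference is centered on m_diff (the difference in means), not on the test statistic.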

c. Did you snip off the top-coded people? Re-do the test without those people. How does the p-value change?

easy to see top-coding with hist(Age[Age > 70])

mean(Age[(in_Queens == 1) & (Age < 90)])

mean(Age[(in_Brooklyn == 1) & (Age < 90)])

sd(Age[(in_Queens == 1) & (Age < 90)])

sd(Age[(in_Brooklyn == 1) & (Age < 90)])

length(Age[(in_Queens == 1) & (Age < 90)])

length(Age[(in_Brooklyn == 1) & (Age < 90)])

you can do the other calcs (t-stat, p-value, CI) the same way as in parts a and b
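One way to avoid repeating the hand calculation is a small helper that takes the two age vectors directly. The function name diff_means and the example vectors below are hypothetical and the ages are made up (not PUMS data); with the real data the call would be along the lines of diff_means(Age[(in_Queens == 1) & (Age < 90)], Age[(in_Brooklyn == 1) & (Age < 90)]).

```r
# Hypothetical helper: difference-in-means test for two numeric vectors,
# using the same normal approximation as the hand calculation.
diff_means <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  d <- mean(x) - mean(y)
  z <- d / se
  list(diff = d, se = se, z = z,
       p_value = 2 * pnorm(-abs(z)),
       ci95 = d + c(-1, 1) * 1.96 * se)
}

# Made-up ages for illustration only (not PUMS data)
set.seed(1)
age_a <- sample(0:95, 500, replace = TRUE)
age_b <- sample(0:95, 500, replace = TRUE)

res_all <- diff_means(age_a, age_b)                            # everyone
res_trim <- diff_means(age_a[age_a < 90], age_b[age_b < 90])   # top-coded ages dropped

print(unlist(res_all))
print(unlist(res_trim))
```

The built-in t.test(x, y) computes the same statistic but takes its p-value from a t distribution; with samples this large the two are nearly identical.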

d. Now suppose you normalized all of the ages to the [0,1] interval, as with this function,

norm_varb <- function(X_in) {
  (X_in - min(X_in, na.rm = TRUE)) / abs(max(X_in, na.rm = TRUE) - min(X_in, na.rm = TRUE))
}

Is there a statistically significant difference? How does the p-value change? Explain how you dealt with the top-coding.

e.g. mean(norm_varb(Age[(in_Queens == 1) & (Age < 90)]))

mean(norm_varb(Age[(in_Brooklyn == 1) & (Age < 90)]))

then work out the rest as before.
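A quick check on what to expect: norm_varb is a linear rescaling, so if the same rescaling is applied to both groups, the difference in means and its standard error shrink by the same factor and the test statistic (and p-value) should not change. The sketch below illustrates this with made-up ages (not PUMS data); z_stat, age_demo and in_a are hypothetical names. Note that calling norm_varb separately on each borough's subset uses that subset's own min and max, which matches the pooled rescaling only if both subsets span the same age range.

```r
norm_varb <- function(X_in) {
  (X_in - min(X_in, na.rm = TRUE)) / abs(max(X_in, na.rm = TRUE) - min(X_in, na.rm = TRUE))
}

# Test statistic for a difference in means (normal approximation)
z_stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

# Made-up ages and group flag, for illustration only (not PUMS data)
set.seed(2)
age_demo <- sample(0:89, 1000, replace = TRUE)
in_a <- rep(c(1, 0), each = 500)

# Normalize once over the pooled sample, then split by group
age_norm <- norm_varb(age_demo)

z_raw <- z_stat(age_demo[in_a == 1], age_demo[in_a == 0])
z_norm <- z_stat(age_norm[in_a == 1], age_norm[in_a == 0])

print(c(raw = z_raw, normalized = z_norm))
```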

e. Based on your knowledge of those boroughs, can you explain the results? Can you break out the differences if you used age ranges? What are the fractions of children in each borough? Older people? Are these statistically significant?

f. Going more granular, can you look at all of these differences by neighborhood within each borough? At what point does this get into p-hacking?

g. What would be a good way to show all of these differences graphically?

Answers will vary

3. I used the PUMS data to look at wages and commute type, getting this table for people in the City: (you can answer parts a-c without R)

Wage	bus	car	subway
Wage below $25,000	1501	2394	3704
Wage above $75,000	385	1825	2194

These were from:

dat_NYC <- subset(dat_pums_NY, (dat_pums_NY$in_NYC == 1)&(dat_pums_NY$Age >= 25)&(dat_pums_NY$Age <= 55))

m_bronx <- mean(dat_NYC$Age[dat_NYC$in_Bronx == 1])

m_brooklyn <- mean(dat_NYC$Age[dat_NYC$in_Brooklyn == 1])

s_bronx <- sd(dat_NYC$Age[dat_NYC$in_Bronx == 1])

s_brooklyn <- sd(dat_NYC$Age[dat_NYC$in_Brooklyn == 1])

l_bronx <- length(dat_NYC$Age[dat_NYC$in_Bronx == 1])

l_brooklyn <- length(dat_NYC$Age[dat_NYC$in_Brooklyn == 1])

s_diff <- sqrt(s_bronx^2/l_bronx + s_brooklyn^2/l_brooklyn)

m_diff <- m_bronx - m_brooklyn

m_diff/s_diff

a. Given that someone takes the bus to work, what is the probability that they’re making wages above $75,000?

385/(1501+385) = .20

b. Given that someone takes the subway to work, what is the probability that they make wages below $25,000?

3704/(3704+2194) = .63

c. Given that someone has wage above $75,000, what is the probability that they drive a car to work?

1825/(385+1825+2194) = .41
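Parts a to c can also be read off all at once by entering the table as a matrix and using prop.table. A minimal sketch follows; the object and dimension names are made up here. Since the table only includes the two wage groups shown, these are probabilities conditional on being in one of those two groups.

```r
# Counts from the table above: rows are wage groups, columns are commute types
commute <- matrix(c(1501, 2394, 3704,
                     385, 1825, 2194),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(wage = c("below_25k", "above_75k"),
                                  mode = c("bus", "car", "subway")))

# P(wage group | commute type): each column sums to 1 (parts a and b)
p_wage_given_mode <- prop.table(commute, margin = 2)

# P(commute type | wage group): each row sums to 1 (part c)
p_mode_given_wage <- prop.table(commute, margin = 1)

print(round(p_wage_given_mode, 2))
print(round(p_mode_given_wage, 2))
```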

d. Using the PUMS data, can you narrow this further - what are the socioeconomics of bus/subway in the various boroughs? What is the wealthiest PUMA area and how do the people living there tend to commute?

e. Try the machine learning K-nearest-neighbor algorithm on the PUMS data to get another view of the commuting pattern from above. How good of a classification of commute type can you get? Explain what you believe are important variables in this classification. You might explore the “caret” package.

Answers will vary
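As a starting point for part e, here is a minimal sketch of a k-nearest-neighbor classification. It uses knn() from the class package (normally installed along with R) rather than caret, and runs on made-up data, not PUMS: the variables income, age and commute_type, the choice k = 5 and the 80/20 split are all assumptions for illustration.

```r
library(class)  # provides knn()

# Made-up data for illustration only (not PUMS): two predictors and a commute label
set.seed(3)
n <- 300
commute_type <- factor(sample(c("bus", "car", "subway"), n, replace = TRUE))
income <- rnorm(n, mean = ifelse(commute_type == "car", 70, 45), sd = 15)
age <- runif(n, 25, 55)

# Put predictors on a common scale so neither dominates the distance
X <- scale(cbind(income, age))

# Hold out 20% of rows as a test set
test_rows <- sample(n, size = n / 5)
pred <- knn(train = X[-test_rows, ], test = X[test_rows, ],
            cl = commute_type[-test_rows], k = 5)

# Share of test rows classified correctly, and the confusion table
accuracy <- mean(pred == commute_type[test_rows])
print(accuracy)
print(table(predicted = pred, actual = commute_type[test_rows]))
```

With the real data, the predictors would be whichever PUMS variables you believe matter for commute type, and it is worth trying several values of k.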


Suggestions for solutions

Econ B2000, MA Econometrics
