1. What are the names of the people in your study group?
2. Consider the PUMS data for people in NY that we’ve been using in class. For now, restrict attention to just working people (explain how you might define that).
a. Do a statistical test of the difference in average age between working people in the Bronx vs working people in Brooklyn. What is the 95% confidence interval for the difference in means?
For all people (rather than only working people):
mean(Age[in_Bronx == 1])
mean(Age[in_Brooklyn == 1])
sd(Age[in_Bronx == 1])
sd(Age[in_Brooklyn == 1])
length(Age[in_Bronx == 1])
length(Age[in_Brooklyn == 1])
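Test statistic: (36.78246 - 37.89583) / sqrt((23.0726^2/10959) + (22.98986^2/24614)) = -4.207
pnorm(-4.2) < .01 (and the two-sided p-value, 2 * pnorm(-4.2), is also < .01), so the difference is statistically significant.
For the 95% CI, use 1.96 * sqrt((23.0726^2/10959) + (22.98986^2/24614)) = .519
So the difference (Bronx minus Brooklyn) is -1.11 +/- .519 = (-1.63, -.595)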
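The arithmetic above can be reproduced in a few lines of R. This is a minimal sketch that starts from the summary statistics quoted in the answer (so it describes all people, not only workers); the variable names are illustrative and not from the course code.

```r
# Summary statistics copied from the answer above (first group: Bronx, second: Brooklyn)
mean_bronx <- 36.78246; sd_bronx <- 23.0726;  n_bronx <- 10959
mean_bklyn <- 37.89583; sd_bklyn <- 22.98986; n_bklyn <- 24614

# Difference in means and its standard error (unequal variances)
diff_means <- mean_bronx - mean_bklyn
se_diff <- sqrt(sd_bronx^2 / n_bronx + sd_bklyn^2 / n_bklyn)

# Test statistic and two-sided p-value from the normal approximation
z_stat <- diff_means / se_diff
p_two_sided <- 2 * pnorm(-abs(z_stat))

# 95% confidence interval for the difference in means
ci_95 <- diff_means + c(-1, 1) * qnorm(0.975) * se_diff

print(round(c(z = z_stat, lower = ci_95[1], upper = ci_95[2]), 3))
```

With the raw data attached, t.test(Age[in_Bronx == 1], Age[in_Brooklyn == 1]) should give a nearly identical interval, since its default is the unequal-variance test and the samples are large.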
b. What if you were using the Age data but regularized so that the min is zero and max is one [recall my function, (X_in - min(X_in, na.rm = TRUE))/abs(max(X_in, na.rm = TRUE) - min(X_in, na.rm = TRUE)) ]. Would the statistical test come out the same? Why or why not?
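This depends on the details of how the regularization was performed. If both boroughs are rescaled with the same function (the same min and max, e.g. computed over the pooled data), the transformation is linear, so the difference in means and its standard error shrink by the same factor and the test statistic is unchanged; only the units of the confidence interval change. However, if the function differs between boroughs (e.g. each borough uses its own min or max), then this is not necessarily the case.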
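A quick way to check this claim is to run the test on raw and rescaled values. The sketch below uses simulated ages, not the PUMS data, and a hypothetical helper rescale01 that mirrors the formula in the question but lets the min and max be supplied explicitly.

```r
set.seed(1)
# Simulated ages for two groups with deliberately different ranges (not PUMS data)
age_a <- runif(200, min = 16, max = 70)
age_b <- runif(300, min = 18, max = 90)

# Hypothetical helper: same formula as in the question, with optional shared bounds
rescale01 <- function(X_in,
                      lo = min(X_in, na.rm = TRUE),
                      hi = max(X_in, na.rm = TRUE)) {
  (X_in - lo) / abs(hi - lo)
}

# Shared bounds computed over the pooled data
lo_all <- min(c(age_a, age_b))
hi_all <- max(c(age_a, age_b))

t_raw      <- unname(t.test(age_a, age_b)$statistic)
t_pooled   <- unname(t.test(rescale01(age_a, lo_all, hi_all),
                            rescale01(age_b, lo_all, hi_all))$statistic)
t_separate <- unname(t.test(rescale01(age_a), rescale01(age_b))$statistic)

# Same function for both groups: statistic unchanged.
# Separate min/max per group: statistic generally changes.
print(c(raw = t_raw, pooled = t_pooled, separate = t_separate))
```

Rescaling each group with its own min and max erases the difference in ranges between the groups, which is why the third statistic moves.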
3. I used the PUMS data to look at wages and commute type, getting this table for people in the City: (you can answer parts a-c without R)
| | Bus | Car | Subway |
|---|---|---|---|
| Wage below $25,000 | 1501 | 2394 | 3704 |
| Wage above $75,000 | 385 | 1825 | 2194 |
a. Given that someone takes the bus to work, what is the probability that they’re making wages above $75,000?
385/(1501+385) = .20
b. Given that someone takes the subway to work, what is the probability that they make wages below $25,000?
3704/(3704+2194) = .63
c. Given that someone has wage above $75,000, what is the probability that they drive a car to work?
1825/(385+1825+2194) = .41
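Parts a-c can also be checked in R by entering the table as a matrix and normalizing by column or by row. This is a small sketch; the object and dimension names are illustrative.

```r
# Counts from the table above: rows are wage groups, columns are commute modes
commute <- matrix(c(1501, 2394, 3704,
                     385, 1825, 2194),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(wage = c("below_25k", "above_75k"),
                                  mode = c("bus", "car", "subway")))

# P(wage group | commute mode): each column sums to 1
p_wage_given_mode <- prop.table(commute, margin = 2)
# P(commute mode | wage group): each row sums to 1
p_mode_given_wage <- prop.table(commute, margin = 1)

p_a <- p_wage_given_mode["above_75k", "bus"]     # part a
p_b <- p_wage_given_mode["below_25k", "subway"]  # part b
p_c <- p_mode_given_wage["above_75k", "car"]     # part c

print(round(c(a = p_a, b = p_b, c = p_c), 2))
```

The choice of margin is the whole question: conditioning on commute mode divides by column totals, while conditioning on wage divides by row totals.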
d. Using the PUMS data, can you narrow this further - what are the socioeconomics of bus/subway in the various boroughs? What is the wealthiest PUMA area and how do the people living there tend to commute?
Answers will vary
4. Try the machine learning K-nearest-neighbor algorithm on the PUMS data to get another view of the “interesting pattern” from above. (As usual, step one is to replicate my code, then gradually morph it into your own.) How good of a classification can you get?
Answers will vary
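As a starting point, here is a minimal sketch of the k-nearest-neighbor idea in base R. It is not the course code: the data are simulated stand-ins for two rescaled predictors and a two-class label, and knn_predict is a hypothetical hand-rolled helper (a packaged alternative is knn() in the class package).

```r
set.seed(42)
n <- 200
# Simulated stand-ins for two rescaled predictors and a two-class outcome (not PUMS data)
X <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
           matrix(rnorm(2 * n, mean = 4), ncol = 2))
y <- rep(c("A", "B"), each = n)

# Hold out 20% of the rows to measure classification accuracy
train_idx <- sample(nrow(X), size = 320)
X_train <- X[train_idx, ];  y_train <- y[train_idx]
X_test  <- X[-train_idx, ]; y_test  <- y[-train_idx]

# Hypothetical helper: classify each test point by majority vote of its k nearest neighbors
knn_predict <- function(X_train, y_train, X_test, k = 5) {
  apply(X_test, 1, function(pt) {
    d <- sqrt(rowSums(sweep(X_train, 2, pt)^2))  # Euclidean distance to every training row
    nearest <- y_train[order(d)[1:k]]            # labels of the k closest rows
    names(which.max(table(nearest)))             # most common label among them
  })
}

pred <- knn_predict(X_train, y_train, X_test, k = 5)
accuracy <- mean(pred == y_test)
print(accuracy)
```

The simulated classes are well separated, so accuracy here is high by construction; on the PUMS data it will depend on which predictors are used, how they are rescaled, and the choice of k.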