Lecture Notes 9 Econ B2000, MA Econometrics Kevin R Foster, CCNY Fall 2011
Panel Data
A panel of data contains repeated observations of the same economic units over time. This might be a survey like the CPS, where the same person is surveyed each month to investigate changes in their labor market status. There are medical panels that have given annual exams to the same people for decades. Publicly-traded firms that file their annual reports can provide a panel of data: revenue and sales for many years at many different firms.
Sometimes data covers larger blocks such as states in the
Other data sets are just cross-sectional, like the March CPS that we're using. If we put together a series of cross-sectional samples that don't follow the same people (so we use the March 2008, 2007, and 2006 CPS samples) then we have a pooled sample. A long stream of data on a single unit is a time series (for example US Industrial Production or the daily returns on a single stock).
In panel data we want to distinguish time from unit effects. Suppose that you are analyzing sales data for a large company's many stores. You want to figure out which stores are well-managed. You know that there are macro trends: some years are good and some are rough, so you don't want to indiscriminately reward everybody in good years (when they just got lucky) and punish them in bad years (when they got unlucky). There are also location effects: a store with a good location will get more traffic and sell more, regardless. So you might consider subtracting the average sales of a particular location away from current sales, to look at deviations from its usual. After doing this for all of the stores, you could subtract off the average deviation at a particular time, too, to account for year effects (if everyone outperforms their usual sales by 10% then it might just indicate a good economy). You would be left with a store's "unusual" sales – better or worse than what would have been predicted for a given store location in that given year.
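The two subtractions described above can be sketched in a few lines of Python. This is only an illustration: the store names and sales figures are made up, and the panel is assumed to be balanced (every store observed in every year).

```python
# Hypothetical sales figures, sales[store][year]; all numbers are invented.
sales = {
    "A": {2006: 100.0, 2007: 110.0, 2008: 90.0},
    "B": {2006: 200.0, 2007: 230.0, 2008: 170.0},
    "C": {2006: 150.0, 2007: 160.0, 2008: 140.0},
}
stores = sorted(sales)
years = sorted(sales["A"])

# Step 1: subtract each store's own average (removes the location effect).
store_mean = {s: sum(sales[s].values()) / len(years) for s in stores}
dev = {s: {t: sales[s][t] - store_mean[s] for t in years} for s in stores}

# Step 2: subtract the average deviation in each year (removes the year effect).
year_mean_dev = {t: sum(dev[s][t] for s in stores) / len(stores) for t in years}
unusual = {s: {t: dev[s][t] - year_mean_dev[t] for t in years} for s in stores}

for s in stores:
    print(s, {t: round(unusual[s][t], 2) for t in years})
```

In this made-up data every store does better in 2007 and worse in 2008, but store B swings more than the year average, so only B shows positive "unusual" sales in 2007.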
A regression takes this even further to use all of our usual "prediction" variables in the list of X, and combine these with time and unit fixed effects.
Now the notation begins. Let the t-subscript index time; let j index the unit. So any observations of y and x must be at a particular date and unit; we have $y_{t,j}$ and then the k x-variables are each $x^{k}_{t,j}$ (the superscript for which of the x-variables). So the regression equation is
$$y_{t,j} = \alpha_j + \gamma_t + \beta_1 x^{1}_{t,j} + \beta_2 x^{2}_{t,j} + \dots + \beta_k x^{k}_{t,j} + \varepsilon_{t,j},$$
where $\alpha_j$ (alpha) is the fixed effect for each unit j, $\gamma_t$ (gamma) is the time effect, and then the error $\varepsilon_{t,j}$ is unique to each unit at each time.
This is actually easy to implement, even though the notation might look formidable. Just create a dummy variable for each time period and another dummy for each unit and put the whole slew of dummies into the regression.
So, to take a tiny example, suppose you have 8 store locations over 10 years, 1999-2008. You have data on sales (Y) and advertising spending (X) and want to look at the relationship between this simple X and Y. So the data look like this:
X1999,1 | X1999,2 | X1999,3 | X1999,4 | X1999,5 | X1999,6 | X1999,7 | X1999,8
X2000,1 | X2000,2 | X2000,3 | X2000,4 | X2000,5 | X2000,6 | X2000,7 | X2000,8
X2001,1 | X2001,2 | X2001,3 | X2001,4 | X2001,5 | X2001,6 | X2001,7 | X2001,8
X2002,1 | X2002,2 | X2002,3 | X2002,4 | X2002,5 | X2002,6 | X2002,7 | X2002,8
X2003,1 | X2003,2 | X2003,3 | X2003,4 | X2003,5 | X2003,6 | X2003,7 | X2003,8
X2004,1 | X2004,2 | X2004,3 | X2004,4 | X2004,5 | X2004,6 | X2004,7 | X2004,8
X2005,1 | X2005,2 | X2005,3 | X2005,4 | X2005,5 | X2005,6 | X2005,7 | X2005,8
X2006,1 | X2006,2 | X2006,3 | X2006,4 | X2006,5 | X2006,6 | X2006,7 | X2006,8
X2007,1 | X2007,2 | X2007,3 | X2007,4 | X2007,5 | X2007,6 | X2007,7 | X2007,8
X2008,1 | X2008,2 | X2008,3 | X2008,4 | X2008,5 | X2008,6 | X2008,7 | X2008,8
and similarly for the Y-variables. To do the regression, create 9 time dummy variables: D2000, D2001, D2002, D2003, D2004, D2005, D2006, D2007, and D2008. Then create 7 unit dummies, D2, D3, D4, D5, D6, D7, and D8. Then regress the Y on X and these 16 dummy variables.
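As a rough sketch of this recipe, the Python below simulates an 8-store, 10-year panel, builds the 9 time dummies and 7 unit dummies, and runs OLS. Everything about the data is an assumption made for illustration (the true slope of 2.0, the noise levels, the random seed). The small `ols` helper solves the normal equations directly so the example has no dependencies; in practice you would use your usual regression software.

```python
import random

random.seed(0)
years = list(range(1999, 2009))   # 10 years, 1999-2008
stores = list(range(1, 9))        # 8 store locations
TRUE_BETA = 2.0                   # assumed effect of advertising (X) on sales (Y)

# Simulated data: y = alpha_j + gamma_t + beta * x + noise (all values invented).
alpha = {j: random.gauss(50, 10) for j in stores}
gamma = {t: random.gauss(0, 5) for t in years}
rows, y = [], []
for t in years:
    for j in stores:
        x = random.uniform(0, 10)
        time_dummies = [1.0 if t == yr else 0.0 for yr in years[1:]]  # D2000..D2008
        unit_dummies = [1.0 if j == s else 0.0 for s in stores[1:]]   # D2..D8
        rows.append([1.0, x] + time_dummies + unit_dummies)
        y.append(alpha[j] + gamma[t] + TRUE_BETA * x + random.gauss(0, 1))

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

coef = ols(rows, y)
beta_hat = coef[1]                # coefficient on X
n_dummies = len(rows[0]) - 2      # everything except the constant and X
print(n_dummies, "dummies; estimated coefficient on X:", round(beta_hat, 3))
```

The first year (1999) and the first store are the omitted categories, which is why there are 9 and 7 dummies rather than 10 and 8.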
Then the interpretation of the coefficient on the X variable is the amount by which an increase in X, above its usual value for that unit and above the usual amount for a given year, would increase Y.
One drawback of this type of estimation is that it is not very useful for forecasting, either to figure out the sales at some new location or to predict overall sales next year, since we know neither the new location's fixed effect (the coefficient on D9, or its alpha) nor next year's dummy coefficient (on D2009, or its gamma).
We also cannot put in a variable that varies only on one dimension – for example, we can't add any other information about store location that doesn't vary over time, like its distance from the other stores or other location information. All of that variation is swept up in the firm-level fixed effect. Similarly we can't include macro data that doesn't vary across firm locations like US GDP since all of that variation is collected into the time dummies.
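A quick way to see why a time-invariant variable drops out: once each store's own average is subtracted (which is what the unit dummies effectively do), nothing is left. The sketch below uses a made-up "distance" variable for three hypothetical stores.

```python
# Hypothetical time-invariant variable: each store's distance (miles) from headquarters.
distance = {1: 3.0, 2: 12.5, 3: 7.0}
years = [2006, 2007, 2008]

# In the panel the same value is repeated for every year.
panel = {(t, j): distance[j] for t in years for j in distance}

# Subtract each store's own average over time.
store_mean = {j: sum(panel[(t, j)] for t in years) / len(years) for j in distance}
demeaned = {key: panel[key] - store_mean[key[1]] for key in panel}

print(sorted(set(demeaned.values())))  # nothing but zeros remains
```

With no variation left after the store effect is removed, the regression has nothing to estimate a coefficient from; the same logic applies to a macro variable like GDP and the time dummies.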
You can get much fancier; there is a whole econometric literature on panel data estimation methods. But simple fixed effects, put into the same OLS regression that we've become accustomed to, can actually get you far.
Binary Dependent Variable Models (Stock & Watson Chapter 9)
Clearly the differences are rather small; it is rare that we might have a serious theoretical justification for one specification rather than the other.
(Note that the logistic distribution behind the logit function given above has a standard deviation of $\pi/\sqrt{3} \approx 1.81$, so in the plots I scaled the probit by this factor.)
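To get a feel for how small the differences are, the sketch below compares the logistic cdf with a normal cdf rescaled to have the same standard deviation. The standard logistic distribution has standard deviation π/√3; the grid of points from -6 to 6 is an arbitrary choice for illustration.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    # Standard normal cdf written with the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

LOGISTIC_SD = math.pi / math.sqrt(3.0)   # standard deviation of the standard logistic

# Compare the two curves on a grid, after putting them on the same scale.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic_cdf(z) - normal_cdf(z / LOGISTIC_SD)) for z in grid)

print("scale factor:", round(LOGISTIC_SD, 4))
print("largest gap between the curves:", round(max_gap, 4))
```

The largest gap between the two predicted probabilities on this grid is only a few percentage points.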
Convergence Information
Model | Number of Iterations | Optimal Solution Found
PROBIT | 20 | No(a)
(a) Parameter estimates did not converge.
Convergence Information
Model | Number of Iterations | Optimal Solution Found
PROBIT | 26 | Yes
Since the slope (the change in probability per change in the X-variable) is always changing, the simple coefficients of the linear model cannot be interpreted as the slope, as we did in the OLS model. (Just like when we added a squared term, the interpretation of the slope got more complicated.)
Return to the picture to make this much clearer:
The slope at X1 is rather low; the slope at X2 is much steeper.
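The changing slope is easy to compute for the logit, since the derivative of the logistic function is b1 times P times (1 - P). The sketch below evaluates it at two points. The coefficients and the values chosen for X1 and X2 are hypothetical, picked only so that X1 sits in the flat tail and X2 in the steep middle.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit_slope(x, b0, b1):
    """dP/dx for P = logistic(b0 + b1*x), which equals b1 * P * (1 - P)."""
    p = logistic(b0 + b1 * x)
    return b1 * p * (1.0 - p)

B0, B1 = 0.0, 1.0     # hypothetical intercept and coefficient
X1, X2 = -3.0, 0.0    # hypothetical points: X1 in the tail, X2 in the middle

slope_x1 = logit_slope(X1, B0, B1)
slope_x2 = logit_slope(X2, B0, B1)
print("slope at X1:", round(slope_x1, 4))
print("slope at X2:", round(slope_x2, 4))
```

The same coefficient B1 gives a slope several times larger at X2 than at X1, which is why the coefficient alone does not tell you the change in probability.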
The effect of the coefficients now interacts with all of the other variables in the model: for example the effect of a person's gender on their probability of having health insurance will depend on other factors like their age and educational level. Women are generally less likely to have their own insurance than men, but how much less? Among young people with very low education, neither men nor women are very likely to be insured; among older people with very high education both are very likely insured. The biggest difference is toward the middle.
For example, very simple logit and probit estimations on the CPS 2008 dataset give the following coefficient estimates (I am suppressing notation on significance since it is not important here):
Variable | Logit | Probit
female | -0.428 | -0.263
afam | 0.220 | 0.134
asian | 0.252 | 0.153
Amindian | 0.012 | 0.007
Hispanic | -0.028 | -0.015
ed_hs | 0.987 | 0.603
ed_smcol | 1.180 | 0.724
ed_coll | 1.652 | 1.014
ed_adv | 1.927 | 1.178
marrd | 0.492 | 0.307
divwidsp | 0.875 | 0.541
union | 1.336 | 0.791
veteran | 0.088 | 0.052
immig | -0.277 | -0.166
imm2gen | -0.067 | -0.041
Intercept | -1.303 | -0.802
The probability of having health insurance varies for different socioeconomic groups. We can interpret the signs in a straightforward way: the negative coefficients on the "female" variable indicate that women are less likely to have health insurance. Surprisingly, African-Americans are more likely, along with Asians and Native Americans (although the last is not significant). Hispanics are less likely although this is also not significant.
But how large are these differences? For example, how much less likely to have health insurance are immigrants? It depends on the other variables. Intuitively, if a person is male, highly-educated, married, and unionized then he's probably insured (being an immigrant would make him only slightly less so). So the change in probability associated with immigrant status would be low. At the opposite end, a woman without even a high school diploma, who is single, might already be unlikely to be insured. Immigrant status hardly changes this. Only in the middle will there be a big effect.
We can calculate it straightforwardly, though.
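Consider, say, a non-immigrant woman with an advanced degree, whose predicted probability of having health insurance is
$$\Pr(Y=1) = \frac{1}{1+e^{-(\beta_0 + \beta_{female} + \beta_{ed\_adv})}} = \frac{1}{1+e^{-(-1.303 - 0.428 + 1.927)}}.$$
Summing the 3 relevant coefficients (the intercept, female, and an advanced degree) gives an index of 0.196 and hence a logit probability of about 0.549 (using the rounded coefficients in the table). For an otherwise-identical immigrant woman (also with an advanced degree) the probability is 0.4796, so the change in probability is about 7 percentage points.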
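The logit arithmetic for this example can be checked in a few lines of Python. The coefficients are the rounded logit values from the table above, so the last digit may differ slightly from figures computed with unrounded estimates.

```python
import math

# Rounded logit coefficients from the table above
INTERCEPT, FEMALE, ED_ADV, IMMIG = -1.303, -0.428, 1.927, -0.277

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

index_native = INTERCEPT + FEMALE + ED_ADV   # non-immigrant woman, advanced degree
index_immig = index_native + IMMIG           # otherwise-identical immigrant woman

p_native = logistic(index_native)
p_immig = logistic(index_immig)

print("index:", round(index_native, 3), "and", round(index_immig, 3))
print("probabilities:", round(p_native, 4), "and", round(p_immig, 4))
print("change in probability:", round(p_native - p_immig, 4))
```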
Comparing the probit estimates, we would just change the functional form (using the normal cdf instead of the logit function) and find a probability for a non-immigrant woman as 0.5447 and the immigrant woman to be 0.4786, with a difference of 6.6%. These estimates from the logit and probit are very close.
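The probit version only swaps the functional form. A minimal sketch, using the rounded probit coefficients from the table and writing the normal cdf with the error function from Python's standard library (so the results may differ from the text in the last digit):

```python
import math

# Rounded probit coefficients from the table above
INTERCEPT, FEMALE, ED_ADV, IMMIG = -0.802, -0.263, 1.178, -0.166

def normal_cdf(z):
    # Standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z_native = INTERCEPT + FEMALE + ED_ADV   # non-immigrant woman, advanced degree
z_immig = z_native + IMMIG               # otherwise-identical immigrant woman

probit_native = normal_cdf(z_native)
probit_immig = normal_cdf(z_immig)

print("probabilities:", round(probit_native, 4), "and", round(probit_immig, 4))
print("change in probability:", round(probit_native - probit_immig, 4))
```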
Compare the change in probabilities for a married male with an advanced degree who is a union member, who is either an immigrant or not. Now the probability of having insurance is, by the logit, 0.9206 for the non-immigrant and 0.8979 for the immigrant, a change of just 2.3%. From the probit the estimated probabilities are 0.9298 for the non-immigrant and 0.9045 for the immigrant, a change of 2.5%. This is because a married male with an advanced degree who is a union member is already highly likely to have health insurance, so the difference of being an immigrant or not makes only a small change compared with the previous example of a female with a high education (but unmarried and not in a union).
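The same calculation for the married, unionized male can be organized so that any combination of characteristics is easy to try. This is just one way to lay it out: the helper name `predicted_probability` and the dictionary structure are my own, and the coefficients are the rounded values from the table.

```python
import math

# Rounded coefficients from the table above (only the ones needed here)
COEF = {
    "logit":  {"Intercept": -1.303, "ed_adv": 1.927, "marrd": 0.492,
               "union": 1.336, "immig": -0.277},
    "probit": {"Intercept": -0.802, "ed_adv": 1.178, "marrd": 0.307,
               "union": 0.791, "immig": -0.166},
}
# Each model turns the summed index into a probability with its own function.
LINK = {
    "logit":  lambda z: 1.0 / (1.0 + math.exp(-z)),
    "probit": lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))),
}

def predicted_probability(model, traits):
    """Sum the intercept and the coefficients of the listed dummies, then apply the link."""
    index = COEF[model]["Intercept"] + sum(COEF[model][t] for t in traits)
    return LINK[model](index)

base = ["ed_adv", "marrd", "union"]   # male is the omitted category, so no 'female' term
results = {}
for model in ("logit", "probit"):
    p_native = predicted_probability(model, base)
    p_immig = predicted_probability(model, base + ["immig"])
    results[model] = (p_native, p_immig, p_native - p_immig)
    print(model, round(p_native, 4), round(p_immig, 4), round(p_native - p_immig, 4))
```

Changing the list in `base` (for example to `["ed_adv"]` plus a female term added to the dictionaries) reproduces the earlier comparison and shows how the immigrant effect grows toward the middle of the probability range.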
The details of this calculation are in an Excel spreadsheet, probit_logit_results_fromCPS2008.xls, that you can download.