Lecture Notes 7

Econ B2000, MA Econometrics

Kevin R Foster, CCNY

Fall 2011

 

 

 

To recap, for univariate OLS the model is yi = β0 + β1xi + ui, and the least-squares estimates are

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  β̂0 = ȳ − β̂1x̄,

so that the fitted values are ŷi = β̂0 + β̂1xi and the residuals are ûi = yi − ŷi.

 

Why OLS?  It has a variety of desirable properties, if the data being analyzed satisfy some very basic assumptions.  Largely because of this (and also because it is quite easy to calculate) it is widely used in many different fields.  (The method of least squares was first developed for astronomy.)

 

Regression Details

 

Hypotheses about regression coefficients: t-stats, p-values, and confidence intervals again!  Usually two-sided (rarely one-sided).

 

We will regularly be testing if the coefficients are significant; i.e. is there evidence in the data that the best estimate of the coefficient is different from zero?  This goes back to our original "Jump into OLS" where we looked at the difference between the Hong Kong/Singapore stock returns and the US stock returns/interest rate.  A zero slope is evidence against any relationship – this shows that the best guess of the value of Y does not depend on current information about the level of X.  So coefficient estimates that are statistically indistinguishable from zero are not evidence that the particular X variable is useful in prediction.

 

A hypothesis test of some statistical estimate uses this estimator (call it β̂) and the estimator's standard error (denote it as se(β̂)) to test against some null hypothesis value, βnull.  To make the hypothesis test, form Z = (β̂ − βnull)/se(β̂), and – here is the magic! – under certain conditions this Z will have a Standard Normal distribution (or sometimes, if there are few degrees of freedom, a t-distribution; later in more advanced stats courses, some other distribution).  The magic happens because if Z has a Standard Normal distribution then this allows me to measure if the estimate, β̂, is very far away from βnull.  It's generally tough to specify a common unit that allows me to say sensible things about "how big is big?" without some statistical measure.  The p-value of the null hypothesis tells me, "If the null hypothesis were actually true, how likely is it that I would see this β̂ value?"  A low p-value tells me that it's very unlikely that my hypothesis could be true and yet I'd see the observed values, which is evidence against the null hypothesis.

 

Often the formula, Z = (β̂ − βnull)/se(β̂), gets simpler when βnull is zero, since it is just β̂/se(β̂), and this is what SPSS prints out in the regression output labeled as "t".  This generally has a t-distribution (with enough degrees of freedom, a Standard Normal) so SPSS calculates the area in the tails beyond this value and labels it "Sig".
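To make the "t" and "Sig." columns concrete, here is a small numerical sketch in Python (any language, or even Excel, would do just as well); the estimate and standard error below are made-up numbers, not taken from any regression in these notes.

from scipy import stats

# Made-up example: a coefficient estimate and its standard error
beta_hat = 2.5      # hypothetical estimated coefficient
se_beta = 1.1       # hypothetical standard error of that estimate
beta_null = 0.0     # null hypothesis value (zero, as in SPSS's "t" column)

t_stat = (beta_hat - beta_null) / se_beta          # about 2.27

# Two-sided p-value: the area in both tails beyond |t|, here using the
# standard normal (with many degrees of freedom, t and normal nearly agree)
p_value = 2 * (1 - stats.norm.cdf(abs(t_stat)))    # about 0.023

print(t_stat, p_value)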

 

This is in Chapter 5 of Stock & Watson.

 

We know that the standard normal distribution has some important values in it, for example the values so extreme that there is just a 5% chance of observing something that far out if the true value were actually zero.  This 5% critical value is just below 2, at 1.96.  So if we find a t-statistic that is bigger than 1.96 (in absolute value) then the slope is "statistically significant"; if we find a t-statistic that is smaller than 1.96 (in absolute value) then the slope is not "statistically significant".  We can re-write these statements in terms of the slope itself instead of the t-statistic.

 

We know from above that

t = β̂1 / se(β̂1),

and we've just stated that the slope is not statistically significant if:

| β̂1 / se(β̂1) | < 1.96.

This latter statement is equivalent to:

−1.96 < β̂1 / se(β̂1) < 1.96,

which we can re-write as:

−1.96·se(β̂1) < β̂1 < 1.96·se(β̂1),

which is equivalent to saying that β̂1 lies within the interval:

[ −1.96·se(β̂1) , +1.96·se(β̂1) ].

So this gives us a "Confidence Interval" – if we observe a slope within 1.96 standard errors of zero, then the slope is not statistically significant; if we observe a slope farther from zero than 1.96 standard errors, then the slope is statistically significant.

 

This is called a "95% Confidence Interval" because this shows the range within which the observed values would fall, 95% of the time, if the true value were zero.  Different confidence intervals can be calculated with different critical values: a 90% Confidence Interval would need the critical value from the standard normal, so that 90% of the probability is within it (this is 1.64).
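As a quick check of these critical values, here is a short Python sketch (purely illustrative; the slope and standard error are invented, and a spreadsheet would work just as well) that builds the 95% and 90% intervals and reports whether the slope would be called statistically significant at each level.

from scipy import stats

b1, se = 0.70, 0.40   # hypothetical slope estimate and its standard error

for level in (0.95, 0.90):
    crit = stats.norm.ppf(1 - (1 - level) / 2)   # 1.96 for 95%, about 1.64 for 90%
    lo, hi = b1 - crit * se, b1 + crit * se
    significant = not (lo <= 0 <= hi)            # significant when the interval excludes zero
    print(level, round(crit, 2), round(lo, 3), round(hi, 3), significant)

# With these made-up numbers the slope is significant at the 90% level
# but not at the 95% level.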

 

OLS is nothing particularly special.  The Gauss-Markov Theorem tells us that OLS is BLUE: the Best Linear Unbiased Estimator (under assumptions that include homoskedasticity).  Sounds good, right?  Among the linear unbiased estimators, OLS is "best" (defined as minimizing the squared error).  But this is like being the best-looking economist – best within a very small and very particular group is not worth much!  Nonlinear estimators may be better in various situations, or we might even consider biased estimators.

 

If X is a binary dummy variable

Sometimes the variable X is a binary variable, a dummy, Di, equal to either one or zero (for example, female).  So the model Yi = b0 + b1Di + ui can be expressed as: Y has mean b0 + b1 when D = 1 and mean b0 when D = 0.  So b1 is interpreted as the difference in means between the two groups (those with D = 1 and those with D = 0).  Since it is the difference, it doesn't matter which group is specified as 1 and which is 0 – this just allows measurement of the difference between them.
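A tiny numerical sketch (in Python, with made-up data) may help convince you that the slope on a dummy really is just the difference in group means:

import numpy as np

# Made-up data: three observations with D = 0 and three with D = 1
y = np.array([10.0, 12.0, 11.0, 20.0, 22.0, 21.0])
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# OLS of y on a constant and the dummy
X = np.column_stack([np.ones_like(d), d])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0, b1)                                                  # 11.0 and 10.0
print(y[d == 0].mean(), y[d == 1].mean() - y[d == 0].mean())   # the same numbers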

 

Other 'tricks' of time trends (& functional form)

 

In addition to the standard errors of the slope and intercept estimators, the regression line itself has a standard error. 

 

A common overall assessment of the quality of the regression is the R2 (displayed automatically on the charts at the beginning by SPSS).  This is the fraction of the variance in Y that is explained by the model, so 0 ≤ R2 ≤ 1.  Bigger is usually better, although different models have different expectations (i.e. it's graded on a curve).
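The arithmetic behind R2 is simple; here is a sketch in Python with invented numbers for y and its fitted values, just to show the formula R2 = 1 − (residual sum of squares)/(total sum of squares):

import numpy as np

y     = np.array([3.0, 5.0, 4.0, 8.0, 7.0])   # made-up dependent variable
y_hat = np.array([3.5, 4.5, 4.5, 7.5, 7.0])   # made-up fitted values

ss_resid = np.sum((y - y_hat) ** 2)           # unexplained variation
ss_total = np.sum((y - y.mean()) ** 2)        # total variation in y

r_squared = 1 - ss_resid / ss_total
print(r_squared)                               # about 0.94 here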

 

Statistical significance for a univariate regression is the same as overall regression significance – if the slope coefficient estimate is statistically significantly different from zero, then this is equivalent to the statement that the overall regression explains a statistically significant part of the data variation.

 

-          Excel calculates OLS as a regression (from the Data Analysis ToolPak), as just the slope and intercept coefficients (via the SLOPE and INTERCEPT worksheet formulas), and from within a chart (with a fitted trendline)

 

Multiple Regression – more than one X variable

Regressing just one variable on another can be helpful and useful (and provides a great graphical intuition) but it doesn't get us very far.

 

Consider this example, using data from the March 2010 CPS.  We limit ourselves to only examining people with a non-zero annual wage/salary who are working fulltime (WSAL_VAL > 0 & HRCHECK = 2).  We look at the different wages reported by people who label themselves as white, African-American, Asian, Native American, and Hispanic.  There are 62,043 whites, 9,101 African-Americans, 4,476 Asians, 2,149 Native Americans, and 12,401 Hispanics in the data who fulfill this condition.

 

The average yearly salary for whites is $50,782; for African-Americans it is $39,131; for Asians $57,541; for Native Americans $38,036; for Hispanics it is $36,678.  Conventional statistical tests find that these averages are significantly different.  Does this prove discrimination?  No; there are many other reasons why groups of people could have different incomes, such as educational level or age or a multitude of other factors.  (But it is not inconsistent with a hypothesis of racism: remember the difference, when evaluating hypotheses, between 'not rejecting' and 'accepting'.)  We might reasonably break these numbers down further.

 

These groups of people are different in a variety of ways.  Their average ages are different: Hispanics average 38.72 years while non-Hispanics average 42.41 years.  So how much of the wage difference, for Hispanics, is due to the fact that they're younger?  We could do an ANOVA on this but that would omit other factors.

 

The populations also differ in gender ratios.  For whites, 57% were male; for African-Americans 46% were male; for Hispanics 59% were male.  Since gender also affects income, we might think some of the wage gap could be due, not to racial discrimination, but to gender discrimination.

 

But then they're also different in educational attainment!  Among the Hispanic workers, 30% had not finished high school; for African-Americans 8.8% had not; for whites 9% had not finished with a diploma.  And 12% of whites had an advanced degree while 8.3% of African Americans and 4.2% of Hispanics had such credentials.  The different fractions in educational attainment add credibility to the hypothesis that not all racial/ethnic variation means discrimination (in the labor market, at least – there could be discrimination in education so certain groups get less or worse education).

 

Finally they're different in what section of the country they live in, as measured by Census region.

 

So how can we keep all of these different factors straight?

 

Multiple Regression

From the standpoint of just using SPSS, there is no difference for the user between a univariate and multivariate linear regression.  Again use "Analyze\ Regression\ Linear ..." but then add a bunch of variables to the "Independent(s)" box.

 

In formulas, the model Yi = β0 + β1X1i + β2X2i + … + βkXki + ui has k explanatory variables for each of the n observations (we must have n > k).

Each coefficient estimate, notated as β̂j, when divided by its standard error has a t distribution with (n – k) degrees of freedom.
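The critical value for these t-statistics depends on the degrees of freedom; with samples as large as the CPS it is essentially the standard normal value of 1.96.  A quick illustration (in Python, using the scipy library):

from scipy import stats

# Two-sided 5% critical values from the t distribution; as the degrees of
# freedom grow, they settle down toward the standard normal value 1.96.
for df in (10, 30, 100, 1000, 77751):
    print(df, round(stats.t.ppf(0.975, df), 3))
# 10 -> 2.228, 30 -> 2.042, 100 -> 1.984, 1000 -> 1.962, 77751 -> 1.960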

 

Each coefficient represents the amount by which y would be expected to change for a small change in that particular x-variable, holding the other x-variables constant (i.e. βj = ∂Y/∂Xj).

 

Note that you must be a bit careful specifying the variables.  The CPS codes educational attainment with a bunch of numbers from 31 to 46 but these numbers have no inherent meaning.  So too race, geography, industry, and occupation.  If a person graduates high school then their grade coding changes from 38 to 39 but this must be coded with a dummy variable.  If a person moves from New York to North Dakota then this increases their state code from 36 to 38; this is not the same change as would occur for someone moving from North Dakota to Oklahoma (40) nor is it half of the change as would occur for someone moving from New York to North Carolina (37).  Each state needs a dummy variable.
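Here is a small sketch of that recoding in Python's pandas library (SPSS users would instead use the recode dialogs); the mini-dataset, the column names, and the code 43 for a 4-year degree are hypothetical, just to show the idea of one 0/1 dummy per category:

import pandas as pd

# Hypothetical mini-dataset: numeric codes that are labels, not quantities
df = pd.DataFrame({
    "educ_code": [38, 39, 43, 39],   # 38 = no diploma, 39 = HS diploma, 43 = assumed code for a 4-yr degree
    "state":     [36, 38, 40, 36],   # 36 = New York, 38 = North Dakota, 40 = Oklahoma
})

# One 0/1 dummy per category; drop_first leaves one category out of each set,
# which becomes the reference group (see the multicollinearity note later).
dummies = pd.get_dummies(df, columns=["educ_code", "state"], drop_first=True)
print(dummies)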

 

A multivariate regression can control for all of the different changes to focus on each item individually.  So we might model a person's wage/salary value as a function of their age, their gender, race/ethnicity (African-American, Asian, Native American, Hispanic), if they're an immigrant, six educational variables (high school diploma, some college but no degree, Associate's in vocational field, Associate's in academic field, a 4-year degree, or advanced degree), if they're married or divorced/widowed/separated, if they're a union member, and if they're a veteran.  Results (from the sample above, of March 2010 fulltime workers with non-zero wage), are given by SPSS as:

 

 

Model Summary

Model     R        R Square    Adjusted R Square    Std. Error of the Estimate
1         .454a    .206        .206                 46820.442

a. Predictors: (Constant), Veteran (any), African American, Education: Associate in vocational, Union member, Education: Associate in academic, Native American Indian or Alaskan or Hawaiian, Divorced or Widowed or Separated, Asian, Education: Advanced Degree, Hispanic, Female, Education: Some College but no degree, Demographics, Age, Education: 4-yr degree, Immigrant, Married, Education: High School Diploma

 

ANOVAb

Model             Sum of Squares       df    Mean Square           F      Sig.
1   Regression        4.416E13         17       2.598E12    1185.074     .000a
    Residual          1.704E14      77751       2.192E9
    Total             2.146E14      77768

a. Predictors: (Constant), Veteran (any), African American, Education: Associate in vocational, Union member, Education: Associate in academic, Native American Indian or Alaskan or Hawaiian, Divorced or Widowed or Separated, Asian, Education: Advanced Degree, Hispanic, Female, Education: Some College but no degree, Demographics, Age, Education: 4-yr degree, Immigrant, Married, Education: High School Diploma

b. Dependent Variable: Total wage and salary earnings amount - Person

 

 

Coefficientsa
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

Model                                                        B     Std. Error      Beta           t    Sig.
1   (Constant)                                       10081.754        872.477                 11.555   .000
    Demographics, Age                                  441.240         15.422      .104       28.610   .000
    Female                                          -17224.424        351.880     -.163      -48.950   .000
    African American                                 -5110.741        539.942     -.031       -9.465   .000
    Asian                                              309.850        819.738      .001         .378   .705
    Native American Indian or Alaskan or Hawaiian    -4359.733       1029.987     -.014       -4.233   .000
    Hispanic                                         -3786.424        554.159     -.026       -6.833   .000
    Immigrant                                        -3552.544        560.433     -.026       -6.339   .000
    Education: High School Diploma                    8753.275        676.683      .075       12.936   .000
    Education: Some College but no degree            15834.431        726.533      .116       21.795   .000
    Education: Associate in vocational               17391.255        976.059      .072       17.818   .000
    Education: Associate in academic                 21511.527        948.261      .093       22.685   .000
    Education: 4-yr degree                           37136.959        712.417      .293       52.128   .000
    Education: Advanced Degree                       64795.030        788.824      .400       82.141   .000
    Married                                          10981.432        453.882      .102       24.194   .000
    Divorced or Widowed or Separated                  4210.238        606.045      .028        6.947   .000
    Union member                                     -2828.590       1169.228     -.008       -2.419   .016
    Veteran (any)                                    -2863.140        666.884     -.014       -4.293   .000

a. Dependent Variable: Total wage and salary earnings amount - Person

 

For the "Coefficients" table, the "Unstandardized Coefficient B" is the estimate of βj; the "Std. Error" of the unstandardized coefficient is the standard error of that estimate, se(β̂j).  (In economics we don't generally use the standardized Beta, which rescales the coefficient by the ratio of the standard deviations of X and Y.)  The "t" given in the table is the t-statistic, β̂j/se(β̂j), and "Sig." is its p-value – the probability, if the coefficient were actually zero, of seeing an estimate as large (in absolute value) as the one that you got.  (Go back and review if you don't remember all of the details of this.)

 

See the Excel sheet for how to get predicted wages for different groups; we can then interpret the residual from the regression.
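As a back-of-the-envelope version of that Excel exercise, the predicted wage for any particular person is just the constant plus the relevant coefficients from the table above.  A short sketch in Python (the two example workers are hypothetical):

# Coefficients (B) copied from the SPSS table above
intercept, age, female = 10081.754, 441.240, -17224.424
ba_degree, adv_degree  = 37136.959, 64795.030
married                = 10981.432

# A 40-year-old married white male, non-immigrant, with a 4-year degree
wage1 = intercept + age * 40 + ba_degree + married
print(wage1)    # roughly 75,850

# The same person except female and holding an advanced degree
wage2 = intercept + age * 40 + adv_degree + married + female
print(wage2)    # roughly 86,283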

 

-          Statistical significance of coefficient estimates is more complicated for multiple regression: we can ask whether a group of variables is jointly significant, which takes a more complicated test.

 

The difference between the overall regression fit and the significance of any particular estimate is that a hypothesis test of one particular coefficient tests whether that parameter is zero; is βi = 0?  This uses the t-statistic β̂i/se(β̂i) and compares it to a Normal or t distribution (depending on the degrees of freedom).  The test of the overall regression significance asks whether ALL of the slope coefficients are simultaneously zero; is β1 = β2 = β3 = ... = βK = 0?  The latter is much more restrictive.  (See Chapter 7 of Stock & Watson.)
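The overall F statistic in the ANOVA table above is built from exactly the sums of squares that the table reports, so we can check it by hand (small differences come only from the rounding of the printed values):

# Numbers from the ANOVA table above
ss_regression, df_regression = 4.416e13, 17
ss_residual,   df_residual   = 1.704e14, 77751

ms_regression = ss_regression / df_regression   # "Mean Square" for the regression, about 2.598E12
ms_residual   = ss_residual / df_residual       # "Mean Square" for the residual, about 2.192E9

F = ms_regression / ms_residual
print(F)    # about 1185, matching the "F" column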

 

The predicted value of y is notated as ŷi, where ŷi = β̂0 + β̂1X1i + … + β̂kXki.  Its standard error is the standard error of the regression, given by SPSS as "Standard Error of the Estimate."

 

The residual is ûi = yi − ŷi.  The residual of, for example, a wage regression can be interpreted as the part of the wage that is not explained by the factors within the model.

 

Residuals are often used in analyses of productivity.  Suppose I am analyzing a chain's stores to figure out which are managed best.  I know that there are many reasons for variation in revenues and costs, so I can get data on those: how many workers are there and their pay, the location of the store relative to traffic, the rent paid, any sales or promotions going on, etc.  If I run a regression on all of those factors then I get an estimate, ŷ, of what profit would have been expected, given external factors.  Then the difference, y − ŷ, represents the unexplained or residual amount of variation: some stores would have been expected to be profitable and are indeed; some are not living up to potential; some would not have been expected to do so well but something is going on so they're doing much better than expected.
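A sketch of that store comparison (in Python, with made-up profit figures; the "predicted" column stands in for the fitted values from the regression on workers, rent, traffic, promotions, and so on):

import pandas as pd

stores = pd.DataFrame({
    "store":     ["A", "B", "C", "D"],
    "profit":    [120.0,  95.0, 150.0,  80.0],   # actual profit (made up)
    "predicted": [110.0, 105.0, 140.0,  90.0],   # fitted values from the regression (made up)
})

# Residual = actual minus predicted: the part the model cannot explain
stores["residual"] = stores["profit"] - stores["predicted"]

# Positive residual = doing better than expected; negative = not living up to potential
print(stores.sort_values("residual", ascending=False))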

 

Why do we always leave out a dummy variable?  Multicollinearity: if we included a dummy for every category, the dummies would sum to one and be perfectly collinear with the constant term, so the regression could not be estimated.  (See Chapter 6 of Stock & Watson.)

 

 

 

Heteroskedasticity-consistent errors

 

The textbook always uses heteroskedasticity-consistent errors (sometimes called Eicker-Huber-White errors, after the authors who figured out how to calculate them).  SPSS, however, does not have a drop-down option to compute heteroskedasticity-consistent standard errors.  With just a bit more work we can still produce the desired output.

 

A few notes, for those who are interested (skip this if you're not interested): Why wouldn't SPSS compute these errors?  SPSS is widely used across the social sciences but it is more prevalent in psychology and less in economics.  Economists historically worried more about heteroskedasticity while psychologists worried about other things, thus the slight difference in focus.  How is this problem solved?  SPSS was originally a programming language: instead of picking "Analyze\Regression\Linear" from some lists and then using the mouse to point variables in and out, you might have written a line of code like "REG dv=Y / iv = X".  The drop-down lists are essentially scripting up this code for you.  If you use SPSS often you might find it easier to write the program – it is a tradeoff of taking a larger fixed cost (initially learning the code) for a smaller marginal cost (running a large batch of different analyses).

 

How can we get heteroskedasticity-consistent standard errors?  Google (our goddess).  I found an SPSS macro written by Andrew F. Hayes at Ohio State University, who provides the code and documentation.  Download the macro, hcreg.sps, (from InYourClass, in the "Kevin Foster SPSS" Group) and start up SPSS.  Before you do the regressions, click "File" then "Open" then "Syntax…".  Find the file that you downloaded (hcreg.sps) and open it.  This will open the SPSS Syntax Editor.  All you need to do is choose "Run" from the top menu then "All".  There should not be any errors.  You need to run this macro each time you start up SPSS, but it will stay in memory for the entire session until you close SPSS. 

 

The macro does not add extra options to the menus, however.  To use the new functionality we need to write a bit of SPSS syntax ourselves.  For example, suppose we are using the PUMS dataset and want to regress commute time (JWMNP) on other important variables, such as Age, gender, race/ethnicity, education, and borough.

 

We will have to use the "Name" of the variable rather than the label.  This is inconvenient but not a terrible challenge.  Age conveniently has name "Age" but the gender dummy has name "female"; the race/ethnicity variables are "africanamerican" "nativeamerican" "asianamerican" "raceother" and "Hispanic"; education is "educ_hs" "educ_somecoll" "educ_collassoc" "educ_coll" and "educ_adv"; boroughs are "boro_bx" "boro_si" "boro_bk" and "boro_qns".  (Note that we leave one out for education and borough.)

 

Go back to the SPSS Syntax Editor: from the Data View choose "File" "New" "Syntax".  This will re-open the editor on a blank page.  Type:

 

HCREG dv = JWMNP/iv = Age female africanamerican nativeamerican asianamerican raceother Hispanic educ_hs educ_somecoll educ_collassoc educ_coll educ_adv boro_bx boro_si boro_bk boro_qns.

 

Then go to "Run" on the top menu and choose "All" and watch it spit out the output.

 

Your output should look like this,

 

Run MATRIX procedure:

 

HC Method

 3

 

Criterion Variable

 JWMNP

 

Model Fit:

       R-sq          F        df1        df2          p

      .0475   491.2978    16.0000 132326.000      .0000

 

Heteroscedasticity-Consistent Regression Results

              Coeff     SE(HC)          t      P>|t|

Constant    26.7397      .3700    72.2637      .0000

Age           .0450      .0054     8.3550      .0000

female       -.2820      .1404    -2.0085      .0446

africana     7.9424      .1999    39.7312      .0000

nativeam     4.2621     1.3060     3.2635      .0011

asianame     5.2494      .2270    23.1237      .0000

raceothe     3.5011      .2720    12.8696      .0000

Hispanic     1.9585      .2269     8.6317      .0000

educ_hs     -1.1125      .2701    -4.1192      .0000

educ_som     -.7601      .2856    -2.6611      .0078

educ_col      .2148      .3495      .6145      .5389

educ_c_1     1.1293      .2720     4.1517      .0000

educ_adv    -1.3747      .2847    -4.8281      .0000

boro_bx      8.3718      .2564    32.6485      .0000

boro_si     12.7391      .3643    34.9712      .0000

boro_bk      9.6316      .1882    51.1675      .0000

boro_qns    10.2350      .1932    52.9754      .0000

 

------ END MATRIX -----

 

Did that seem like a pain?  OK, here's an easier way that also adds some error-checking, so it is more robust.

 

First do a regular OLS regression with drop-down menus in SPSS.  Do the same regression as above, with travel time as dependent and the other variables as independent, and note that just before the output you'll see something like this,

 

REGRESSION  

/MISSING LISTWISE  

/STATISTICS COEFF OUTS R ANOVA  

/CRITERIA=PIN(.05) POUT(.10)   

/NOORIGIN  

/DEPENDENT JWMNP  

/METHOD=ENTER Age female africanamerican nativeamerican asianamerican raceother Hispanic educ_hs educ_somecoll educ_collassoc educ_coll educ_adv boro_bx boro_si boro_bk boro_qns.

 

This is the SPSS code that your drop-down menus created.  You can ignore most of it but realize that it gives a list of all of the variable names (after "/METHOD=ENTER ") so you can do this regression and just copy-and-paste that generated list into the hcreg syntax.

 

The other advantage of doing it this way first is that it will point out any errors you make.  If you put in too many dummy variables then SPSS will take one out (and note that in "Variables Removed" at the beginning of the output).  If that happens, take that variable out of the list you give to hcreg, or else it will cause errors.  If the SPSS regression finds other errors then those must be fixed before using the hcreg syntax.

 

The general template for this command is "HCREG", the name of the macro, then "DV = " with the name of the Dependent Variable, "IV = " with the names of the Independent Variables, and then a period to mark the end of a command line.

 

The macro actually allows some more fanciness.  It contains 4 different methods of computing the heteroskedasticity-consistent errors.  If you follow the "IV = " list with "/method = " and a number from 1 to 5 then you will get slightly different errors.  The default is method 3.  If you type "/method = 5" then it will give the homoskedastic errors (the same results as if you did the ordinary regression with the SPSS menus).
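(For the curious: the same kind of heteroskedasticity-consistent standard errors are available in other packages.  For instance, in Python's statsmodels library you could fit the commute-time regression and ask for an HC-type covariance matrix.  The sketch below assumes a data file name and uses only a few of the variables, and I am not claiming that statsmodels' HC numbering matches the macro's method numbers.)

import pandas as pd
import statsmodels.formula.api as smf

pums = pd.read_csv("pums_commute.csv")   # assumed file with the PUMS variables used above

model = smf.ols("JWMNP ~ Age + female + africanamerican + Hispanic", data=pums)

robust  = model.fit(cov_type="HC3")   # heteroskedasticity-consistent standard errors
classic = model.fit()                 # ordinary (homoskedastic) standard errors

# The coefficient estimates are identical; only the standard errors differ.
print(robust.summary())
print(classic.summary())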

 

The macro additionally allows you to set the constant term equal to zero by adding "/constant = 0"; "/covmat = 1" to print the entire covariance matrix; or "/test = q" to test if the last q variables all have coefficients equal to zero.  Prof. Hayes did a very nice job, didn't he?  Go to his web page for complete documentation.

 

The Syntax Editor can be useful for particular tasks, especially those that are repetitive.  Many of the drop-down commands offer a choice of "Paste Syntax" which will show the syntax for the command that you just implicitly created with the menus, which allows you to begin to learn some of the commands.  The Syntax Editor also allows you to save the list of commands if you're doing them repeatedly.

 

This syntax, to perform the regressions, is

 

HCREG dv = JWMNP/iv = Age female africanamerican nativeamerican asianamerican raceother Hispanic educ_hs educ_somecoll educ_collassoc educ_coll educ_adv boro_bx boro_si boro_bk boro_qns.

 

HCREG dv = JWMNP/iv = Age female africanamerican nativeamerican asianamerican raceother Hispanic educ_hs educ_somecoll educ_collassoc educ_coll educ_adv boro_bx boro_si boro_bk boro_qns

/method = 5 .

 

Run those in SPSS, along with the regression from the drop-down menus, for comparison.  You will see that the results from the homoskedastic method=5 run and from the drop-down menus are identical.  More precisely, all of the coefficient estimates are the same in every version, but the standard errors (and therefore the t-statistics and thus the p-values or Sig) differ between the two hcreg versions, while hcreg method 5 delivers the same results as SPSS's drop-down menus.