Lecture Notes 6

Econ B2000, MA Econometrics

Kevin R Foster, CCNY

Fall 2012

 

 

 

To recap for univariate OLS: the estimated slope and intercept are

$\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$ and $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$,

so that the fitted values are $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$ and the residuals are $\hat{u}_i = Y_i - \hat{Y}_i$.
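These notes do everything in SPSS; purely as a numerical check of the formulas above, here is a minimal sketch in Python on simulated data (the numbers and variable names are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(10, 2, size=200)
    y = 3.0 + 1.5 * x + rng.normal(0, 1, size=200)   # true intercept 3.0, true slope 1.5

    # OLS slope and intercept from the formulas above
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)               # estimates close to 3.0 and 1.5

    # The same numbers from a library routine (returns slope first, then intercept)
    print(np.polyfit(x, y, 1))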

 

Why OLS?  It has a variety of desirable properties, if the data being analyzed satisfy some very basic assumptions.  Largely because of this (and also because it is quite easy to calculate) it is widely used in many different fields.  (The method of least squares was first developed for astronomy.)

 

Regression Details

 

Hypotheses about regression coefficients: t-stats, p-values, and confidence intervals again!  Usually two-sided (rarely one-sided).

 

We will regularly be testing if the coefficients are significant; i.e. is there evidence in the data that the best estimate of the coefficient is different from zero?  This goes back to our original "Jump into OLS" where we looked at the difference between the Hong Kong/Singapore stock returns and the US stock returns/interest rate.  A zero slope is evidence against any relationship – this shows that the best guess of the value of Y does not depend on current information about the level of X.  So coefficient estimates that are statistically indistinguishable from zero are not evidence that the particular X variable is useful in prediction.

 

A hypothesis test of some statistical estimate uses this estimator (call it $\hat\beta$) and the estimator's standard error (denote it as $SE(\hat\beta)$) to test against some null hypothesis value, $\beta_{null}$.  To make the hypothesis test, form $Z = \frac{\hat\beta - \beta_{null}}{SE(\hat\beta)}$, and – here is the magic! – under certain conditions this Z will have a Standard Normal distribution (or sometimes, if there are few degrees of freedom, a t-distribution; later, in more advanced stats courses, some other distribution).  The magic happens because if Z has a Standard Normal distribution then this allows me to measure whether the estimate, $\hat\beta$, is very far away from $\beta_{null}$.  It's generally tough to specify a common unit that allows me to say sensible things about "how big is big?" without some statistical measure.  The p-value of the null hypothesis tells me, "If the null hypothesis were actually true, how likely is it that I would see this $\hat\beta$ value?"  A low p-value tells me that it's very unlikely that my hypothesis could be true and yet I'd see the observed values, which is evidence against the null hypothesis.

 

Often the formula, $Z = \frac{\hat\beta - \beta_{null}}{SE(\hat\beta)}$, gets simpler when $\beta_{null}$ is zero, since it is just $\frac{\hat\beta}{SE(\hat\beta)}$, and this is what SPSS prints out in the regression output labeled as "t".  This generally has a t-distribution (with enough degrees of freedom, a Standard Normal), so SPSS calculates the area in the tails beyond this value and labels it "Sig".
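Purely as an illustration of that arithmetic (outside of SPSS), a minimal Python sketch with made-up values for the estimate and its standard error:

    from scipy import stats

    beta_hat = 0.75    # hypothetical coefficient estimate
    se_beta = 0.30     # hypothetical standard error of that estimate
    beta_null = 0.0    # null hypothesis value

    # The test statistic: how many standard errors the estimate is from the null
    z = (beta_hat - beta_null) / se_beta

    # Two-sided p-value from the standard normal (the large-sample approximation)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    print(z, p_value)  # 2.5 and about 0.012 -- this is what SPSS labels "t" and "Sig."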

 

This is in Chapter 5 of Stock & Watson.

 

We know that the standard normal distribution has some important critical values: for example, the values that are so extreme that there is just a 5% chance of observing something that far away (or farther) if the true value were actually zero.  This 5% critical value is just below 2, at 1.96.  So if we find a t-statistic that is bigger than 1.96 (in absolute value) then the slope is "statistically significant"; if we find a t-statistic that is smaller than 1.96 (in absolute value) then the slope is not "statistically significant".  We can re-write these statements in terms of the value of the slope itself instead of the t-statistic.

 

We know from above that

$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \frac{\hat\beta_1}{SE(\hat\beta_1)}$,

and we've just stated that the slope is not statistically significant if:

$\left| t \right| \le 1.96$.

This latter statement is equivalent to:

$\left| \frac{\hat\beta_1}{SE(\hat\beta_1)} \right| \le 1.96$,

which we can re-write as:

$\left| \hat\beta_1 \right| \le 1.96 \cdot SE(\hat\beta_1)$,

which is equivalent to:

$-1.96 \cdot SE(\hat\beta_1) \le \hat\beta_1 \le 1.96 \cdot SE(\hat\beta_1)$.

So this gives us a "Confidence Interval" – if we observe a slope within 1.96 standard errors of zero, then the slope is not statistically significant; if we observe a slope farther from zero than 1.96 standard errors, then the slope is statistically significant.

 

This is called a "95% Confidence Interval" because it shows the range within which the observed values would fall, 95% of the time, if the true value were zero.  Different confidence intervals can be calculated with different critical values: a 90% Confidence Interval would use the critical value from the standard normal that puts 90% of the probability within it (this is 1.64).
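A minimal sketch, again with hypothetical numbers, of where the 1.96 and 1.64 critical values come from and how the interval is formed:

    from scipy import stats

    beta_hat = 0.75    # hypothetical slope estimate
    se_beta = 0.30     # hypothetical standard error

    crit_95 = stats.norm.ppf(0.975)   # about 1.96: leaves 2.5% in each tail
    crit_90 = stats.norm.ppf(0.95)    # about 1.64: leaves 5% in each tail

    # 95% confidence interval around the estimate
    ci_95 = (beta_hat - crit_95 * se_beta, beta_hat + crit_95 * se_beta)
    print(crit_95, crit_90, ci_95)

    # Equivalent significance check: is the estimate more than 1.96 standard errors from zero?
    print(abs(beta_hat / se_beta) > crit_95)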

 

OLS is nothing particularly special.  The Gauss-Markov Theorem tells us that OLS is BLUE: Best Linear Unbiased Estimator (which also requires assuming homoskedasticity).  Sounds good, right?  Among the linear unbiased estimators, OLS is "best" (defined as minimizing the squared error).  But this is like being the best-looking economist – best within a very small and very particular group is not worth much!  Nonlinear estimators may be good in various situations, or we might even consider biased estimators.

 

If X is a binary dummy variable

Sometimes the variable X is a binary variable, a dummy, $D_i$, equal to either one or zero (for example, female).  So the model is $Y_i = \beta_0 + \beta_1 D_i + u_i$, which can be expressed as $E(Y_i \mid D_i = 1) = \beta_0 + \beta_1$ and $E(Y_i \mid D_i = 0) = \beta_0$.  So this is just saying that Y has mean $\beta_0 + \beta_1$ in some cases and mean $\beta_0$ in other cases.  So $\beta_1$ is interpreted as the difference in means between the two groups (those with D = 1 and those with D = 0).  Since it is the difference, it doesn't matter which group is specified as 1 and which is 0 – this just allows measurement of the difference between them.
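To see the "difference in means" interpretation concretely, here is a small simulated check in Python (not part of the SPSS output; the numbers are invented):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    d = rng.integers(0, 2, size=1000)                        # dummy variable, 0 or 1
    y = 40000 - 5000 * d + rng.normal(0, 10000, size=1000)   # simulated wages

    fit = sm.OLS(y, sm.add_constant(d)).fit()
    print(fit.params)        # [b0, b1]: b0 estimates the D=0 group mean, b1 the difference

    # The slope equals the difference in group means exactly
    print(y[d == 1].mean() - y[d == 0].mean())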

 

Other 'tricks' of time trends (& functional form)

 

In addition to the standard errors of the slope and intercept estimators, the regression line itself has a standard error. 

 

A common overall assessment of the quality of the regression is the R² (displayed automatically by SPSS on the charts at the beginning).  This is the fraction of the variance in Y that is explained by the model, so 0 ≤ R² ≤ 1.  Bigger is usually better, although different models have different expectations (i.e. it's graded on a curve).
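For reference, in terms of sums of squares (using Stock & Watson's notation, where ESS is the explained sum of squares, SSR the sum of squared residuals, and TSS the total sum of squares):

$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} = 1 - \frac{\sum_{i=1}^{n} \hat{u}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$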

 

Statistical significance for a univariate regression is the same as overall regression significance – if the slope coefficient estimate is statistically significantly different from zero, then this is equivalent to the statement that the overall regression explains a statistically significant part of the data variation.

 

-          Excel calculates OLS in several ways: as a regression (from the Data Analysis ToolPak), as just the slope and intercept coefficients (worksheet formula values), and from within a chart

 

Multiple Regression – more than one X variable

Regressing just one variable on another can be helpful and useful (and provides a great graphical intuition) but it doesn't get us very far.

 

Consider this example, using data from the March 2010 CPS.  We limit ourselves to only examining people with a non-zero annual wage/salary who are working fulltime (WSAL_VAL > 0 & HRCHECK = 2).  We look at the different wages reported by people who label themselves as white, African-American, Asian, Native American, and Hispanic.  There are 62,043 whites, 9,101 African-Americans, 4,476 Asians, 2,149 Native Americans, and 12,401 Hispanics in the data who fulfill this condition.

 

The average yearly salary for whites is $50,782; for African-Americans it is $39,131; for Asians $57,541; for Native Americans $38,036; for Hispanics it is $36,678.  Conventional statistical tests find that these averages are significantly different.  Does this prove discrimination?  No; there are many other reasons why groups of people could have different incomes, such as educational level or age or a multitude of other factors.  (But it is not inconsistent with a hypothesis of racism: remember the difference, when evaluating hypotheses, between 'not rejecting' and 'accepting'.)  We might reasonably break these numbers down further.

 

These groups of people are different in a variety of ways.  Their average ages differ: Hispanics average 38.72 years while non-Hispanics average 42.41 years.  So how much of the wage difference, for Hispanics, is due to the fact that they're younger?  We could do an ANOVA on this but that would omit other factors.

 

The populations also differ in gender ratios.  For whites, 57% were male; for African-Americans 46% were male; for Hispanics 59% were male.  Since gender also affects income, we might think some of the wage gap could be due, not to racial discrimination, but to gender discrimination.

 

But then they're also different in educational attainment!  Among the Hispanic workers, 30% had not finished high school; for African-Americans 8.8% had not; for whites 9% had not finished with a diploma.  And 12% of whites had an advanced degree while 8.3% of African Americans and 4.2% of Hispanics had such credentials.  The different fractions in educational attainment add credibility to the hypothesis that not all racial/ethnic variation means discrimination (in the labor market, at least – there could be discrimination in education so certain groups get less or worse education).

 

Finally they're different in what section of the country they live in, as measured by Census region.

 

So how can we keep all of these different factors straight?

 

Multiple Regression

From the standpoint of just using SPSS, there is no difference for the user between a univariate and multivariate linear regression.  Again use "Analyze\ Regression\ Linear ..." but then add a bunch of variables to the "Independent(s)" box.

 

In formulas, the model has k explanatory variables for each of the n observations (we must have n > k): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i$.

Each coefficient estimate, notated as $\hat\beta_j$, when standardized (divided by its standard error) has a t distribution with (n – k – 1) degrees of freedom.

 

Each coefficient represents the amount by which y would be expected to change, for a small change in that particular x-variable, holding the other x-variables constant (i.e. $\hat\beta_j = \partial \hat{Y} / \partial X_j$).

 

Note that you must be a bit careful specifying the variables.  The CPS codes educational attainment with a bunch of numbers from 31 to 46 but these numbers have no inherent meaning.  So too race, geography, industry, and occupation.  If a person graduates high school then their grade coding changes from 38 to 39 but this must be coded with a dummy variable.  If a person moves from New York to North Dakota then this increases their state code from 36 to 38; this is not the same change as would occur for someone moving from North Dakota to Oklahoma (40) nor is it half of the change as would occur for someone moving from New York to North Carolina (37).  Each state needs a dummy variable.
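One way to see the mechanics of this recoding (a Python/pandas sketch rather than the SPSS recoding used in the course; the column names and codes are hypothetical):

    import pandas as pd

    # A tiny hypothetical CPS-style extract: numeric codes for state and education
    df = pd.DataFrame({"state_code": [36, 38, 40, 36, 37],
                       "educ_code":  [38, 39, 43, 46, 39]})

    # Treat the codes as categories rather than numbers: one dummy per category,
    # dropping one category from each set to serve as the omitted reference group
    dummies = pd.get_dummies(df[["state_code", "educ_code"]].astype("category"),
                             drop_first=True)
    print(dummies)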

 

A multivariate regression can control for all of the different changes to focus on each item individually.  So we might model a person's wage/salary value as a function of their age, their gender, race/ethnicity (African-American, Asian, Native American, Hispanic), whether they're an immigrant, six educational variables (high school diploma, some college but no degree, Associate's in a vocational field, Associate's in an academic field, a 4-year degree, or an advanced degree), whether they're married or divorced/widowed/separated, whether they're a union member, and whether they're a veteran.  Results (from the sample above, of March 2010 fulltime workers with non-zero wage) are given by SPSS as:

 

 

Model Summary

Model         R   R Square   Adjusted R Square   Std. Error of the Estimate
1       .454(a)       .206                .206                    46820.442

a. Predictors: (Constant), Veteran (any), African American, Education: Associate in vocational, Union member, Education: Associate in academic, Native American Indian or Alaskan or Hawaiian, Divorced or Widowed or Separated, Asian, Education: Advanced Degree, Hispanic, Female, Education: Some College but no degree, Demographics, Age, Education: 4-yr degree, Immigrant, Married, Education: High School Diploma

 

ANOVA(b)

Model            Sum of Squares      df   Mean Square          F     Sig.
1   Regression         4.416E13      17      2.598E12   1185.074     .000(a)
    Residual           1.704E14   77751       2.192E9
    Total              2.146E14   77768

a. Predictors: (Constant), Veteran (any), African American, Education: Associate in vocational, Union member, Education: Associate in academic, Native American Indian or Alaskan or Hawaiian, Divorced or Widowed or Separated, Asian, Education: Advanced Degree, Hispanic, Female, Education: Some College but no degree, Demographics, Age, Education: 4-yr degree, Immigrant, Married, Education: High School Diploma

b. Dependent Variable: Total wage and salary earnings amount - Person

 

 

Coefficients(a)

Model                                              Unstandardized B   Std. Error   Standardized Beta         t     Sig.
1  (Constant)                                             10081.754      872.477                        11.555     .000
   Demographics, Age                                        441.240       15.422                .104    28.610     .000
   Female                                                -17224.424      351.880               -.163   -48.950     .000
   African American                                       -5110.741      539.942               -.031    -9.465     .000
   Asian                                                    309.850      819.738                .001      .378     .705
   Native American Indian or Alaskan or Hawaiian          -4359.733     1029.987               -.014    -4.233     .000
   Hispanic                                               -3786.424      554.159               -.026    -6.833     .000
   Immigrant                                              -3552.544      560.433               -.026    -6.339     .000
   Education: High School Diploma                          8753.275      676.683                .075    12.936     .000
   Education: Some College but no degree                  15834.431      726.533                .116    21.795     .000
   Education: Associate in vocational                     17391.255      976.059                .072    17.818     .000
   Education: Associate in academic                       21511.527      948.261                .093    22.685     .000
   Education: 4-yr degree                                 37136.959      712.417                .293    52.128     .000
   Education: Advanced Degree                             64795.030      788.824                .400    82.141     .000
   Married                                                10981.432      453.882                .102    24.194     .000
   Divorced or Widowed or Separated                        4210.238      606.045                .028     6.947     .000
   Union member                                           -2828.590     1169.228               -.008    -2.419     .016
   Veteran (any)                                          -2863.140      666.884               -.014    -4.293     .000

a. Dependent Variable: Total wage and salary earnings amount - Person

 

For the "Coefficients" table, the "Unstandardized B" is the coefficient estimate, $\hat\beta_j$, and its "Std. Error" is the standard error of that estimate, $SE(\hat\beta_j)$.  (In economics we don't generally use the standardized Beta, which rescales the coefficient estimate by the standard deviations of the X and Y variables.)  The "t" given in the table is the t-statistic, $t = \frac{\hat\beta_j}{SE(\hat\beta_j)}$, and "Sig." is its p-value – the probability, if the coefficient were actually zero, of seeing an estimate as large (in absolute value) as the one that you got.  (Go back and review if you don't remember all of the details of this.)
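For example, the "Demographics, Age" row of the table above gives $t = \frac{441.240}{15.422} \approx 28.6$, matching the reported t of 28.610; the probability of a standard normal draw that far from zero is essentially nil, so "Sig." is reported as .000.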

 

See the Excel sheet to see how to get predicted wages for different groups.  We can then interpret the residual from the regression.

 

-          Statistical significance of coefficient estimates is more complicated for multiple regression: we can ask whether a group of variables is jointly significant, which takes a more complicated test.

 

The difference between the overall regression fit and the significance of any particular estimate is that a hypothesis test of one particular coefficient tests whether that parameter is zero: is $\beta_j = 0$?  This uses the t-statistic $t = \frac{\hat\beta_j}{SE(\hat\beta_j)}$ and compares it to a Normal or t distribution (depending on the degrees of freedom).  The test of overall regression significance tests whether ALL of the slope coefficients are simultaneously zero: is $\beta_1 = \beta_2 = \beta_3 = \cdots = \beta_K = 0$?  The latter is much more restrictive.  (See Chapter 7 of Stock & Watson.)
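A sketch of a joint test in Python with statsmodels, on synthetic data (the variable names and coefficients are invented; the course itself does this in SPSS):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in data, for illustration only
    rng = np.random.default_rng(2)
    n = 5000
    df = pd.DataFrame({"age": rng.integers(18, 65, size=n),
                       "female": rng.integers(0, 2, size=n),
                       "educ_ba": rng.integers(0, 2, size=n),
                       "educ_adv": rng.integers(0, 2, size=n)})
    df["wage"] = (10000 + 400 * df["age"] - 15000 * df["female"]
                  + 35000 * df["educ_ba"] + 60000 * df["educ_adv"]
                  + rng.normal(0, 40000, size=n))

    # Unrestricted model vs. a restricted model that drops the education dummies
    unrestricted = smf.ols("wage ~ age + female + educ_ba + educ_adv", data=df).fit()
    restricted = smf.ols("wage ~ age + female", data=df).fit()

    # F-test of the joint null that both education coefficients are zero
    f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
    print(f_stat, p_value, df_diff)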

 

The predicted value of y is notated as $\hat{Y}_i$, where $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \cdots + \hat\beta_k X_{ki}$.  Its standard error is the standard error of the regression, given by SPSS as the "Std. Error of the Estimate."

 

The residual is $\hat{u}_i = Y_i - \hat{Y}_i$.  The residual of, for example, a wage regression can be interpreted as the part of the wage that is not explained by the factors within the model.
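Continuing the statsmodels sketch above (this assumes the fitted result `unrestricted` from that synthetic example), the predicted values and residuals are available directly:

    # Predicted wage for each observation and the part the model does not explain
    y_hat = unrestricted.fittedvalues
    resid = unrestricted.resid          # equals df["wage"] - y_hat

    # For instance, the observation with the largest positive residual earns the most
    # relative to what its characteristics would predict
    print(resid.idxmax(), resid.max())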

 

Residuals are often used in analyses of productivity.  Suppose I am analyzing a chain's stores to figure out which are managed best.  I know that there are many reasons for variation in revenues and costs, so I get data on those: how many workers there are and their pay, the location of the store relative to traffic, the rent paid, any sales or promotions going on, etc.  If I run a regression on all of those factors then I get an estimate, $\hat{Y}_i$, of what profit would have been expected, given external factors.  Then the difference, $Y_i - \hat{Y}_i$, represents the unexplained or residual amount of variation: some stores would have been expected to be profitable and indeed are; some are not living up to potential; some would not have been expected to do so well but something is going on so they're doing much better than expected.

 

Why do we always leave out one dummy variable from each set?  Perfect multicollinearity (the "dummy variable trap").  (See Chapter 6 of Stock & Watson.)

 

 

 

Heteroskedasticity-consistent errors

 

You can choose to use heteroskedasticity-consistent standard errors, as in the textbook, using hcreg.sps.
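For comparison, a minimal sketch of the same idea in Python with statsmodels (reusing the synthetic data frame `df` from the earlier sketch; HC1 is one common heteroskedasticity-consistent choice):

    import statsmodels.formula.api as smf

    # Robust (heteroskedasticity-consistent) standard errors requested at fit time;
    # the coefficient estimates are unchanged, only the standard errors differ
    robust = smf.ols("wage ~ age + female + educ_ba + educ_adv", data=df).fit(cov_type="HC1")
    print(robust.bse)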