SPSS Examples

Econ 29000, Principles of Statistics

Kevin R Foster, CCNY

Spring 2011

 

 

Hypothesis Tests

 

Using the ATUS data, we want to compare the amount of time people spend with kids.

 

So load the ATUS dataset into SPSS to begin with a screen like this:

That just shows the data.

 

Open the syntax file that I gave you, "classification.sps", with "File \ Open \ Syntax" to get this:

Then "Run \ All" from that menu.

 

This creates some broad classifications – you could make your own but for now this allows us to all be looking at the data in the same way.

 

One of the created variables is "t_kids", the time spent with children (either the household's own children or non-household children).

 

We will look at how education levels change the amount of time people spend with their children.  There are two separate decisions that change this time: first, does the household have any children at all; second, if they have kids, how much time do they spend?

 

Begin by looking at the first: what fraction have kids?  We want to know, of households that could have kids, what fraction have them?  This "of households that could have kids" is important since we know that elderly people are unlikely to have kids (although it can happen; they might have custody of grandchildren).  But it's plausible to restrict the analysis to just people from, say, 20-50 years old.

 

So "Data \ Select Cases" brings this screen:

And click "If Condition is Satisfied" and the button for "If ..." to bring up this screen:

Type and/or use the buttons to create "(TEAGE > 20) & (TEAGE < 50)" to select people in the 20-50 year old group.  (Strictly, with whole-number ages, > and < keep ages 21 through 49; you could use >= and <= instead, or argue for different age ranges; all of these variations can be explored by you later.)  Click "Continue" and then "OK" and SPSS will display some output verifying that this was done.

 

Now I will use "Analyze \ Descriptive Statistics \ Explore" (a shortcut is discussed below).  The "Dependent List" is "has_kids" and the "Factor List" is "ed_categories" (labelled "education categories").  (I choose the radio button at the bottom to display just "Statistics", not "Plots", but this is just my choice.)

 

This gives this output:

 

EXAMINE VARIABLES=has_kids BY ed_categories   /PLOT NONE   /STATISTICS DESCRIPTIVES   /CINTERVAL 95   /MISSING LISTWISE   /NOTOTAL.

 

Explore

 

education categories

 

Case Processing Summary

                                    Cases
                                               Valid         Missing           Total
              education categories        N  Percent      N  Percent      N  Percent
Has children  less than high school    4700   100.0%      0      .0%   4700   100.0%
              high school diploma     13223   100.0%      0      .0%  13223   100.0%
              some college            15465   100.0%      0      .0%  15465   100.0%
              college degree          12388   100.0%      0      .0%  12388   100.0%
              advanced degree          5796   100.0%      0      .0%   5796   100.0%

 

Descriptives

Has children, by education categories

                              less than  high school         some      college     advanced
Statistic                   high school      diploma      college       degree       degree
Mean                              .7377        .6936        .6779        .6606        .6839
  Std. Error of Mean             .00642       .00401       .00376       .00425       .00611
95% Confidence Interval for Mean
  Lower Bound                     .7251        .6858        .6705        .6522        .6719
  Upper Bound                     .7502        .7015        .6852        .6689        .6959
5% Trimmed Mean                   .7641        .7152        .6976        .6784        .7044
Median                           1.0000       1.0000       1.0000       1.0000       1.0000
Variance                           .194         .213         .218         .224         .216
Std. Deviation                   .43995       .46100       .46731       .47354       .46498
Minimum                             .00          .00          .00          .00          .00
Maximum                            1.00         1.00         1.00         1.00         1.00
Range                              1.00         1.00         1.00         1.00         1.00
Interquartile Range                1.00         1.00         1.00         1.00         1.00
Skewness                         -1.081        -.840        -.761        -.678        -.791
  Std. Error of Skewness           .036         .021         .020         .022         .032
Kurtosis                          -.832       -1.294       -1.421       -1.540       -1.374
  Std. Error of Kurtosis           .071         .043         .039         .044         .064

 

 

Which is rather long because it gives us so many measures!  We might want a bit of a shortcut, once we figure out which measures we're truly interested in.

 

So use "Analyze \ Reports \ Case Summaries" and put "has_kids" into "Variables" and "ed_categories" (labelled "education categories") into "Grouping Variable(s)".  Un-check "Display Cases" and click "Statistics" to choose Number of Cases, Mean, and Standard Deviation.

Then click "Continue" and then "OK", which will run and give this output:

 

Summarize

Case Processing Summary

                                     Cases
                                             Included        Excluded           Total
                                           N  Percent      N  Percent      N  Percent
Has children * education categories    51572   100.0%      0      .0%  51572   100.0%

Case Summaries

Has children

education categories          N      Mean  Std. Deviation
less than high school      4700     .7377          .43995
high school diploma       13223     .6936          .46100
some college              15465     .6779          .46731
college degree            12388     .6606          .47354
advanced degree            5796     .6839          .46498
Total                     51572     .6839          .46497

 

Which is much easier to read.  From either output, we can see that, among people who are 20-50 years old, 74% of those with less than a high school education have kids, as do 69% of those with a high-school diploma, 68% of those with some college, etc.

 

We want to test: is the fraction of people with kids different by educational qualification?  Is 74% a "big" difference from 69%, or could it just be due to random error?

 

For this simple test we find the difference is 0.7377 - 0.6936 = 0.0441.  To find the standard error of this difference in means, we first find the standard error of each mean, which is 0.43995/sqrt(4700) = 0.006417 and 0.46100/sqrt(13223) = 0.004009.  To find the standard error of the difference in means, square each standard error, add them, and take the square root.  With se(A) = σA/sqrt(NA) (where se(A) is the standard error of the average of A, σA is the standard deviation of the sample A, and NA is the number of observations in sample A) and analogously se(B) = σB/sqrt(NB), find the standard error of the difference as sqrt(se(A)^2 + se(B)^2) = sqrt(0.006417^2 + 0.004009^2) = 0.007567.
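If you would rather check the standard-error arithmetic outside of SPSS or Excel, here is a minimal Python sketch.  The function names se_mean and se_diff are just labels made up for this example; the numbers are the N and Std. Deviation values from the Case Summaries output for the first two education groups.

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)

def se_diff(se_a, se_b):
    """Standard error of a difference in means: square, add, take the root."""
    return sqrt(se_a ** 2 + se_b ** 2)

# less than high school: sd = 0.43995, N = 4700
se_a = se_mean(0.43995, 4700)
# high school diploma: sd = 0.46100, N = 13223
se_b = se_mean(0.46100, 13223)
se_d = se_diff(se_a, se_b)

print(f"se(A) = {se_a:.6f}")
print(f"se(B) = {se_b:.6f}")
print(f"se(A - B) = {se_d:.6f}")
```

The three printed values should agree with the 0.006417, 0.004009 and 0.007567 found by hand, up to rounding.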

 

So find the Z-statistic, the standardized value of the difference in the means, as the actual difference, 0.0441, minus the difference hypothesized by the null (which is zero), divided by its standard error, so Z = (0.0441 - 0)/0.007567 = 5.83.

 

What is the probability, if the true difference were zero, that I could observe a value as large (in absolute value) as 5.83?  The area in the tails beyond ±5.83 is, from taking a look at the graph,

[Figure: standard normal density (emetrics_m_fig4)]

really tiny since 5.83 is off the edge of the picture.  Use Excel to find that this is 0.000000006, which is zero if rounded to 3 or 4 decimal places.  So there is essentially a zero probability that there could actually be no difference, yet we would observe a 4.4 percentage point difference in the data.  We reject the null hypothesis that there is no difference; the data allow us to conclude that the difference is statistically significant.
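The same Z-statistic and tail area can be sketched in Python instead of Excel.  This is only an illustration: two_sample_z and two_tailed_p are names invented here, and the tail area uses the standard identity that the two-tailed area beyond |z| equals erfc(|z|/sqrt(2)).

```python
from math import erfc, sqrt

def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b, null_diff=0.0):
    """Z-statistic for a difference in means, with se = sd/sqrt(n) per sample."""
    se_diff = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b - null_diff) / se_diff

def two_tailed_p(z):
    """Area in both tails of the standard normal beyond |z|."""
    return erfc(abs(z) / sqrt(2))

# has_kids: less than high school vs. high school diploma
z = two_sample_z(0.7377, 0.43995, 4700, 0.6936, 0.46100, 13223)
p = two_tailed_p(z)

print(f"Z = {z:.2f}")
print(f"two-tailed p-value = {p:.1e}")
```

The Z-statistic should print as about 5.83, and the p-value should come out on the order of a few billionths, consistent with the Excel figure quoted above.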

 

You can and should be able to do the rest of the tests for the other educational categorizations.
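As a way to check your own answers, here is a hedged sketch that runs the same test on each pair of neighboring education categories.  The groups list is copied from the Case Summaries output for has_kids; the choice to compare only adjacent categories, and the helper name z_test, are assumptions of this example rather than anything required by the method.

```python
from math import erfc, sqrt

# (label, N, mean, std. deviation) from the Case Summaries output for has_kids
groups = [
    ("less than high school", 4700, 0.7377, 0.43995),
    ("high school diploma", 13223, 0.6936, 0.46100),
    ("some college", 15465, 0.6779, 0.46731),
    ("college degree", 12388, 0.6606, 0.47354),
    ("advanced degree", 5796, 0.6839, 0.46498),
]

def z_test(a, b):
    """Return (Z, two-tailed p) for the difference in means between groups a and b."""
    _, n_a, mean_a, sd_a = a
    _, n_b, mean_b, sd_b = b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    return z, erfc(abs(z) / sqrt(2))

results = []
for a, b in zip(groups, groups[1:]):  # each category against the next one
    z, p = z_test(a, b)
    results.append((a[0], b[0], z, p))
    print(f"{a[0]} vs {b[0]}: Z = {z:.2f}, p = {p:.4f}")
```

To compare any other pair (say the first and last categories), call z_test(groups[0], groups[4]) directly.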

 

Now run the Case Summaries again with "t_kids" instead of "has_kids", and find:

 

Case Summaries

time with children (own and others)

education categories          N      Mean  Std. Deviation
less than high school      4700   47.6104        94.33761
high school diploma       13223   48.9654        89.57632
some college              15465   53.9291        93.87316
college degree            12388   65.2511       101.45307
advanced degree            5796   70.0430       103.82020
Total                     51572   56.6112        96.21166

 

Now we see a steady rise: people with more education spend more time with kids.  Again we can ask if these differences are significant: is the mean for "less than high school" a big difference from the mean for "high school diploma"?

 

Again find the Z-score.  The difference is 47.6104 - 48.9654 = -1.355.  The standard error of the first is 94.33761/sqrt(4700) = 1.376; the standard error of the second is 89.57632/sqrt(13223) = 0.779.  The standard error of the difference is sqrt(1.376^2 + 0.779^2) = 1.581.  So the Z-score of the difference in time is -1.355/1.581 = -0.857.

What is now the probability, if there were actually no difference, of seeing a Z-score as large (in absolute value) as -0.857?  This is the area in the tails farther from zero than ±0.857,

[Figure: standard normal density (emetrics_m_fig4)]

which, from NORMSDIST(-0.857), has 0.196 area in the left tail and an equal area in the right, so the overall probability is 0.392: almost a 40% chance of seeing such a difference if there were actually zero difference.  So we do not reject the null hypothesis; we cannot conclude that there is a statistically significant difference.
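For anyone without Excel handy, the tail areas can be sketched in Python.  Here NormalDist().cdf from the standard library's statistics module (Python 3.8 or later) plays the role of NORMSDIST; the variable names are made up for this example and the inputs come from the t_kids Case Summaries output.

```python
from math import sqrt
from statistics import NormalDist

# time with children: less than high school vs. high school diploma
se_a = 94.33761 / sqrt(4700)
se_b = 89.57632 / sqrt(13223)
se_diff = sqrt(se_a ** 2 + se_b ** 2)

z = (47.6104 - 48.9654) / se_diff

# Area to the left of -|z| under the standard normal curve
left_tail = NormalDist().cdf(-abs(z))
# The normal curve is symmetric, so the right tail has the same area
p_two_tailed = 2 * left_tail

print(f"Z = {z:.3f}")
print(f"left tail = {left_tail:.3f}, two-tailed p-value = {p_two_tailed:.3f}")
```

The output should be close to the Z of -0.857, the left-tail area of 0.196 and the overall probability of 0.392 found above; small differences in the last digit come from rounding the intermediate values by hand.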

 

You can and should do those tests for the other classifications.  Note that you can do lots of pairwise comparisons (for example, no HS vs advanced degree), and so you might end up worrying whether this is really fair (we'll get to that; it's not quite right).  You could also use other scales, such as the difference as a percent change: a 1.4 minute difference doesn't sound big, but 1.355/47.6 is a 2.8% difference.  Or calculate that 1.4 minutes per day is about 8.25 hours per year; or, if we figure about 14% of the US population of 300m is in this category, then this could be blown up to nearly 40,000 years, which makes the statistic sound terrifying!  (A reminder about how to lie with statistics.)
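The rescaling arithmetic is easy to reproduce.  This sketch assumes a 365-day year and 24-hour days when converting units, and takes the 14% share and the 300m population as given in the text; the variable names are just for illustration.

```python
diff_minutes_per_day = 48.9654 - 47.6104   # about 1.355 minutes
base_minutes_per_day = 47.6104             # mean for "less than high school"

# Same difference, three different scales
pct_difference = diff_minutes_per_day / base_minutes_per_day
hours_per_year = diff_minutes_per_day * 365 / 60

people = 0.14 * 300_000_000                # share of US population, as in the text
total_hours = hours_per_year * people
total_years = total_hours / (24 * 365)

print(f"percent difference: {pct_difference:.1%}")
print(f"hours per person per year: {hours_per_year:.2f}")
print(f"aggregate years: {total_years:,.0f}")
```

The point is that all three numbers describe the identical 1.4 minute gap, which was not statistically significant in the first place.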

 

Then the interesting further question becomes: if higher-education households spend more time with kids, what are they doing less of – i.e. how do they manage it?  Is this less time spent doing chores (maybe hiring someone to do these)?  Alternately, is this a story of gender – do less-educated men spend less time with kids, in a more traditional gender role?  You can pursue these questions for yourself.