All the Details of SPSS

Using SPSS (PASW)

Econ 29000

Kevin R Foster, CCNY

Spring 2011

PASW and SPSS are the same program, the publisher just changed the title from SPSS to PASW. In the computer lab you'll find it as SPSS.

Why learn this program? There are many available; I find this comparison useful:

from: http://assets.doloreslabs.com/blog/oconnor_biewald_beautiful_data_final_nonlayout_20090803_20090327.pdf

SPSS is a bit harder than Excel but gives you a much wider menu of statistical analysis. You don't have to write computer programs like some of the others – you can just use drop-down menus and point and click.

You might be tempted to just use Excel; resist! Excel doesn't do many of the more complex statistical analyses that we'll be learning later in the course. Make the investment to learn a better program; it has a very good cost/benefit ratio.

1. The Absolute Beginning

Start up SPSS. On any of the computers in the Economics lab (6/150) double-click on the "SPSS" logo to start up the program. In other computer labs you might have to do a bit more hunting to find SPSS (if there's no link on the desktop, then click the "Start" button in the lower left-hand corner, and look at the list of "Programs" to find SPSS).

Sometimes double-clicking on a file that is associated with SPSS doesn't work! Same if you try to download a file and automatically start up SPSS. So start SPSS from the Start bar or desktop icon.

SPSS usually brings up a screen like the one below asking "What would you like to do?" which offers some shortcuts. Just "Cancel" this screen if it appears (later, as you get more familiar with the program, you might find those shortcuts more useful).

2. Load a SPSS Dataset

When SPSS starts, you will be in the "SPSS Data Editor" which looks like this.

Click on "File" then choose "Open" then "Data…" [not "File/Open Database" – that's different].

To open the ATUS data, download it from the class webpage onto your computer desktop. Start SPSS. Then "File \ Open \ Data..." and find "ATUS_2003-09.sav".

SPSS has two tabs (at the bottom left, in the yellow circle above) to change the way you view your data. The "Data View" tab shows the data the way it would look if it were on an Excel sheet. The "Variable View" tab shows more information on the particular variable – most importantly, the "name", "label", and "values". The Name is how SPSS refers to the variable in its menus – these names tend to be inscrutable but you can think of them as nicknames. The Label gives more details, so use the mouse to expand that column so that you can read more. Then values tells you useful information about how the variable is coded.

3. Save your Work!

After you've made changes, you don't want to lose them and have to re-do them. So save your dataset! ("File" then "Save") You might want to give it a new name each time, so that you can easily revert back to an old version if you really screw up on some day.

The computers in the lab wipe the memory clean when you log off so back up your data. Either online (email it to yourself or upload to Blackboard) or use a USB drive. Also, figure out how to "zip" your files (right-click on the data file) to save yourself some hours of up/download time…

4. Getting Basic Statistics

From either the "Data View" or "Variable View" tab, click "Analyze" then "Descriptive Statistics" then "Descriptives":

This will bring up a dialog box asking you which variables you want to get Descriptive Statistics on.

Click on the variable you want. Then click the arrow button in the center box, which will move the variables into the column labeled "Variable(s)". If you make a mistake and move the wrong variable, just highlight it in the "Variable(s)" column and use the arrow to move it back to the left.

Then click "OK" and let the computer work.

If you want a bunch that are all together in the list, click on the first variable that you want, then hold down the "Shift" key and click on the last variable -- this highlights them all. If you want a bunch that are separated, hold down the "Ctrl" key and click on the ones you want.

Later, once you're feeling confident, click on "Options" to see what's there.

5. Create New Variables, like Age-squared or Interaction Age*Dummy, or take logs or whatever

We often create new variables. One common transformation is taking the log. This is a common procedure to cut down the noise and help to examine growth trends. Click on "Transform" and then "Compute…". This will bring up a dialog box labeled "Compute Variable".

Type in the new variable name (whatever you want, just remember it!) under "Target Variable". (You can click 'Type & Label" if you want to enter more info that can remind yourself later.) For example we'll find the log (natural log) of weekly earnings.

Under "Target Variable" type in the new name, "ln_earn" or whatever and then in "Numeric Expression" you tell it what this new variable is. You can make any complicated or convoluted functions that are necessary for particular analyses; for now find the "Function Group" to click on "Arithmetic" and then in the "Functions and Special Variables" list below find "Ln". Double-click it and see that SPSS puts it up into the "Numeric Expression" box with a (?) in the argument. Double-click on the variable, weekly earnings (TRERNWA), that you want to use and then hit "OK".

You'll get a bunch of errors where the program complains about trying to find the log of zero, but it still does what you need. For wages, where many people have wage=0, we often use lnwage = ln(wage + 1) which eliminates the problem of ln(0) that returns an error; for most other people the distinction between ln(1000) and ln(1001) is tiny. You can go back and re-do your variable if you're feeling a need to be tidy.

We often need to recode using logical (Boolean) algebra, so for example to make a variable "Hispanic" you'd type "Hispanic" into the Target Variable, then click the "( )" button (see the yellow circle in the screenshot below) to get a parenthesis, double-click the variable that codes ethnicity so as to get PEHSPNON in the "Numeric Expression" and then add "=1" to finish, so getting a relationship that Hispanic is defined as: (PEHSPNON = 1). SPSS understands that whenever that relation is true, it will put in a 1; where false it will put in a 0.

There are other logical buttons (also in the yellow circle above) for putting together various logical statements. Now that the Census asks people for detailed race info (could be African-American only or Asian or bi-racial or tri-racial in various combinations – see the online note on ATUS for more details), researchers might aggregate together everyone who replies that they are all or part African-American. So maybe create AfricanAmerican as (PTDTRACE = 2) | (PTDTRACE = 6) |(PTDTRACE = 10) |(PTDTRACE = 11) |(PTDTRACE = 12) |(PTDTRACE = 15) |(PTDTRACE = 16) |(PTDTRACE = 19) . (The line up and down, |, represents the logical "OR"; the tilde, ~, is logical "NOT".)

If you wanted to create a variable for those who report themselves as African-American and Hispanic, you'd create the expression (AfricanAmerican = 1) & (PEHSPNON = 1), etc.

If we want more combinations of variables then we create those. Usually a statistical analysis spends a lot of time doing this sort of housekeeping – dull but necessary.

6. Re-Coding complicated variables (like race, education, etc) from inital data

Often we have more complicated variables so we need to be careful in considering the "Values" labels. For instance in the ATUS, as you look at the "Variable View" of your dataset, one of the first variables in the dataset has the name "PEEDUCA", which is short for "PErson EDUCation Achieved" – the person's education level. But the coding is strange: under "Values" you should see a box with "…" in it – click on that to see the whole list of values and what they mean. You'll see that a "39" means that the person graduated high school; a "43" means that they have a Bachelor's degree. Without that "Values" information you'd have no way to know that. It also means that you must do a bit of work re-coding variables before you work with the data. The variable "TEAGE" (which is the person's age) has numbers like 35, 48, 19 – just what you'd expect. These values have a natural interpretation; you don't need a codebook for this one! The variable "TESEX" tells whether the person is male or female – but it doesn't use text, it just lists either the number 1 or 2. We could guess that one of those is male and the other female, but we'd have to go back to "Variable View" to look at "Values" for "TESEX" to find that a 1 indicates a male and a 2 indicates female.

Start with "TESEX" to create, instead, a dummy variable (that takes a value of just zero or one) called "female" that is equal to one if the person is female and zero if not. To do this, click "Transform" then "Compute…" which will bring up a dialog box. The "Target Variable" is the new variable you are creating; for this case, type in "female". The "Numeric Expression" allows considerable freedom in transforming variables. For this case, we will only need a logical expression: "TESEX = 2". You can either type in the variable name, "TESEX", or find the variable name in the list on the left of the dialog box and click the arrow to insert the name.

Later you might encounter cases where you want more complicated dummy variables and want to use logical relations "and" "or" "not" (the symbols "&", "|", "~") or the ">=" or multiplication or division. But in this case, we just need "TESEX = 2" which SPSS interprets as telling it to set a value of 1 in each case where that logical expression is true, and a value of zero in each case where that expression is false. If you go to "Data View" and scroll over (new variables are all the way on the right) you can check that it looks right.

Next we'll create the racial variables. We'll create dummy variables for "white", "African-American", "American Indian/Inuit/Hawaiian/Pacific Islander", and "Asian." We'll lump together the people who give multiple identities with those who give a single one (this is standard in much empirical work, although it is evolving rapidly).

Next we create a dummy variable for "Hispanic". Again use "Transform/Compute…" and label "Target Variable" as "Hispanic" with "Numeric Expression" of "(PEHSPNON = 1)".

Next create dummy variables for education: a dummy for no high school "ed_nohs", for high school but no further "ed_hs", for some college "ed_scol", for a bachelor's degree "ed_coll", and for more than a 4-year degree "ed_adv". "Transform/Compute…", set "Target Variable" as "ed_nohs" and "Numeric Expression" as " PEEDUCA <39". Then "ed_hs" is " PEEDUCA =39"; "ed_scol" is "( PEEDUCA >39)&( PEEDUCA <43)"; "ed_coll" is " PEEDUCA =43"; "ed_adv" is " PEEDUCA >43".

Then run "Descriptive Statistics" to make sure everything looks right – your dummy variables should have min=0 and max=1, for example!

7. Data Sub-Sets

Often we want to compare groups of people within the dataset to each other, for example looking at whether men or women spend more time with their family or watching TV or whatever. Comparisons are often more useful than just raw numbers because comparisons allow us to begin to judge which differences are substantial.

Do this with "Data" then "Select Cases..." to get a screen like this:

Usually we select cases "If condition is satisfied" so choose that, then click on "If..."

This brings up a dialog box that looks like the "Compute Variable" box from above. If we have already created a dummy variable that has values of only zeroes and ones then you can just put that into the "Select Cases" box. If you want a more complicated set then you can build it up using the logical notation that we discussed above. So suppose you want to look at just the subgroup of women between the ages of 18-35. Then we would enter "(TESEX = 2) & (TEAGE > 18) & (TEAGE <= 35)". Click "Continue". Make sure the output is "Filter out unselected cases" (you don't usually want to permanently delete the unselected cases!). Then all of your subsequent analyses will be done for just that subgroup.

Often an analysis will be more concerned with whether a particular item is done rather than how long – for example, when looking at working, whether a person has a second job (so time spent working second job is greater than zero) is probably more important than just how long they spent working at this second job. So often the "if..." statement will be of the form, "X > 0" for whatever variable, X, you're considering.

Later on, we will learn some more sophisticated ways of doing it but for now this is straightforward and clear. It will allow you to do the homework assignment.

8. Example

I will do an example to make this a bit clearer. We will look at the difference in how much time male and female college students spend watching TV. (I hope that for you the answer is "zero"!)

Open the ATUS 2003-2009 dataset.

First use "Transform \ Compute ..." to create a new variable, tv_time, which we set equal to the sum of T120303, watching non-religious TV, and T120304, watching religious TV. (Should we include T120308, playing computer games?)

Use "Transform \ Compute ..." to create another variable, educ_time, which is the sum of time spent doing things relevant to education, T060101 + T060102 + T060103 + T060104 + T060199 + T060301 + T060302 + T060303 + T060399. (Time spent in class and time spent doing homework, mainly.)

I'll also create "ratio_TV_study" that is the ratio of TV_time to educ_time.

Run "Analyze \ Descriptive Statistics \ Descriptives ..." to check that these seem sensible:

Descriptive Statistics
	N	Minimum	Maximum	Mean	Std. Deviation
tv_time	98778	.00	1417.00	165.2058	168.33963
educ_time	98778	.00	1090.00	16.3008	79.47292
ratio_TV_study	5974	.00	120.00	1.0450	3.00829
Valid N (listwise)	5974

Note that the average for "educ_time" is low because most non-students will report zero time spent studying. All of those zero values returned errors when computing the ratio, so this has only 5974 reports of people with more than zero time studying.

Use "Data \ Select Cases ... " to select only college students (those for whom the 13^th variable, TESCHLVL, is equal to 2).

Now to compare men and women I will use "Data \ Split File ... " to split into two groups and compare them – the program will do this automatically for all subsequent analysis.

This Split File screen is:

Now when I run the same "Descriptives" as before, this time I get the output subdivided:

Descriptive Statistics
Edited: sex		N	Minimum	Maximum	Mean	Std. Deviation
= "Male"	tv_time	2018	.00	860.00	127.1665	138.93259
	educ_time	2018	.00	1051.00	112.6056	186.01012
	ratio_TV_study	784	.00	75.00	.8390	3.02939
	Valid N (listwise)	784
= "Female"	tv_time	3581	.00	1100.00	111.4739	124.86338
	educ_time	3581	.00	1090.00	104.8176	173.84758
	ratio_TV_study	1450	.00	120.00	.9117	4.04470
	Valid N (listwise)	1450

This shows that male college students watch an average of 127 minutes of TV per day and devote an average of 113 minutes to school; females watch 111 minutes of TV and devote 105 minutes to their studies. Men watch more TV but also spend a bit more time on school so the average ratio of time spent watching TV to time spent on school is .91 for women and .84 for men.

Finally I'll show a graph,

Note that there are quite a number of respondents who spent zero time studying or zero time watching TV. We would expect a downward relation since it is like a budget set: the more time is spent watching TV, the less is available to do anything else.

To get this graph, choose "Graphs \ Chart Builder ..." and drag the elements to where you want them, like this,

This is the first type of "Scatter/Dot" graph.

For this graph I removed the split, since it didn't look like there were significant differences between men and women in that regard – the same "Data \ Split File ... " but now "Analyze all cases."

I can create a histogram of the ratio of time spent watching TV to time spent studying,

But this isn't much use since it's dominated by the few extreme values of people who spent 100 or more times as many minutes in TV as studying. So this histogram,

plots only those with a ratio less than 2.

(To make this chart, I used "Graphs \ Chart Builder ..." and then chose "Bar." When you put in just one variable on the x-axis it assumes you want a Histogram.)

Now you can go on to do your own analysis, maybe by race/ethnicity? Or go back and add in video game playing? Of the people who didn't watch TV, were there a larger fraction of men or women?

Some Shortcuts

You can use "Analyze \ Descriptive Statistics \ Explore..." which asks you to put in the "Dependent List" which are the variables, whose means you want to find, and then the "Factor List" which defines categories, by which the subgroup means are found. So, for example, if you wanted to look at the time sleeping, depending on whether there are kids in the house, you could put "Time Sleeping" into the "Dependent List" and then "Presence of Household Children" into the "Factor List".

You can get fancier if you create your own factors – suppose you wanted to look at time sleeping for African-American, Hispanic, Asian, and whites at 5 levels of education each (without highschool diploma, with just diploma, with some college, with 4-year degree, with advanced degree) – for a total of 4 x 5 = 20 different categories. So create a new variable that takes the values 1 through 20 and carefully code it up for each of those categories. Then put that into "Factor" in "Explore" and let the machine do your work.

SPSS also has "Analyze \ Compare Means" but we won't get to that yet (although you're welcome to explore it on your own!).