Lab 18


Regressions with Dummy Variables


  • To learn how to create dummy variables and interpret their effects in multiple regression analysis.


Along with interval and ordinal variables, we can use nominal level variables that are dichotomous, such as gender, in multiple regression analysis. In the previous labs we used a dichotomous variable for gender to define subsets of cases. We can also use dichotomous variables as independent variables in regression.

When scored as either a 0 or 1, dichotomies are often referred to as “dummy” variables. They indicate either the absence or presence of a characteristic or trait. Hence they function as a “dummy” for the variable in question.

Their most obvious use is when a variable either already has or can be recoded into two categories. With this in mind, the logic of dummy variables can be extended to enable us to include nominal level variables with more than two categories in our multiple regressions. Examples of such variables include region, province, country, party identification, occupation, marital status or even attitudinal measures.


An Example of Dummy Variables in Multiple Regression

Consider again the hypotheses that egalitarian attitudes depend on gender, personal finances and partisanship.  Working once again with CES2011 data, we could develop an equation of the form: y = a + b(x1) + b(x2) + b(x3), where x1 = female, x2 = personal finances, x3 = partisanship.

However, in this case we will be using dummy versions of each of the three independent variables, yielding:

y = a + b(x1) + b(x2a) + b(x2b) + b(x3a) + b(x3b) + b(x3c), where:

x1 = Female (by coding female respondents as 1 and males as 0);
x2a = Worsened financial status (by coding respondents whose finances worsened as 1 and all others as 0);
x2b = Improved financial status (by coding respondents whose finances improved as 1 and all others as 0);
x3a = Conservative partisanship (by coding Conservative partisans as 1 and everyone else as 0);
x3b = Bloc Quebecois partisanship (by coding BQ partisans as 1 and everyone else as 0);
x3c = NDP partisanship (by coding NDP partisans as 1 and everyone else as 0).

You will notice in both the equation and the syntax below that Liberal partisans and those whose financial circumstances have not changed have been left out of the list.

In working with dummy variables, at least one category must always be omitted, leaving something with the value of zero with which to compare each of the other categories. In the case of partisan identity, we need some respondents to be scored as not Conservative, nor BQ, nor NDP. If, along with each of the other partisan groupings, we were to include Liberal partisanship as a predictor in the equation, its values would be perfectly correlated (negatively) with the combination of the other partisan dummy variables. This would create a situation of perfect multicollinearity, in which separate coefficients for the dummies cannot be estimated. A similar problem would occur if we were to enter a “stayed the same” dummy from the personal finances variable. So in each case we intentionally leave out at least one of the categories. The omitted category becomes the reference category against which the effects of the other categories are assessed. We can interpret the results as the difference between each category and this omitted category.
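The redundancy described above can be seen in a few lines of Python. This is only an illustrative sketch: the eight respondents are invented, and it assumes that every respondent falls into exactly one of the four party categories.

```python
# Hypothetical party identification for eight respondents (not CES data).
# Assumption: every respondent falls into exactly one of four categories.
pid = ["Cons", "Lib", "BQ", "NDP", "Lib", "Cons", "NDP", "Lib"]

def dummy(values, category):
    """Return a 0/1 list: 1 where the value equals the category, else 0."""
    return [1 if v == category else 0 for v in values]

cons = dummy(pid, "Cons")
lib = dummy(pid, "Lib")
bq = dummy(pid, "BQ")
ndp = dummy(pid, "NDP")

# With all four dummies present, each row sums to 1, duplicating the constant.
row_sums = [c + l + b + n for c, l, b, n in zip(cons, lib, bq, ndp)]

# Equivalently, the Liberal dummy is fully determined by the other three.
lib_from_others = [1 - (c + b + n) for c, b, n in zip(cons, bq, ndp)]

print(row_sums)
print(lib == lib_from_others)
```

Because the fourth dummy carries no information that the other three and the constant do not already contain, one category has to be left out.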

You can choose any category to be omitted. However, carefully consider your options and omit the category that makes the most sensible baseline against which to compare all the other categories. Typically, this is the most common or largest category, but theoretical concerns should also be considered.

It is important that, with the exception of the reference category, you include each of the other categories of your variable in the regression.
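As a rough sketch of this rule, the Python function below builds one 0/1 indicator for every category except the chosen reference. The function name make_dummies and the six sample responses are hypothetical and are not part of the lab's SPSS syntax.

```python
def make_dummies(values, reference):
    """Build one 0/1 indicator per category, skipping the reference category.

    Missing values (None) stay missing rather than being scored 0.
    """
    categories = sorted({v for v in values if v is not None and v != reference})
    return {
        cat: [None if v is None else int(v == cat) for v in values]
        for cat in categories
    }

# Hypothetical responses to the personal finances question.
finances = ["worse", "same", "better", None, "same", "worse"]
dummies = make_dummies(finances, reference="same")

for name, column in dummies.items():
    print(name, column)
```

Leaving missing responses as missing is a design choice of this sketch. A recode of the form (else = 0) would instead score them 0, which places them in the reference group.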

Example Syntax

weight by WGTSamp.
*Preparing indicators of Egalitarianism (Attitudes re Inequality)*.
*declare missing values on pes11_41*.
missing values pes11_41 (8,9).
*reverse scoring on pes11_41 and make it range from 0-1*.
recode pes11_41 (1=1) (2=.75) (3=.5) (4= .25)
   (5=0) into undogap.
value labels undogap 0 'muchless' .25 'someless' .5 'asnow'
   .75 'somemore' 1 'muchmore'.
*rescale mbs11_k2 from 0-10 to 0-1 and reverse its scoring*.
missing values mbs11_k2 (-99).
compute govact = (((mbs11_k2 * -1) +10)/10).
value labels govact 0 'not act' 1 'gov act'.
*recode and re-label mbs11_b3 and pes11_52b*.
recode mbs11_b3 (1=1) (2=0) into goveqch.
value labels goveqch 1 'decent living' 0 'leave alone'.
*create an indexed variable (alpha=.66).
compute rawegal = undogap + govact + goveqch.
*create gender indicator from Lab6*.
recode rgender11 (1=0) (5=1) into female.

*create finance measures (from Lab 7).
missing values cps11_66 (8,9).
recode cps11_66 (1=1) (3=0) (5=.5) into finances.
variable labels finances 'personal finances'.
value labels finances 0 'worse' .5 'same' 1 'better'.

*create dummies for Finances*.
recode finances (0 = 1) (else = 0) into worse.
recode finances (.5 =1) (else = 0) into same.
recode finances (1=1) (else = 0) into better.

*create party identification measure (from Labs 6 &13)*.
*These replace an interval measure of partisan feeling ConFeel*.
recode cps11_71 (2=1) (1=2) (4=3) (3=4) into PID4.
value labels PID4 1 'Cons' 2 'Lib' 3 'BQ' 4 'NDP'.

*create dummies for PID*.
recode PID4 (1=1) (else=0) into Cons.
recode PID4 (2=1) (else=0) into Lib.
recode PID4 (3=1) (else=0) into BQ.
recode PID4 (4=1) (else=0) into NDP.

*multiple regression with dummies*.
regression variables = rawegal female worse better Cons BQ NDP
   /statistics coeff r tol anova
   /descriptives = n
   /dependent = rawegal
   /method = enter.

Syntax Legend

Missing values and recodes from the ongoing example are again specified.

Gender, personal finance and partisan dummies are created through recodes.

The regression command lists the dependent and six independent variables. Note that the reference categories are not included in the regression command.

As before, the /statistics subcommand asks for regression coefficients, explained variance (r), tolerance, and the analysis of variance (anova) table.

The /descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the /dependent subcommand.

The /method subcommand indicates that gender, the two personal finance dummies, and the three partisan dummies should be entered together in the regression.

Example Output


Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .436a   .190       .184                .61792
a. Predictors: (Constant), NDP, better, female, BQ, worse, Cons



ANOVA
Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   75.493           6     12.582        32.953   .000b
  Residual     322.342          844   .382
  Total        397.835          850
a. Dependent Variable: rawegal
b. Predictors: (Constant), NDP, better, female, BQ, worse, Cons



Coefficients
               Unstandardized         Standardized
               Coefficients           Coefficients
Model          B       Std. Error     Beta           t        Sig.   Tolerance
1 (Constant)   2.196   .041                          53.053   .000
  female       .142    .043           .104           3.330    .001   .992
  worse        .153    .054           .090           2.823    .005   .934
  better       -.234   .060           -.125          -3.893   .000   .935
  Cons         -.464   .051           -.304          -9.166   .000   .873
  BQ           .164    .079           .067           2.082    .038   .935
  NDP          .168    .067           .082           2.520    .012   .915



Interpretation of Output

The Adjusted R-square figure indicates that approximately 18% of the variance in egalitarian attitudes is explained by the combination of gender, personal finance and partisanship. The equation as a whole is also statistically significant (F = 32.953, Sig. = .000).
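The model-level figures can be checked by hand from the sums of squares and degrees of freedom in the ANOVA table. The short Python sketch below does that arithmetic. The variable names are illustrative, and the results should agree with the SPSS output only up to rounding, since the table values are themselves rounded.

```python
# Figures copied from the ANOVA table in the example output.
ss_regression = 75.493
ss_residual = 322.342
ss_total = 397.835
df_regression = 6
df_residual = 844
df_total = 850

# R-square: share of the total variation accounted for by the predictors.
r_square = ss_regression / ss_total

# Adjusted R-square: the same idea, penalized for the number of predictors.
adj_r_square = 1 - (ss_residual / df_residual) / (ss_total / df_total)

# F: mean square for the regression divided by mean square for the residual.
f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(r_square, 3), round(adj_r_square, 3), round(f_ratio, 2))
```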

The b values for gender, personal finances and partisan preference indicate the direction and number of units (as coded) of change in the dependent variable associated with a one-unit change in each independent variable. All of the coefficients are statistically significant, which shows that egalitarianism depends in part on each of the independent variables. Controlling for the effects of personal finance and partisanship, a one-category increase in the female variable (moving from male to female) produces a .142 unit increase in egalitarianism.

The multiple regression results also indicate that, compared with the reference category of unchanged personal finances, worsened finances are associated with an increase of .153 units in egalitarian attitudes. In contrast, improved personal finances are associated with a decrease in egalitarianism of nearly one-quarter of a unit (-.234). Both estimates control for gender and partisanship, as these are included in the equation.

The b values for the partisanship dummies are all, of course, interpreted relative to their reference category, Liberal partisan preference. The coefficient for Conservatives (-.464) indicates that, compared with Liberals, Conservatives score about one half unit lower on egalitarianism. BQ and NDP partisans, in contrast, rank significantly higher in egalitarianism than do Liberals.

We should also note that because Liberal is the reference group for each of the other partisan dummies, we can directly compare the coefficients for the partisan groups to one another: BQ and NDP partisans, for example, are both more egalitarian than Conservatives.

We can also determine whether each of these partisan differences is statistically significant by examining the standard errors of the unstandardized partisan coefficients.  If the intervals formed by taking each coefficient +/- 1.96 standard errors do not overlap for a pair of partisan dummies, we can be 95% confident that the difference between them is not due to sampling error.  In this instance, although both BQ and NDP partisans are more egalitarian than Conservatives (and Liberals), they do not differ significantly from one another. This is evident in that their unstandardized coefficients (.164 and .168) are well within +/- 1.96 standard errors of one another.  This task can be made easier by asking SPSS to produce confidence intervals with your output.  You can do this by adding the “ci” keyword to the “/statistics” line of your regression syntax.  However, it is useful to know how to calculate those confidence intervals yourself from the standard errors of the b values, since some of the research you read may not report confidence intervals.
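The hand calculation can be sketched in Python using the b values and standard errors from the example output. The helper names confidence_interval and intervals_overlap are made up for this illustration, and the overlap check is the rough rule of thumb described above rather than a formal test of the difference between two coefficients.

```python
# Unstandardized coefficients (b) and standard errors from the example output.
coefficients = {
    "Cons": (-0.464, 0.051),
    "BQ": (0.164, 0.079),
    "NDP": (0.168, 0.067),
}

def confidence_interval(b, se, z=1.96):
    """Return the (lower, upper) bounds of b +/- z standard errors."""
    return (b - z * se, b + z * se)

def intervals_overlap(first, second):
    """True when two (lower, upper) intervals share at least one value."""
    return first[0] <= second[1] and second[0] <= first[1]

intervals = {
    name: confidence_interval(b, se) for name, (b, se) in coefficients.items()
}

for name, (lower, upper) in intervals.items():
    print(f"{name}: {lower:.3f} to {upper:.3f}")

# BQ and NDP intervals overlap; neither overlaps the Conservative interval.
print(intervals_overlap(intervals["BQ"], intervals["NDP"]))
print(intervals_overlap(intervals["Cons"], intervals["BQ"]))
```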

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

From the information in the regression output we can write the following regression equation:

egalitarianism = 2.196 + .142(female) + .153(worse) - .234(better)
- .464(Conservative) + .164(BQ) + .168(NDP).
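To see how the equation is used, the Python sketch below plugs 0 and 1 values into it for a few respondent profiles. The coefficients are taken from the Coefficients table in the example output. The predict function and the three profiles are hypothetical illustrations.

```python
# Constant and b values from the Coefficients table in the example output.
CONSTANT = 2.196
B = {"female": 0.142, "worse": 0.153, "better": -0.234,
     "Cons": -0.464, "BQ": 0.164, "NDP": 0.168}

def predict(**dummies):
    """Predicted egalitarianism; any dummy not named is treated as 0."""
    return CONSTANT + sum(B[name] * value for name, value in dummies.items())

# Reference profile: every dummy is 0, so the prediction is the constant.
reference_case = predict()

# A female NDP partisan whose finances worsened.
female_ndp_worse = predict(female=1, worse=1, NDP=1)

# A male Conservative partisan whose finances improved.
male_cons_better = predict(better=1, Cons=1)

print(round(reference_case, 3))
print(round(female_ndp_worse, 3))
print(round(male_cons_better, 3))
```

Each dummy simply switches its coefficient on or off, so a predicted score is the constant plus the b values of whichever categories apply.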

Beta coefficients provided in the output enable us to estimate the relative effects of each independent variable in standard deviation units. Note that although all the dummies use the same (0-1) measurement, differences in their standard deviations (which depend on how many respondents fall into each category) produce Beta scores whose relative sizes differ from those of their b scores.
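A Beta coefficient is the b value multiplied by the ratio of the independent variable's standard deviation to the dependent variable's standard deviation. The Python sketch below applies this to the female dummy. The standard deviation of rawegal is recovered from the ANOVA table. The even split between women and men is an assumption made only for illustration, since the output does not report it.

```python
import math

# Total sum of squares and df from the ANOVA table give the SD of rawegal.
sd_y = math.sqrt(397.835 / 850)

# Assumption for illustration: about half the sample is female, so the
# standard deviation of the 0/1 female dummy is sqrt(.5 * .5) = .5.
p_female = 0.5
sd_female = math.sqrt(p_female * (1 - p_female))

# Beta = b * (SD of the independent variable / SD of the dependent variable).
b_female = 0.142
beta_female = b_female * sd_female / sd_y

print(round(beta_female, 3))
```

Under that assumption the result comes out close to the Beta of .104 reported for female. A dummy with a more lopsided split has a smaller standard deviation and therefore a smaller Beta for the same b.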


  1. Perform a multiple regression analysis including a set of dummy variables built from a multi-category nominal variable.
  2. You can begin by using the syntax from the above example working with the CES 2011 data set or work with another data set of your choice.
  3. Identify a multi-category nominal variable in the data set to “dummy up” and use it as a control in your analysis.
  4. Include your dummies in the multiple regression equation.
  5. Examine your output to determine the influence of your dummy variables.


  • What happens if you change the reference category for your set of dummy variables?  If, for instance, we use Conservative as the reference category instead of Liberal, how do we interpret the resulting B values?  Does the model explain any more or less of the observed variance? Do our substantive conclusions about the results change in any way?
  • What happens when you convert ordinal variables such as education into a series of dummy variables?
  • What else can we do with dummies?


  • When we change the reference category, we must interpret the results as indicating differences from the new reference category. Your substantive conclusions and explained variance should be essentially the same.
  • With ordinal variables you still enter a dummy variable for all categories but one into the regression. And you still interpret your results in terms of each category's difference from the reference category.
  • Dummy variables can be useful in exploring the non-linear effects of some independent variables in regression analysis.  For example, if you have a theory that middle-aged individuals tend to be more knowledgeable about politics than their younger or older counterparts, then an independent variable that simply measures respondents’ ages in years, from youngest to oldest, is not very helpful in a linear regression.  Such a variable will likely produce modest and insignificant age effects, even if middle-aged individuals are significantly different from others.  Instead, you can create dummy variables for several different age categories and enter these in your regression model to see whether the middle-aged are more knowledgeable than others.
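Returning to the first point above, the effect of switching the reference category can be sketched with simple arithmetic on the example coefficients. The Python below assumes that the four party groups are exhaustive, so that omitting Conservative instead of Liberal only re-expresses the same model. With real data that include other or missing partisans, the re-estimated values could differ somewhat.

```python
# b values with Liberal as the reference category (from the example output).
constant_lib_ref = 2.196
b_lib_ref = {"Cons": -0.464, "BQ": 0.164, "NDP": 0.168}

# Assumption: the four party groups are exhaustive, so re-running the model
# with Conservative omitted only shifts every party effect by the Cons value.
shift = b_lib_ref["Cons"]
constant_cons_ref = constant_lib_ref + shift
b_cons_ref = {"Lib": 0.0 - shift}
for party in ("BQ", "NDP"):
    b_cons_ref[party] = b_lib_ref[party] - shift

# Predicted score for an NDP partisan (other dummies at 0) under each coding.
ndp_lib_ref = constant_lib_ref + b_lib_ref["NDP"]
ndp_cons_ref = constant_cons_ref + b_cons_ref["NDP"]

print({party: round(b, 3) for party, b in b_cons_ref.items()})
print(round(ndp_lib_ref, 3), round(ndp_cons_ref, 3))
```

The coefficients change because they now measure distance from Conservatives, but the predicted score for any given group, and hence the explained variance, stays the same.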