UCSC Lab 18

Poli 101 LAB MANUAL 18

Regressions with Dummy Variables

PURPOSE

  • To learn how to create dummy variables and interpret their effects in multiple regression analysis.

MAIN POINTS

Along with interval and ordinal variables, we can use dichotomous nominal-level variables, such as gender, in multiple regression analysis. In previous labs we used a dichotomous gender variable to define subsets of cases. Dichotomous variables can also serve as independent variables in regression.

When scored as either a 0 or 1, dichotomies are often referred to as “dummy” variables. They indicate either the absence or presence of a characteristic or trait. Hence they function as a “dummy” for the variable in question.

Their most obvious use is when a variable either already has or can be recoded into two categories. With this in mind, the logic of dummy variables can be extended to enable us to include nominal level variables with more than two categories in our multiple regressions. Examples of such variables include region, province, country, party identification, occupation, marital status or even attitudinal measures.

An Example of Dummy Variables in Multiple Regression

Consider again the hypothesis that egalitarian attitudes depend on gender, personal finances and partisanship. Working once again with ANES 2012 data, we could develop an equation of the form: y = a + b1(x1) + b2(x2) + b3(x3), where x1 = female, x2 = party id, x3 = economic evaluations.

However, in this case we will use dummy versions of each of the three independent variables, yielding:

y = a + b1(x1) + b2(x2a) + b3(x2b) + b4(x3a) + b5(x3b), where:

x1 = Female (female respondents coded 1, males coded 0);
x2a = Republican party id (Republican identifiers coded 1, everyone else coded 0);
x2b = Democratic party id (Democratic identifiers coded 1, everyone else coded 0);
x3a = Worsened Retrospective Economic Assessment (respondents who say the economy has gotten worse coded 1, all other respondents coded 0);
x3b = Improved Retrospective Economic Assessment (respondents who say the economy has gotten better coded 1, all other respondents coded 0);

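You will notice in both the equation and the syntax below that Independents and those who view the economy as unchanged have been left out of the regression. Dummies for these two groups are created in the syntax, but they are not entered in the regression command.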

In working with dummy variables, at least one category must always be omitted, leaving a group scored zero with which to compare each of the other categories. In the case of partisan identity, we need some respondents to be scored as neither Republican nor Democratic. If, along with the other partisan groupings, we were to include Independents as a predictor in the equation, its values would be perfectly predicted by the combination of the other partisan dummy variables: anyone who is not a Republican or a Democrat must be an Independent. This would create a situation of perfect multicollinearity, and the separate effects could not be estimated. A similar problem would occur if we were to enter a “stayed the same” dummy from the retrospective economic assessment variable. So in each case we intentionally leave out at least one of the categories. The omitted category becomes the reference category against which the effects of the other categories are assessed. We can interpret the results as the difference between each category and this omitted category.

You can arbitrarily choose any category to be omitted. However, carefully consider your options and exclude whichever category is best suited to be the reference point for all the other categories. Typically, this is the most common or largest category, but theoretical concerns should also be considered.

It is important that, with the exception of the reference category, you include each of the other categories of your variable in the regression.
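
The collinearity problem described above can be checked directly. The following is a minimal sketch, not part of the lab's own syntax: Dem2, Rep2, Indep2 and PartySum are hypothetical variable names, and the coding of pid_self (1 = Democrat, 2 = Republican, 3 = Independent) is taken from the example syntax in this lab. The (missing=sysmis) piece is included because ELSE on its own in an SPSS recode also matches missing values.

```spss
* Sketch only: Dem2, Rep2, Indep2 and PartySum are hypothetical names.
* The missing=sysmis piece keeps respondents without a valid party id out of all three dummies.
missing values pid_self (-9 thru 0, 5).
recode pid_self (1=1) (missing=sysmis) (else=0) into Dem2.
recode pid_self (2=1) (missing=sysmis) (else=0) into Rep2.
recode pid_self (3=1) (missing=sysmis) (else=0) into Indep2.

* Add the three dummies together and inspect the result.
compute PartySum = Dem2 + Rep2 + Indep2.
fre var PartySum.
```

Assuming 1, 2 and 3 are the only valid party id codes, the frequency table should show PartySum equal to 1 for every respondent with a valid party id. That is the collinearity in concrete form: once two of the dummies are known, the third is fixed (Indep2 = 1 - Dem2 - Rep2), so it adds no new information to the equation.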

Example Syntax

weight by weight_full.

*Coding the DV's indicators*.
missing values cses_govtact (-9 thru -6).
recode cses_govtact (1=1) (2=.75) (3= .5) (4= .25) (5=0) into eceq1.

missing values ineqinc_ineqreduc (-9 thru -6).
recode ineqinc_ineqreduc (1=1) (2=0) (3= .5) into eceq3.

missing values guarpr_self (-9 thru -2).
recode guarpr_self (1=1) (2=.832) (3= .666) (4= .5) (5= .332)
   (6= .166) (7=0) into eceq5.

*Constructing the DV's Index*.
compute RawEqIndex = eceq1 + eceq3 + eceq5.

*Constructing Dummy Variables*.
*Gender dummy*.
recode gender_respondent (1=0) (2=1) into female.

*Party dummies*.
missing values pid_self (-9 thru 0, 5).
fre var pid_self.
recode pid_self (1=1) (else=0) into Dem.
recode pid_self (3=1) (else=0) into Indep.
recode pid_self (2=1) (else=0) into Rep.
fre var dem Indep rep.

*Econ-past*.
missing values econ_ecpast_x (-9 thru -1).
recode econ_ecpast_x (4,5 = 1) (else = 0) into retroworse.
recode econ_ecpast_x (3 =1) (else = 0) into retrosame.
recode econ_ecpast_x (1,2=1) (else = 0) into retrobetter.

regression variables = RawEqIndex female Dem Rep retroworse retrobetter
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = RawEqIndex
   /method = enter.

Syntax Legend

Missing values and recodes from the ongoing example are again specified.

Gender, party id and retrospective economic assessment dummies are created through recodes.

The regression command lists the dependent and five independent variables. Note that the reference categories are not included in the regression command.

As before, the /statistics subcommand asks for the ANOVA table, regression coefficients, explained variance (R and R square), and tolerance.

The /descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the /dependent subcommand.

The /method subcommand indicates that the gender, party id and economic assessment dummies should be entered together in the regression.

Example Output

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .487a   .238       .237                .73686

a. Predictors: (Constant), retrobetter, female, Rep, Dem, retroworse

ANOVAa

Model           Sum of Squares   df     Mean Square   F         Sig.
1  Regression   853.021          5      170.604       314.206   .000b
   Residual     2737.232         5041   .543
   Total        3590.253         5046

a. Dependent Variable: RawEqIndex
b. Predictors: (Constant), retrobetter, female, Rep, Dem, retroworse

Coefficients

                  Unstandardized Coefficients   Standardized Coefficients
Model             B        Std. Error           Beta                        t         Sig.   Tolerance
1  (Constant)     1.313    .024                                             55.000    .000
   female         .059     .021                 .035                        2.827     .005   .989
   Dem            .374     .025                 .211                        14.789    .000   .744
   Rep            -.433    .027                 -.230                       -16.338   .000   .762
   retroworse     -.213    .025                 -.121                       -8.466    .000   .737
   retrobetter    .188     .027                 .102                        7.014     .000   .720

Interpretation of Output

The Adjusted R-square figure indicates that approximately 24% of the variance in attitudes toward economic equality is explained by the combination of gender, party id and economic assessments. The equation as a whole is also statistically significant (F = 314.206, Sig. = .000).

The b values for gender, party id and economic assessment indicate the direction and number of units (as coded) of change in the dependent variable associated with a one-unit change in each independent variable. The statistically significant coefficients for all five dummies show that economic equality attitudes depend in part on each of the independent variables. Controlling for the effects of economic assessment and party id, a one-unit increase in the female dummy (moving from male to female) produces a .059 unit increase in support for economic equality.

The multiple regression results also indicate that, compared with the reference category of unchanged economic assessment, viewing the economy as worse is associated with a decrease of .213 units in support for economic equality. In contrast, perceiving the economy as improved is associated with an increase in support for economic equality of .188 units. Both estimates control for gender and party id, as these variables are included in the equation.

The b values for the partisanship dummies are, of course, interpreted relative to their reference category, those who declare themselves to be Independents. The coefficient for Republicans (-.433) indicates that, compared with Independents, Republicans score nearly half a unit lower on the economic equality index. Democrats, in contrast, score significantly higher (.374 units) on the economic equality index than do Independents.

We can also determine whether the partisan groups differ significantly from one another by examining the standard errors of the unstandardized partisan coefficients. Build a 95% confidence interval around each coefficient (b +/- 1.96 standard errors). If the intervals for a pair of partisan dummy coefficients do not overlap, we can be 95% confident that the difference between them is not due to sampling error. This check is conservative: intervals that overlap slightly do not by themselves rule out a significant difference. In this instance, the Republicans and Democrats differ significantly from one another. This is evident in that their unstandardized coefficients are substantially more than 1.96 standard errors apart. This task can be made easier by asking SPSS to produce confidence intervals with your output. You can do this by including the “ci” keyword in the “/statistics” line of your regression syntax. However, it is useful to know how to calculate those confidence intervals yourself from the standard errors of the b values, since some of the research you read may not report confidence intervals.
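
As a sketch of the option just described, the regression command from the example can be rerun with the ci keyword added to the /statistics line. Everything else is copied from the example syntax, and the dummies are assumed to have been created already.

```spss
* Sketch: the example regression with confidence intervals requested.
* Assumes female, Dem, Rep, retroworse and retrobetter already exist.
regression variables = RawEqIndex female Dem Rep retroworse retrobetter
   /statistics anova coeff r tol ci
   /descriptives = n
   /dependent = RawEqIndex
   /method = enter.
```

The same intervals can be worked out by hand from the Example Output. For Democrats, .374 +/- 1.96(.025) = .374 +/- .049, or roughly .325 to .423. For Republicans, -.433 +/- 1.96(.027) = -.433 +/- .053, or roughly -.486 to -.380. The two intervals are nowhere near overlapping, which is the basis for saying the two groups differ significantly.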

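The constant (or y-intercept) indicates the value of ‘a’ in the regression equation. Because every independent variable here is a dummy, the constant (1.313) is the predicted score for respondents coded 0 on all of them: males in the reference categories for party id and economic assessment.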

From the information in the regression output we can write the following regression equation:

EconEquality = 1.31 + .059(female) + .374(Dem) – .433(Repub) – .213(Worse) + .188(Better).
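
To see what the equation implies for particular kinds of respondents, the coefficients can be applied to the dummies directly. This is an illustrative sketch only: PredEq is a hypothetical variable name, the coefficients are typed in from the Example Output, and the dummies are assumed to exist from the example syntax.

```spss
* Sketch only: PredEq is a hypothetical variable name.
* Coefficients are copied by hand from the Example Output.
compute PredEq = 1.313 + .059*female + .374*Dem - .433*Rep - .213*retroworse +
   .188*retrobetter.
execute.

* Each distinct value corresponds to one combination of dummy scores.
fre var PredEq.
```

For example, a male Independent who sees the economy as unchanged scores 0 on every dummy, so his predicted value is the constant, 1.313. A female Democrat who sees the economy as improved is predicted at 1.313 + .059 + .374 + .188 = 1.934, while a male Republican who sees the economy as worse is predicted at 1.313 - .433 - .213 = .667.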

Beta coefficients provided in the output enable us to estimate the relative effects of each independent variable in standard deviation units. Note that although all the dummies use the same (0-1) measurement, differences in their standard deviations (that is, in how evenly cases are split between 0 and 1) produce Beta scores that are not simply proportional to their b scores.
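
A Beta is the b value rescaled by standard deviations: Beta = b x (standard deviation of the independent variable / standard deviation of the dependent variable). As a sketch, the standard deviations needed for this calculation can be requested by adding the mean and stddev keywords to the /descriptives line of the example regression. The rest of the command is unchanged.

```spss
* Sketch: the example regression with means and standard deviations added.
* Assumes female, Dem, Rep, retroworse and retrobetter already exist.
regression variables = RawEqIndex female Dem Rep retroworse retrobetter
   /statistics anova coeff r tol
   /descriptives = mean stddev n
   /dependent = RawEqIndex
   /method = enter.
```

As a rough check against the Example Output, the ANOVA table implies a standard deviation of about .84 for RawEqIndex (the square root of 3590.253 / 5046). A dummy that splits the sample roughly in half has a standard deviation of about .50, so for the female dummy .059 x (.50 / .84) is about .035, which matches the Beta reported in the output.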

Instructions

  1. Perform a multiple regression analysis including a set of dummy variables built from a multi-category nominal variable.
  2. You can begin by using the syntax from the above example working with the ANES 2012 data set or work with another data set of your choice.
  3. Identify a multi-category nominal variable in the data set to “dummy up” and use it as a control in your analysis.
  4. Include your dummies in the multiple regression equation.
  5. Examine your output to determine the influence of your dummy variables.

QUESTIONS FOR REFLECTION

  • What happens if you change the reference category for your set of dummy variables?  If, for instance, we use Republicans as the reference category instead of Independents, how do we interpret the resulting B values?  Does the model explain any more or less of the observed variance? Do our substantive conclusions about the results change in any way?
  • What happens when you convert ordinal variables such as education into a series of dummy variables?
  • What else can we do with dummies?
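
To explore the first question, one possible approach is to rerun the example regression with Republicans as the omitted category. The sketch below simply swaps Rep for Indep in the variable list. Indep is created in the example syntax but was not entered in the original model.

```spss
* Sketch: Republicans as the reference category for party id.
* Indep replaces Rep in the variable list, and Rep is now the omitted group.
regression variables = RawEqIndex female Dem Indep retroworse retrobetter
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = RawEqIndex
   /method = enter.
```

If every respondent in the model falls into exactly one of the three party groups, the new coefficients follow from the original ones: Dem should be about .374 + .433 = .807 (Democrats compared with Republicans) and Indep about .433 (Independents compared with Republicans). Because the example recodes use (else=0), respondents with missing party id are also scored 0 on these dummies, so the actual figures may differ somewhat.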

DISCUSSION

  • When we change the reference category, we must interpret the results as indicating differences from the new reference category. Your substantive conclusions and explained variance should be essentially the same.
  • With ordinal variables you still enter a dummy variable for all categories but one into the regression. And you still interpret your results in terms of differences from the reference category.
  • Dummy variables can be useful in exploring the non-linear effects of some independent variables in regression analysis.  For example, if you have a theory that middle-aged individuals tend to be more knowledgeable about politics than their younger or older counterparts, then an independent variable that simply measures respondents’ ages in years, from youngest to oldest, is not very helpful in a linear regression.  Such a variable will likely produce modest and insignificant age effects, even if middle-aged individuals are significantly different from others.  Instead, you can create dummy variables for several different age categories and enter these in your regression model to see whether the middle-aged are more knowledgeable than others.
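
The age example in the last point might be set up along the following lines. This is a sketch under stated assumptions: age (in whole years) and polknow (a political knowledge score) are hypothetical variable names, and the cut points at 35 and 65 are illustrative rather than taken from this lab.

```spss
* Sketch only: age and polknow are hypothetical variable names.
* The cut points 35 and 65 are illustrative choices.
recode age (18 thru 34=1) (35 thru hi=0) into young.
recode age (35 thru 64=1) (18 thru 34, 65 thru hi=0) into middle.
recode age (65 thru hi=1) (18 thru 64=0) into older.
fre var young middle older.

* The middle-aged group is omitted, so it serves as the reference category.
regression variables = polknow young older
   /statistics anova coeff r tol
   /dependent = polknow
   /method = enter.
```

With the middle-aged group as the reference category, negative and significant coefficients for both young and older would be consistent with the theory that the middle-aged are the most knowledgeable.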