
Poli 101 LAB MANUAL 18

Regression with Dummy Variables

PURPOSE

  • To learn how to create dummy variables and interpret their effects in multiple regression analysis.

MAIN POINTS

Along with interval and ordinal variables, we can use nominal-level variables that are dichotomous in multiple regression analysis. In the previous lab we used a dichotomous variable for gender to define subsets of cases. We can also use dichotomous variables such as gender as independent variables in regression.

When scored as either a 0 or 1, dichotomies are often referred to as “dummy” variables. They indicate either the absence or presence of a characteristic or trait. Hence they function as a “dummy” for the variable in question.

Their most obvious use is when a variable already has two categories. Through recoding into dichotomies, however, the logic of dummy variables can be extended to enable us to include nominal or ordinal level variables with more than two categories in our multiple regressions. Examples of such variables include region, state, country, party identification, occupation, marital status or even attitudinal measures.

An Example of Dummy Variables in Multiple Regression

Consider again the hypotheses that attitudes regarding recreational marijuana use depend on ideology, age, education and political interest. To these we might add gender, partisanship and other variables. Working once again with the PPIC October 2016 data, we could develop an equation of the form:
y = a + B1(x1) + B2(x2) + B3(x3) + B4(x4) + B5(x5) + B6(x6),
where x5 = female and x6 = party id.

However, in this case we will be using dummy versions of each of the final two independent variables, yielding:

y = a + B1(x1) + B2(x2) + B3(x3) + B4(x4) + B5(x5) + B6a(x6a) + B6b(x6b), where:

x5 = Female (coding female respondents as 1 and males as 0);
x6a = Republican party id (coding Republican identifiers as 1 and everyone else as 0);
x6b = Democratic party id (coding Democratic identifiers as 1 and everyone else as 0).

You will notice in both the equation above and the syntax below that Independents (who do not identify with either Republicans or Democrats) are left out of the list.

In working with dummy variables, one category must always be omitted, leaving a group scored zero on each of the dummies against which the other categories can be compared. In the case of partisan identity, we need some respondents to be scored as neither Republican nor Democratic. If, along with each of the other partisan groupings, we were to include Independents as a predictor in the equation, its values would be perfectly (negatively) correlated with the combination of the other partisan dummy variables. This would create a situation of multicollinearity. A similar problem would occur if we entered dummies for every category of a retrospective economic assessment variable, including a "stayed the same" dummy alongside the others. So in each case we intentionally leave out at least one of the categories. The omitted category becomes the reference category against which the effects of the other categories are assessed. We can interpret the results as the difference between each category and this omitted category.

You can arbitrarily choose any category to be omitted. However, consider your options carefully and exclude the category best suited to serve as the reference point for all of the others. Typically this is the most common or largest category, but theoretical concerns should also be considered.

It is important that, with the exception of the reference category, you include each of the other categories of your variable in the regression.
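To make the recoding concrete, here is a minimal sketch of this logic for the retrospective economic assessment example mentioned above (the variable name econretro and its 1 = 'better', 2 = 'stayed the same', 3 = 'worse' coding are hypothetical, not taken from the PPIC file):

recode econretro (1=1) (else=0) into EconBetter.
recode econretro (3=1) (else=0) into EconWorse.
*no dummy is created for code 2; the 'stayed the same' respondents become the reference category*.

Entering EconBetter and EconWorse (but no "stayed the same" dummy) means each coefficient is read as the difference from respondents who say the economy stayed the same.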

Example Syntax

*Dummy Runs*.
*Weighting the Data*.
weight by weight.
*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.

*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).
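*Optional check (an added suggestion, not part of the original runs): one way to verify the alpha = .777 figure for the three items*.
reliability /variables = MJPropD MJLegalD MJTry /model = alpha.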

*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.
*Democrat5 (adapted from Lab 7)*.
if (q40c = 1) and (q40e =1) Democrat5 =0.
if (q40c = 1) and (q40e =2) Democrat5 =.25.
if (q40c = 3) Democrat5 =.5.
if (q40c =2) and (q40d =2) Democrat5 = .75.
if (q40c =2) and (q40d=1) Democrat5 =1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
recode q37 (1,2=1) (3=.5) (4,5= 0) into liberal.
value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
recode q37 (1=1) (2=.75) (3= .5 ) (4=.25) (5= 0) into liberal5.
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.

recode d1a (1=0) (2= .2) (3= .4) (4=.6) (5=.8) (6=1) into age.
value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.

*Education*.
recode d6 (1=0) (2=.25) (3=.5) (4=.75) (5=1) into educ.
value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.

*Income*.
recode d10 (1 =0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+'.

*Interest*.
recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.

*creating dummy variables for gender and party*.
recode gender (1=0) (2=1) into female.
recode gender (1=1) (2=0) into male.

recode q40c (1=1) (else = 0) into RepID.
recode q40c (2=1) (else = 0) into DemID.
recode q40c (3=1) (else = 0) into IndID.

regression variables=RawMJ3 liberal5 age educ interest female RepID DemID
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = RawMJ3
   /method = enter.

Syntax Legend

The weighting, recodes, and index construction from the ongoing example are again specified.

The original party identification measures Democrat and Democrat5 are retained for comparative purposes. These can be used in optional runs to consider issues of collinearity as well as possible advantages of dummy variables.
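A sketch of one such optional run, simply substituting the ordinal Democrat5 measure for the two party id dummies in the example regression, would be:

regression variables=RawMJ3 liberal5 age educ interest female Democrat5
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = RawMJ3
   /method = enter.

Comparing the coefficients and explained variance from this run with those from the dummy-variable run shows what, if anything, is gained by treating partisanship categorically.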

Gender and party id dummies are created through recodes.

The regression command lists the dependent and seven independent variables. Note that the reference categories for the dummy variables are not included in the regression command.

As before, the /statistics subcommand asks for the ANOVA table, the regression coefficients, the R statistics (including R Square, the proportion of explained variance), and tolerance.

The /descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the /dependent subcommand.

The /method subcommand indicates that ideology, age, education, and interest, as well as the dummies for gender and party id, should be entered together in the regression.

Example Output

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .483   .234       .228                1.01427
a. Predictors: (Constant), DemID, age, educ, female, interest, liberal5, RepID

ANOVA
Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   295.901          7     42.272        41.090   .000
  Residual     970.027          943   1.029
  Total        1265.928         950
a. Dependent Variable: RawMJ3
b. Predictors: (Constant), DemID, age, educ, female, interest, liberal5, RepID

Coefficients
Model          B        Std. Error   Beta     t        Sig.   Tolerance
1 (Constant)   1.035    .132                  7.817    .000
  liberal5     1.154    .123         .309     9.370    .000   .747
  age          -.546    .104         -.155    -5.243   .000   .925
  educ         .195     .123         .047     1.577    .115   .922
  interest     .688     .124         .166     5.537    .000   .902
  female       -.272    .067         -.118    -4.073   .000   .970
  RepID        -.479    .092         -.176    -5.214   .000   .711
  DemID        -.260    .078         -.109    -3.335   .001   .758
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)


Interpretation of Output

The Adjusted R Square figure indicates that approximately 23% of the variance in attitudes toward recreational marijuana is explained by the combination of ideology, age, education, political interest, gender, and party id. Moreover, the equation as a whole is statistically significant.
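If you want to see where that figure comes from, Adjusted R Square can be reproduced from the rest of the output as 1 - (1 - R Square)(n - 1)/(n - k - 1) = 1 - (1 - .234)(950/943) ≈ .228, using the degrees of freedom reported in the ANOVA table (n - 1 = 950 and n - k - 1 = 943, for n = 951 cases and k = 7 predictors).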

The b values for the independent variables indicate the direction and number of units (as coded) of change in the dependent variable associated with a one-unit change in each independent variable. The statistically significant results for all of the coefficients except education show that recreational marijuana attitudes depend in part on each of the other independent variables.

The results for the gender (female) dummy variable show that, controlling for the effects of ideology, age, education, interest, and party id, moving from male to female (a one-category increase in the female dummy) is associated with a .272-unit decrease in support for recreational marijuana.

The b values for the partisanship dummies are, of course, interpreted relative to their reference category, those who declare themselves to be Independents. Hence, compared with Independents, being a Republican identifier decreases support for recreational marijuana on the RawMJ3 index by .479 units, while being a Democratic identifier is associated with a decrease of .260 units. Both estimates control for all the other variables included in the equation.

We can also determine whether these two partisan effects differ significantly from each other by examining the standard errors of the unstandardized partisan coefficients. If the two coefficients differ by more than 1.96 standard errors, we can be 95% confident that the difference is not due to sampling error. In this instance the Republican and Democratic coefficients (-.479 and -.260) differ by .219, which is more than 1.96 times either standard error, so the two groups differ significantly from one another. This task can be made easier by asking SPSS to produce confidence intervals with your output. You can do this by including the "ci" keyword on the /statistics line of your regression syntax, as sketched below. However, it is useful to know how to calculate those confidence intervals yourself from the standard errors of the b values, since some of the research you read may not report confidence intervals.
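Here is a sketch of the modified regression command, identical to the earlier syntax except for the added ci keyword (which requests 95% confidence intervals for the b values):

regression variables=RawMJ3 liberal5 age educ interest female RepID DemID
   /statistics anova coeff r tol ci
   /descriptives = n
   /dependent = RawMJ3
   /method = enter.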

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

From the information in the regression output we can write the following regression equation:

RawMJ3 = 1.035 + 1.154(liberal5) – .546(age) + .195(educ) + .688(interest) – .272(female) – .479(RepID) – .260(DemID)
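To see the equation at work, consider a hypothetical respondent who is very liberal (liberal5 = 1), in the youngest age group (age = 0), college educated (educ = .5), highly interested in politics (interest = 1), female (female = 1), and Independent (RepID = 0, DemID = 0). Her predicted score is 1.035 + 1.154(1) – .546(0) + .195(.5) + .688(1) – .272(1) – .479(0) – .260(0) ≈ 2.70 on the 0-3 RawMJ3 index.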

The Beta coefficients provided in the output enable us to compare the relative effects of the independent variables in standard deviation units. Note that although all the dummies use the same (0-1) coding, Beta rescales each b by the ratio of that predictor's standard deviation to the standard deviation of the dependent variable, so the Beta scores differ from the b scores.

Instructions

    1. Perform a multiple regression analysis including a set of dummy variables built from a multi-category nominal variable.
    2. You can begin by using the syntax from the example above with the PPIC October 2016 data set, or work with another data set of your choice.
    3. Identify a multi-category nominal variable in the data set to “dummy up” and use it as a control in your analysis. Below is the syntax for ethnicity and language of interview.
    4. Include your dummies in the multiple regression equation.
    5. Examine your output to determine the influence of your dummy variables.
Additional Dummy Syntax

missing values d8com (5 thru 9).
*creating ethnicity dummies (whichever category is not dummied out serves as the reference)*.
recode d8com (3=1) (else = 0) into Hisp.
recode d8com (2=1) (else = 0) into Black.
recode d8com (1=1) (else = 0) into Asian.

*creating a language-of-interview dummy*.
recode language (0=0) (1=1) into spanish.
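One way to use these dummies as additional controls is simply to add them to the earlier equation, as in the following sketch (whichever ethnic category is not dummied out, together with non-Spanish interviews, serves as the reference group):

regression variables=RawMJ3 liberal5 age educ interest female RepID DemID Hisp Black Asian spanish
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = RawMJ3
   /method = enter.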

QUESTIONS FOR REFLECTION

  • What happens if you change the reference category for your set of dummy variables?  If, for instance, we use Republicans as the reference category instead of Independents, how do we interpret the resulting B values?  Does the model explain any more or less of the observed variance? Do our substantive conclusions about the results change in any way?
  • What happens when you convert ordinal variables such as education into a series of dummy variables?
  • What else can we do with dummies?

DISCUSSION

  • When we change the reference category, we must interpret the results as indicating differences from the new reference category. Your substantive conclusions and explained variance should be essentially the same.
  • With ordinal variables you still enter a dummy variable for all categories but one into the regression. And you still interpret your results in terms of differences from the reference category.
  • Dummy variables can be useful in exploring the non-linear effects of some independent variables in regression analysis.  For example, if you have a theory that middle-aged individuals tend to be more knowledgeable about politics than their younger or older counterparts, then an independent variable that simply measures respondents’ ages in years, from youngest to oldest, is not very helpful in a linear regression.  Such a variable will likely produce modest and insignificant age effects, even if middle-aged individuals are significantly different from others.  Instead, you can create dummy variables for several different age categories and enter these in your regression model to see whether the middle-aged are more knowledgeable than others, as sketched below.
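A minimal sketch of that approach using the PPIC age variable d1a (treating codes 3 and 4, roughly ages 35 to 54, as "middle aged" is our own assumption about where to draw the lines):

*age-category dummies for exploring non-linear age effects*.
recode d1a (1,2=1) (else=0) into AgeYoung.
recode d1a (3,4=1) (else=0) into AgeMiddle.
recode d1a (5,6=1) (else=0) into AgeOlder.

Entering any two of these three dummies (for example, AgeYoung and AgeOlder) leaves the omitted group as the reference category, so each coefficient shows how that group differs from the middle-aged.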