
Poli 101

LAB 17

Multiple Regression


Objectives

  • To learn how to perform and interpret multiple regression analysis.
  • To understand the meaning and use of Beta.
  • To learn about different regression procedures.


Multiple Regression

  • Regression analysis can be performed with more than one independent variable. Regression involving two or more independent variables is called multiple regression.
  • Hence, the multiple regression equation takes the form:
    • y = a + b1x1 + b2x2 + b3x3 + … + bnxn
    • Dependent Variable = Constant + (Coefficient1 x Independent Variable1) + (Coefficient2 x Independent Variable2) + (Coefficient3 x Independent Variable3) + … etc.
  • y is the predicted value of the dependent variable given the values of the independent variables (x1, x2, x3…etc.).
  • What is unique about multiple regression is that for each of the independent variables, the analysis controls for the effect of the other independent variables. This means that the effect of any independent variable is estimated apart from the effect of all the other independent variables. In this way, it accomplishes the same sort of controlled comparisons as control tables.
  • Interpretation of the unstandardized coefficients (b) is much as it is in bivariate regression, i.e., a unit of change in the independent variable affects the dependent variable by a factor indicated by the regression coefficient (b). Only in this case it does so while controlling for the effects of the other independent variables.
  • The R2 value for multiple regression is similar to the r2 in bivariate regression: it indicates the proportion of variation in the dependent variable explained collectively by the set of independent variables taken together. R2 increases with each independent variable added to the regression model, even when the added variables have no effect on the dependent variable. Therefore we use an Adjusted R2 that corrects for this artificial inflation of R2 in multiple regression models.
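The adjustment is simple arithmetic. As a quick illustration (in Python rather than SPSS, purely to show how the correction works), the sketch below uses figures that appear in the regression output later in this lab (explained SS = 117.684, total SS = 655.598, n = 476 cases, k = 4 predictors):

```python
# Adjusted R-square corrects R-square for the number of predictors (k):
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1).
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 117.684 / 655.598                    # explained SS / total SS, ~.18
print(round(adjusted_r2(r2, 476, 4), 3))  # ~.173, as in the model summary
```

With only four predictors and 476 cases the penalty is small (.180 vs .173); it grows as predictors are added or the sample shrinks.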

b & Beta

  • The relative size of two b values is not necessarily a good indication of which independent variable is a better predictor of the dependent variable since the magnitude of b depends on its particular units of measurement. We can often make better comparisons among independent variables by standardizing the variables. This is done by converting the values of an independent variable into units of standard deviation from its mean.
  • Beta is the standardized version of b. It indicates the effect that one standard deviation unit change in the independent variable has on the dependent variable (also measured in standard deviation units).  The use of Beta coefficients facilitates comparisons among independent variables since they are all expressed in standardized units.
  • Values of b and Beta are both calculated in a multiple regression analysis. Values of b are used in formulating a multiple regression equation. However, b values do not have a common benchmark for comparison since the b values depend on how the variables are coded.
  • Betas from the same multiple regression analysis can be readily compared to one another.  The higher the Beta value, the stronger the relationship the respective independent variable has with the dependent variable.
  • Comparing Betas from equations based on different samples can be misleading, however, since the variance of the standard errors for the samples may differ substantially. In such cases it is best to report the unstandardized b value.
  • Values of b allow us to understand the theoretical importance of an independent variable.  When variables are measured in concrete units like dollars, years, or percentages, b is relatively easy to interpret because it expresses the potential effects of an independent variables on the dependent variable both in their original units of measurement.  The meaning of Beta is not intuitively clear and cannot be interpreted concretely, but when independent variables are measured in different units only Betas allow us to compare directly the effects of different independent variables on the dependent variable.
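As an arithmetic illustration (Python, not SPSS): Beta is simply b rescaled by the ratio of the variables' standard deviations. The sd of the DV can be recovered from the ANOVA table shown later in this lab (total SS = 655.598 on 475 df); the sd used for liberal5 below is a back-solved, hypothetical value, since SPSS does not report it in this output:

```python
# Beta = b * (sd_x / sd_y): b rescaled into standard-deviation units.
b = 1.367                          # unstandardized coefficient for liberal5
sd_y = (655.598 / 475) ** 0.5      # sd of RawMJ3 from total SS and df (~1.175)
sd_x = 0.314                       # hypothetical sd for liberal5 (not reported)
print(round(b * sd_x / sd_y, 3))   # ~.365, the Beta reported for liberal5
```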


Multicollinearity

  • When some of the independent variables are very closely related to one another, multiple regression does not allow us to separate out their independent effects on the dependent variable. This is referred to as multicollinearity (multiple variables lying along the same regression line).
  • Multicollinearity usually occurs when two or more independent variables measuring the same concept are entered into the regression equation.  It often results in strong, but statistically insignificant, regression coefficients (due to large standard errors). We look for multicollinearity either by examining the correlations among our independent variables or, more rigorously, by requesting tolerance measures (tol) as part of our regression analysis, using the statistics subcommand.
  • Tolerance levels indicate the extent to which an independent variable is related to the other independent variables in the model. Values range from zero (.00) to one (1.0). A tolerance of 1.0 means a variable is unrelated to the other independent variables. A tolerance of .00 means an independent variable is completely predictable from the other independent variables.
  • Multicollinearity only becomes a problem as tolerance approaches zero. As a general rule, a tolerance score of .20 or less indicates that collinearity is a problem.  When this is found to be the case, either eliminate one of the variables involved, or combine them into an index.
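In SPSS the tolerance figures come from the /statistics ... tol subcommand, but the definition (tolerance = 1 − R2 from regressing one IV on the others) can be sketched directly. The example below uses simulated data, purely as an illustration:

```python
# Tolerance for one IV = 1 - R2 from regressing that IV on the other IVs.
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
x3 = rng.normal(size=200)
x1 = 0.9 * x2 + 0.1 * rng.normal(size=200)   # x1 nearly duplicates x2

def tolerance(target, others):
    # R2 from regressing `target` on the remaining independent variables
    X = np.column_stack([np.ones(len(target))] + list(others))
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coefs
    r2 = 1 - (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1 - r2

print(tolerance(x1, [x2, x3]) < 0.20)   # True: collinearity would be flagged
print(tolerance(x3, [x1, x2]) > 0.80)   # True: x3 is unrelated to the others
```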

Example #1 Multiple Regression Using Public Opinion Data


PPIC October 2016

  • Dependent Variable: Index of Support for Recreational Marijuana (RawMJ3)
      • 7 categories; Alpha = .777
      • Indicators: q21 recoded as MJPropD; q36 recoded as MJLegalD; q36a recoded as MJTry
  • Independent Variables:
      • Partisan Identification
      • Political Ideology
      • Age
      • Education
      • Interest
    • Hypotheses (Arrow Diagrams)

      • H1: Democratic Party ID → Support Recreational Marijuana
      • H2: Liberal Ideology → Support Recreational Marijuana
      • H3: Young  → Support Recreational Marijuana
      • H4: Educated → Support Recreational Marijuana
      • H5: Interested → Support Recreational Marijuana
    • Syntax
      *Weighting the data*.
      weight by weight.
      *Recoding MJ Index Items*.
      recode q21 (1=1) (2=0) into MJPropD.
      value labels MJPropD 1 'yes' 0 'no'.
      recode q36 (1=1) (2=0) into MJLegalD.
      value labels MJLegalD 1 'yes' 0 'no'.
      recode q36a (1=1) (2=.5) (3=0) into MJTry.
      value labels MJTry 1 'recent' .5 'not recent' 0 'no'.
      *Constructing an Index with alpha = .777*.
      compute RawMJ3 = (MJPropD + MJLegalD + MJTry).
      *Creating IV Indicators of Party Identification & Ideology*.
      recode q40c (1=0) (3=.5) (2=1) into Democrat.
      value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.

      *Democrat5 (adapted from lab 7)*.
      if (q40c = 1) and (q40e =1) Democrat5 =0.
      if (q40c = 1) and (q40e =2) Democrat5 =.25.
      if (q40c = 3) Democrat5 =.5.
      if (q40c =2) and (q40d =2) Democrat5 = .75.
      if (q40c =2) and (q40d=1) Democrat5 =1.
      value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
      recode q37 (1,2=1) (3=.5) (4,5= 0) into liberal.
      value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
      recode q37 (1=1) (2=.75) (3= .5 ) (4=.25) (5= 0) into liberal5. 
      value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.
      *Creating additional IVs-from Lab 7 or 11 or 15*.
      recode d1a (1=0) (2= .2) (3= .4) (4=.6) (5=.8) (6=1) into age.
      value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.
      recode d6 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into educ.
      value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.
      recode d10 (1 =0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
      value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+' .
      recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
      value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
      recode q39 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into vote.
      value labels vote 0 'never' .25 'seldom' .5 'part time' .75 'nearly' 1 'always'.
      corr RawMJ3 liberal5 age educ interest.
      regression variables=RawMJ3 liberal5 age educ interest
        /statistics anova coeff r tol
        /descriptives = n
        /dependent = RawMJ3
        /method = enter.

Syntax Legend

  • The relevant variables are recoded into new variable names and missing values are declared. Note that not all the possible IVs identified in the syntax are used in this regression.
  • A correlation matrix is run to examine the relationships between the DV and the IVs, as well as among the IVs.
  • The regression procedure identifies the variables to be used in the equation.
  • The statistics subcommand asks for the output to include anova, basic regression and correlation coefficients as well as the tolerances (tol), a collinearity diagnostic measure.
  • The descriptives subcommand asks for output to indicate the number of cases on which the regression is calculated.
  • The dependent subcommand indicates that the RawMJ3 is the dependent variable.
  • The method subcommand says to enter the variables into the equation.
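For readers who want to see the recode-and-index logic outside SPSS, the sketch below mirrors it in Python. The respondent records are made up, and the question codings (1 = yes, 2 = no for q21 and q36; 1 = recent, 2 = not recent, 3 = no for q36a) are assumed from the recodes above:

```python
# Mirrors the SPSS recodes that build the RawMJ3 index (assumed codings).
def raw_mj3(q21, q36, q36a):
    mj_prop = {1: 1.0, 2: 0.0}[q21]            # MJPropD
    mj_legal = {1: 1.0, 2: 0.0}[q36]           # MJLegalD
    mj_try = {1: 1.0, 2: 0.5, 3: 0.0}[q36a]    # MJTry
    return mj_prop + mj_legal + mj_try         # index runs 0 through 3

print(raw_mj3(1, 1, 2))   # 2.5: supportive on both items, tried (not recently)
print(raw_mj3(2, 2, 3))   # 0.0: opposed on all three indicators
```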

SPSS Output

Correlation Procedure

          RawMJ3   liberal5   age     educ    interest
RawMJ3     1.000
liberal5    .361    1.000
age        -.209    -.132    1.000
educ       -.120    -.146    -.043   1.000
interest    .122     .079     .147   -.316   1.000

Regression Procedure

Correlations (note: this table is inaptly named; it actually shows Ns)
RawMJ3 liberal5 age educ interest
N RawMJ3 476 476 476 476 476
liberal5 476 476 476 476 476
age 476 476 476 476 476
educ 476 476 476 476 476
interest 476 476 476 476 476
Variables Entered/Removeda
Model Variables Entered Variables Removed Method
1 interest, liberal5, educ, ageb . Enter
a. Dependent Variable: RawMJ3
b. All requested variables entered.


Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .424a .180 .173 1.06900
a. Predictors: (Constant), interest, liberal5, educ, age


ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 117.684 4 29.421 25.745 .000b
Residual 537.914 471 1.143
Total 655.598 475
a. Dependent Variable: RawMJ3
b. Predictors: (Constant), interest, liberal5, educ, age

(Regression) Coefficientsa

Model              B     Std. Error    Beta       t      Sig.   Tolerance
1 (Constant) .774 .209 3.703 .000
liberal5 1.367 .160 .365 8.546 .000 .953
age -.473 .156 -.133 -3.037 .003 .908
educ -.234 .185 -.055 -1.269 .205 .943
interest .435 .187 .102 2.324 .021 .903

a. Dependent Variable: RawMJ3

Interpretation of output

The correlation procedure results show some moderate relationships between the DV and the IVs, and that the IVs are conceptually distinct from the DV. There are no strong relationships among the IVs, suggesting little concern that two or more IVs are measuring essentially the same thing. These results suggest the tolerance scores in the regression analysis are not likely to reveal a problem.

The regression procedure produces an inaptly named table entitled correlations. It actually shows the number of cases (N) on which the regression is based.

The model summary table reports an R-square value indicating explained variance of approximately 18%.

The ANOVA table is used to assess the significance of the overall model. In this case the Sig. value is .000, indicating a very small chance of the results being due to sampling error. As with bivariate regression, the ratio of explained (regression) variance to total variance is the way we calculate R-square (117.7/655.6 = .1795 ≈ .18).
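The entries in the ANOVA table are linked by simple arithmetic, which can be checked directly (illustrative Python; the small gap from the printed F reflects rounding in the reported sums of squares):

```python
# Mean squares are SS / df, and F is MS_regression / MS_residual.
ss_reg, df_reg = 117.684, 4
ss_res, df_res = 537.914, 471
ms_reg = ss_reg / df_reg          # 29.421, as reported
ms_res = ss_res / df_res          # ~1.142 (table shows 1.143, from unrounded values)
print(round(ms_reg / ms_res, 2))  # ~25.76, matching the reported F of 25.745
```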

In the coefficients table the b values indicate the direction and amount of change in the dependent variable associated with a single unit's change in each independent variable. In this example, the indicators for all the IVs are measured at the ordinal level, albeit with differing numbers of categories: ideology (liberal5) and education have 5, age has 6, and interest has 4. The regression results show that attitudes toward recreational marijuana depend to a significant degree on all these independent variables except for education. Moreover, the b values for ideology (liberal5) and interest are positive, so higher values on these predictors are associated with more support for recreational marijuana. Hence a one-unit increase in ideology (as measured by liberal5) produces a bit over a one-category increase in support on the RawMJ3 index. By comparison, a one-unit increase in interest produces less than a half-unit increase on the DV. The b values for age and education are both negative, so higher levels of each of these independent variables are associated with lower values on the dependent variable. Since the independent variables are not measured on the same scale, and have differing numbers of categories, the b values cannot be directly compared.

The Beta values indicate the relative influence of the variables in comparable (standard deviation) units. We can see from the Beta value for liberal5 that ideology has a greater impact on the RawMJ3 index than any of the other predictors. Age comes in second, with interest third. Education is, of course, insignificant and hence not appreciably different from zero.

The significance of the individual independent variables is indicated by a version of the t-test. The t-ratio (or score) is calculated by dividing the b value by the standard error of b. As is usual in significance testing, a t-ratio reaches the .05 level of statistical significance at an absolute (ignoring + or -) value of 1.96. The first two variables easily exceed this value, and therefore we can be confident that their respective relationships with the DV are not due to chance. The t-ratio for interest (2.324) is less impressive but still exceeds the 1.96 threshold and hence is also significant. Education's t-ratio is -1.27, signifying the relationship could well be due to chance and hence is regarded as statistically insignificant.
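Those t-ratios can be reproduced from the b and standard-error columns of the coefficients table (illustrative Python; small differences from the printed t values reflect rounding):

```python
# t = b / SE(b); |t| >= 1.96 marks the .05 significance level.
coefficients = {                 # (b, standard error) from the table above
    "liberal5": (1.367, 0.160),
    "age":      (-0.473, 0.156),
    "educ":     (-0.234, 0.185),
    "interest": (0.435, 0.187),
}
for name, (b, se) in coefficients.items():
    t = b / se
    verdict = "significant" if abs(t) >= 1.96 else "not significant"
    print(f"{name}: t = {t:.2f} ({verdict})")
```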

The tolerance levels indicate no cause for concern over collinearity due to a correlation between the independent variables.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

One can write the regression equation using the information provided in the output detailing the a and b values. The regression equation here is :

RawMJ3 = .774 +1.367(liberal5) – .473(age) -.234(educ) +.435(interest)

This equation can be used to predict attitudes toward recreational marijuana for different combinations of values on the independent variables, e.g. those who are very conservative, in the third age category (35-44), with a middle level of education (college) and a fair amount of interest in politics. This is rarely of concern in theoretically based (nomothetic) social science research, which focuses more upon estimating the relationships between independent and dependent variables than on understanding individual cases.
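That prediction can be computed directly (illustrative Python; the coefficients come from the output above, and the profile values come from the recodes earlier in this lab):

```python
# RawMJ3 = a + b1*liberal5 + b2*age + b3*educ + b4*interest
def predict_rawmj3(liberal5, age, educ, interest):
    return 0.774 + 1.367*liberal5 - 0.473*age - 0.234*educ + 0.435*interest

# Very conservative (0), third age category (.4), college (.5),
# fair amount of interest (.66) -- the profile described in the text.
print(round(predict_rawmj3(0, 0.4, 0.5, 0.66), 2))   # ~0.75 on the 0-3 index
```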


Multiple Regression

  1. Select an available public opinion dataset of interest and review the questionnaire.
  2. Hypothesize a relationship between a dependent variable and at least two independent variables. The variables can be either interval or ordinal, or nominal variables coded as dichotomies.
    • For example, as in the example shown above, ideology and age may both affect support for recreational marijuana.
  3. Based on a Frequency run, decide how to recode each variable (if necessary) and declare missing values.
  4. You may wish to create a correlation matrix with your variables to ensure that your independent variables are related to your dependent variable (and to ensure that your independent variables are not so closely related to one another that multi-collinearity will present a problem).
  5. Create and run the appropriate syntax in SPSS to run a regression analysis with two independent variables.
  6. Based on the multiple regression output, determine whether the overall equation is significant (Sig.<.05) and if so whether the independent variables have significant effects on the dependent variable.
  7. Perhaps add a third independent variable to your regression. In the example used here you might try one of the following variables:
    *Additional IVs*.
    missing values dem_agegrp_iwdate (-9 thru -1).
    missing values inc_incgroup_pre (-9 thru -1).
    missing values dem_edugroup (-9, -2).

Example #2: Multiple Regression working with subsets of cases.


Age and interest in politics may have a greater influence on attitudes toward recreational marijuana among men than among women.


*Create female indicator*.
recode gender (1=0) (2=1) into female.

*Re-run the regression within gender groupings*.
temporary.
select if female = 0.
regression variables=RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.

temporary.
select if female = 1.
regression variables=RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.

Syntax Legend

  • These commands must be used in conjunction with the recodes used in the prior example.
  • The Temporary and Select if commands are used to select subsets of cases. In this case, subsetting allows us to run the same regression analysis separately for men and women. Respondents' gender is distinguished using the dichotomous variable Female, created to clarify the direction of coding.
  • As in the prior example, the regressions again estimate the relative and joint effects of ideology, age, education and interest in politics on attitudes toward recreational marijuana. However, in this case, separate regressions are run for male and female respondents.
  • Although the syntax requests an Anova table, Ns and a list of variables entered, these have been omitted from the output presented below.

SPSS Output

The output appears in two sections. In the first portion Female = 0, so only male respondents are included (N = 475). In the second portion Female = 1, thereby selecting only female respondents (N = 476).

Female = 0
DV = RawMJ3

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .459a .210 .204 .98805
a. Predictors: (Constant), interest, liberal5, age, educ


Model              B     Std. Error    Beta       t      Sig.   Tolerance
1 (Constant) .995 .206 4.837 .000
liberal5 1.214 .152 .333 7.962 .000 .958
age -.730 .141 -.215 -5.167 .000 .969
educ -.200 .169 -.051 -1.188 .235 .913
interest .832 .169 .210 4.909 .000 .920

Female = 1
DV= RawMJ3

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .424a .180 .173 1.06900
a. Predictors: (Constant), interest, liberal5, educ, age


Model              B     Std. Error    Beta       t      Sig.   Tolerance
1 (Constant) .774 .209 3.703 .000
liberal5 1.367 .160 .365 8.546 .000 .953
age -.473 .156 -.133 -3.037 .003 .908
educ -.234 .185 -.055 -1.269 .205 .943
interest .435 .187 .102 2.324 .021 .903

Interpretation of Output:

The equation for males accounts for about 20 percent of the variation in the dependent variable. The signs on all the coefficients are as before. Again ideology, age and political interest are significant, with education insignificant. The Beta coefficients again show ideology is more important as a predictor of views on recreational marijuana than age or political interest.
The equation for females accounts for a bit less of the variance, approximately 17 percent. The signs on the coefficients remain the same; however, the sizes of the significant coefficients differ somewhat from those for males, as do the levels of significance for age and interest.

Comparing the Beta coefficients across the two gender groups suggests that ideology may have a slightly larger effect on attitudes toward marijuana among females than among males, and that age and political interest have less of an influence among females. One can similarly compare the constants.

Optional Technical Details:
Checking the b values and their associated standard errors suggests that these differences are likely due to chance, since the confidence intervals overlap when one considers each coefficient in the context of +/- 1.96 times its standard error. A standard error is the standard deviation of the sampling distribution of a statistic; for a mean it is calculated by dividing the variable's standard deviation by the square root of N, so increasing sample size decreases standard errors. In any case, by this rigorous standard the gender differences in the coefficients for age and interest approach, but do not quite reach, significance. The calculations for interest follow.
Confidence interval for males: .832 +/- (1.96 x .169) = .501 thru 1.163.
Confidence interval for females: .435 +/- (1.96 x .187) = .068 thru .802.
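The overlap check is plain arithmetic and can be sketched directly (illustrative Python; the b and SE values for interest come from the two coefficient tables above):

```python
# 95% CI for b: b +/- 1.96 * SE. Overlapping intervals suggest the
# male/female difference in a coefficient may be due to chance.
def ci95(b, se):
    return (b - 1.96 * se, b + 1.96 * se)

male = ci95(0.832, 0.169)     # ~(0.501, 1.163)
female = ci95(0.435, 0.187)   # ~(0.068, 0.802)
overlap = male[0] <= female[1] and female[0] <= male[1]
print(overlap)                # True: the intervals overlap
```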

Example # 3 Multiple Regression with worlddata (Aggregate) Data

REGRESSION variables = IncomeShareTop10 CivilLiberties TransparencyIndex
   /statistics coeff r tol
   /descriptives = n
   /dependent = IncomeShareTop10
   /method = enter.

Comments on Aggregate Data Syntax

The regression command lists both dependent and independent variables.

The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance. Anova has been omitted but can be reinserted.

The descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the dependent subcommand.

The method subcommand indicates that all the independent variables should be entered together.

Example # 3 Output

Correlations (actually Ns)
                                       Income share held   Freedom House   Transparency
                                       by highest 10%      score           Index
N  Income share held by highest 10%        123                 123             123
   Freedom House score                     123                 123             123
   Transparency Index                      123                 123             123
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .365a .133 .119 6.29900
a. Predictors: (Constant), Transparency Index, Freedom House score


Model                    B      Std. Error    Beta     Sig.   Tolerance
1 (Constant)          35.660     1.504                 .000
  Freedom House score   .054      .055        .118     .334     .487
  Transparency Index  -1.429      .396       -.440     .000     .487

Interpretation of output
Adjusted R-square indicates explained variance of approximately 12%.

The b values indicate the direction and number of units (as coded) of change in the dependent variable due to a one-unit change in each independent variable. The Freedom House rating of civil liberties in a country is positively related to income inequality (b = .054). And greater transparency of a nation's government is related to less concentration of its income (b = -1.429). Since the independent variables do not use the same measurement scale, the b values cannot be directly compared.

The Beta coefficients indicate the relative influence of the variables in comparable (standard deviation) units.  The transparency rating has roughly four times the influence of the freedom rating on the DV.

The tolerance scores indicate that the independent variables are likely correlated but not to such an extent that they measure the same thing.

The influence of the freedom score is no greater than one would expect due to chance. In contrast, the transparency rating is statistically significant.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

With this information one can write the regression equation:

Income inequality = 35.660 + .054(freedom) – 1.429(transparency)
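Using the constant (35.660) and b values from the coefficients table, the equation can produce rough predictions (illustrative Python; the input country scores below are hypothetical):

```python
# Predicted income share of the top 10%, from the aggregate regression:
# DV = 35.660 + .054 * (Freedom House score) - 1.429 * (Transparency Index)
def predicted_top10_share(freedom, transparency):
    return 35.660 + 0.054 * freedom - 1.429 * transparency

# Hypothetical country: freedom score 4, transparency score 6.
print(round(predicted_top10_share(4, 6), 1))   # ~27.3 percent
```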


  • What is the difference between Pearson’s r analysis and multiple regression?
  • Why do the values of the coefficients differ based on the combination of the independent variables that are included in the analysis?
  • How can we visualize the results of a multiple regression equation?


  • Multiple regression is distinct from Pearson’s correlation insofar as it allows us to determine the relative effects of an independent variable upon a given dependent variable while controlling for the effect of all the other variables in the equation.  In contrast, correlation analysis only allows us to compare the uncontrolled relationships between two variables.
  • There may be some change in the value of the coefficients when different combinations of variables are included in the regression because the analysis controls for the effects of all the other variables included in the equation.
  • A three-dimensional scatterplot can be created with the graph command:
    graph /scatterplot(xyz)=IV1 with DV with IV2.
    For example:
    graph /scatterplot(xyz)=CivilLiberties with IncomeShareTop10 with TransparencyIndex.
    graph /scatterplot(xyz)=liberal5 with RawMJ3 with age.

These graphs can be rotated by double clicking on the image and then clicking on the icon in the top row which is seventh from the left.