Poli 101
LAB 17
Multiple Regression
PURPOSE
 To learn how to perform and interpret multiple regression analysis.
 To understand the meaning and use of Beta.
 To learn about different regression procedures.
MAIN POINTS
Multiple Regression
 Regression analysis can be performed with more than one independent variable. Regression involving two or more independent variables is called multiple regression.
 Hence, the multiple regression equation takes the form:
 y = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + … + b_{n}x_{n}
 Dependent Variable = Constant + (Coefficient_{1} x Independent Variable_{1}) + (Coefficient_{2} x Independent Variable_{2}) + (Coefficient_{3} x Independent Variable_{3}) + … etc.
 y is the predicted value of the dependent variable given the values of the independent variables (x_{1}, x_{2}, x_{3}, etc.).
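The prediction arithmetic behind the equation can be sketched in a few lines of Python; the constant and coefficients below are made-up values for illustration only, not from any real model:

```python
# Multiple regression prediction: y = a + b1*x1 + b2*x2 + ... + bn*xn
# (a and the b's are hypothetical values, chosen only to show the arithmetic)
def predict(a, bs, xs):
    """Return the predicted y given constant a, coefficients bs, and IV values xs."""
    return a + sum(b * x for b, x in zip(bs, xs))

y_hat = predict(1.0, [2.0, -0.5], [3.0, 4.0])  # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
print(y_hat)
```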
 What is unique about multiple regression is that for each of the independent variables, the analysis controls for the effect of the other independent variables. This means that the effect of any independent variable is estimated apart from the effect of all the other independent variables. In this way, it accomplishes the same sort of controlled comparisons as control tables.
 Interpretation of the unstandardized coefficients (b) is much as it is in bivariate regression, i.e., a one-unit change in the independent variable changes the dependent variable by the amount indicated by the regression coefficient (b). In multiple regression, however, it does so while controlling for the effects of the other independent variables.
 The R^{2} value for multiple regression is similar to the r^{2} in bivariate regression. The R^{2} for multiple regression indicates the proportion of variation in the dependent variable explained collectively by the set of independent variables taken together. R^{2} increases with each independent variable added to the regression model, even when the added variables have no effect on the dependent variable. Therefore we use an Adjusted R^{2} that corrects for this artificial inflation of R^{2} in multiple regression models.
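The standard correction is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of independent variables. A quick Python check using the figures reported in the Model Summary of Example #1 (R² = .180, N = 476, 4 predictors) reproduces the printed .173:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the Model Summary in Example #1
adj = adjusted_r2(0.180, 476, 4)
print(round(adj, 3))  # 0.173
```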
b & Beta
 The relative size of two b values is not necessarily a good indication of which independent variable is a better predictor of the dependent variable since the magnitude of b depends on its particular units of measurement. We can often make better comparisons among independent variables by standardizing the variables. This is done by converting the values of an independent variable into units of standard deviation from its mean.
 Beta is the standardized version of b. It indicates the effect that one standard deviation unit change in the independent variable has on the dependent variable (also measured in standard deviation units). The use of Beta coefficients facilitates comparisons among independent variables since they are all expressed in standardized units.
 Values of b and Beta are both calculated in a multiple regression analysis. Values of b are used in formulating a multiple regression equation. However, b values do not have a common benchmark for comparison since the b values depend on how the variables are coded.
 Betas from the same multiple regression analysis can be readily compared to one another. The higher the Beta value, the stronger the relationship the respective independent variable has with the dependent variable.
 Comparing Betas from equations based on different samples can be misleading, however, since the variances of the variables may differ substantially across samples. In such cases it is best to report the unstandardized b values.
 Values of b allow us to understand the theoretical importance of an independent variable. When variables are measured in concrete units like dollars, years, or percentages, b is relatively easy to interpret because it expresses the potential effect of an independent variable on the dependent variable in their original units of measurement. The meaning of Beta is not intuitively clear and cannot be interpreted concretely, but when independent variables are measured in different units, only Betas allow us to compare directly the effects of different independent variables on the dependent variable.
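The conversion between b and Beta is mechanical: Beta = b × (standard deviation of x / standard deviation of y). A minimal Python sketch with toy data:

```python
import statistics

def beta_from_b(b, x_values, y_values):
    """Standardize an unstandardized coefficient: Beta = b * (SD_x / SD_y)."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)

# Toy data: x has SD 1.0 and y has SD 2.0, so b = 2.0 becomes Beta = 1.0
beta = beta_from_b(2.0, [1, 2, 3], [2, 4, 6])
print(beta)  # 1.0
```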
Multicollinearity
 When some of the independent variables are very closely related to one another, multiple regression does not allow us to separate out their independent effects on the dependent variable. This is referred to as multicollinearity (several variables lying along the same regression line).
 Multicollinearity usually occurs when two or more independent variables measuring the same concept are entered into the regression equation. It often results in strong, but statistically insignificant, regression coefficients (due to large standard errors). We look for multicollinearity either by examining the correlations among our independent variables or, more rigorously, by requesting tolerance measures (tol) as part of our regression analysis, using the statistics subcommand.
 Tolerance levels indicate the extent to which an independent variable is related to the other independent variables in the model. Its values range from zero (.00) to one (1.0). A tolerance of 1.0 means a variable is unrelated to the other independent variables. A tolerance of .00 means an independent variable is completely explained by the other independent variables.
 Multicollinearity only becomes a problem as tolerance approaches zero. As a general rule, a tolerance score of .20 or less indicates that collinearity is a problem. When this is found to be the case, either eliminate one of the variables involved, or combine them into an index.
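Tolerance for a given IV is 1 minus the R² obtained by regressing that IV on all the other IVs; with only two IVs this reduces to 1 − r². A minimal Python sketch for the two-IV case (the data here are invented):

```python
def pearson_r(x, y):
    """Pearson correlation, computed from scratch to keep the sketch self-contained."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def tolerance_two_ivs(x1, x2):
    """With exactly two IVs, tolerance = 1 - r^2 for their correlation."""
    return 1 - pearson_r(x1, x2) ** 2

# Perfectly collinear IVs yield a tolerance of 0 -- a fatal collinearity problem
tol = tolerance_two_ivs([1, 2, 3, 4], [2, 4, 6, 8])
print(round(tol, 2))  # 0.0
```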
Example #1 Multiple Regression Using Public Opinion Data
Dataset:
PPIC October 2016
 Dependent Variable: Index of Support for Recreational Marijuana (RawMJ3)
 Index of Support for Recreational Marijuana (7 categories; Alpha = .777)
 Indicators: q21 recoded as MJPropD; q36 recoded as MJLegalD; q36a recoded as MJTry
 Independent Variables:
 Partisan Identification
 Political Ideology
 Age
 Education
 Interest

 Hypotheses Arrow Diagrams
 H1: Democratic Party ID → Support Recreational Marijuana
 H2: Liberal Ideology → Support Recreational Marijuana
 H3: Young → Support Recreational Marijuana
 H4: Educated → Support Recreational Marijuana
 H5: Interested → Support Recreational Marijuana
 Syntax
*Weighting the data*.
weight by weight.

*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.

*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).

*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.

*Democrat5 (adapted from Lab 7)*.
if (q40c = 1) and (q40e = 1) Democrat5 = 0.
if (q40c = 1) and (q40e = 2) Democrat5 = .25.
if (q40c = 3) Democrat5 = .5.
if (q40c = 2) and (q40d = 2) Democrat5 = .75.
if (q40c = 2) and (q40d = 1) Democrat5 = 1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.

recode q37 (1,2=1) (3=.5) (4,5=0) into liberal.
value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
recode q37 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into liberal5.
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.

*Creating additional IVs from Lab 7 or 11 or 15*.
recode d1a (1=0) (2=.2) (3=.4) (4=.6) (5=.8) (6=1) into age.
value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.
recode d6 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into educ.
value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.
recode d10 (1=0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+'.
recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
recode q39 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into vote.
value labels vote 0 'never' .25 'seldom' .5 'part time' .75 'nearly' 1 'always'.

corr RawMJ3 liberal5 age educ interest.

regression variables = RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.
Syntax Legend
 The relevant variables are recoded into new variable names and missing values are declared. Note that not all the possible IVs identified in the syntax are used in this regression.
 A correlation matrix is run to examine the relationships between the DV and the IVs as well as among the IVs
 The regression procedure identifies the variables to be used in the equation.
 The statistics subcommand asks for the output to include anova, basic regression and correlation coefficients as well as the tolerances (tol), a collinearity diagnostic measure.
 The descriptives subcommand asks for output to indicate the number of cases on which the regression is calculated.
 The dependent subcommand indicates that the RawMJ3 is the dependent variable.
 The method subcommand says to enter the variables into the equation.
SPSS Output
Correlation Procedure
Correlations  
RawMJ3  liberal5  age  educ  interest  
RawMJ3  1.000  
liberal5  .361  1.000  
age  .209  .132  1.000  
educ  .120  .146  .043  1.000  
interest  .122  .079  .147  .316  1.000 
Regression Procedure
Correlations (note: this table is inaptly named; it actually shows Ns)  
RawMJ3  liberal5  age  educ  interest  
N  RawMJ3  476  476  476  476  476 
liberal5  476  476  476  476  476  
age  476  476  476  476  476  
educ  476  476  476  476  476  
interest  476  476  476  476  476 
Variables Entered/Removed^{a}  
Model  Variables Entered  Variables Removed  Method 
1  interest, liberal5, educ, age^{b}  .  Enter 
a. Dependent Variable: RawMJ3  
b. All requested variables entered. 
Model Summary  
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate 
1  .424^{a}  .180  .173  1.06900 
a. Predictors: (Constant), interest, liberal5, educ, age 
ANOVA^{a}  
Model  Sum of Squares  df  Mean Square  F  Sig.  
1  Regression  117.684  4  29.421  25.745  .000^{b} 
Residual  537.914  471  1.143  
Total  655.598  475  
a. Dependent Variable: RawMJ3  
b. Predictors: (Constant), interest, liberal5, educ, age 
(Regression) Coefficients^{a}
Model  Unstandardized Coefficients  Standardized Coefficients  t  Sig.  
B  Std. Error  Beta  Tolerance  
1  (Constant)  .774  .209  3.703  .000  
liberal5  1.367  .160  .365  8.546  .000  .953  
age  -.473  .156  -.133  -3.037  .003  .908  
educ  -.234  .185  -.055  -1.269  .205  .943  
interest  .435  .187  .102  2.324  .021  .903 
a. Dependent Variable: RawMJ3
Interpretation of output
The correlation procedure results show some moderate relationships between DV & IVs, and that the IVs are conceptually distinct from the DV. There are no strong relationships among the IVs, suggesting little concern that two or more IVs are measuring essentially the same thing. These results suggest the tolerance scores in the regression analysis are not likely to pose a problem.
The regression procedure produces an inaptly named table entitled correlations. It actually shows the number of cases (N) on which the regression is based.
The Model Summary table reports an R-square value indicating explained variance of approximately 18%.
The ANOVA table is used to assess the significance of the overall model. In this case the significance is .000, indicating a very small chance that the results are due to sampling error. As with bivariate regression, the ratio of explained (regression) variance to total variance is how R-square is calculated (117.7/655.6 = .1795 ≈ .18).
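Using the sums of squares from the ANOVA table, the R-square check is one line of Python:

```python
# Sums of squares from the ANOVA table in Example #1
ss_regression = 117.684
ss_total = 655.598
r2 = ss_regression / ss_total
print(round(r2, 3))  # 0.18
```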
In the Coefficients table the b values indicate the direction and amount of change in the dependent variable associated with a one-unit change in each independent variable. In this example, the indicators for all the IVs are measured at the ordinal level, albeit with differing numbers of categories: ideology (liberal5) and education have 5, age has 6, and interest has 4. The regression results show that attitudes toward recreational marijuana depend to a significant degree on all these independent variables except education. Moreover, the b values for ideology (liberal5) and interest are positive, so higher values on these predictors are associated with more support for recreational marijuana. Hence a one-unit increase in ideology (as measured by liberal5) produces a bit over a one-category increase in support on the RawMJ3 index. By comparison, a one-unit increase in interest produces less than a half-unit increase on the DV. The b values for age and education are both negative, so higher levels of each of these independent variables are associated with lower values on the dependent variable. Since the independent variables are not measured on the same scale, however, the b values cannot be directly compared.
The Beta values indicate the relative influence of the variables in comparable (standard deviation) units. We can see from the Beta value for liberal5 that ideology has a greater impact on the RawMJ3 index than any of the other predictors. Age comes in second, with interest third. Education is, of course, insignificant and hence not appreciably different from zero.
The significance of the individual independent variables is indicated by a version of the t-test. The t-ratio (or t-score) is calculated by dividing the b value by the standard error of b. As is usual in significance testing, a t-ratio reaches the .05 level of statistical significance at an absolute value (ignoring + or −) of 1.96. The first two variables easily exceed this value, and therefore we can be confident that their respective relationships with the DV are not due to chance. The t-ratio for interest is less impressive but still exceeds the 1.96 threshold and hence is also significant. Education's t-ratio is 1.27 in absolute value, signifying the relationship could well be due to chance and hence is regarded as statistically insignificant.
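The t-ratio arithmetic can be verified directly from the coefficients table (small discrepancies with the SPSS output reflect rounding of the printed b and standard error values):

```python
def t_ratio(b, se):
    """t = b / standard error of b; |t| >= 1.96 is significant at the .05 level."""
    return b / se

# b and Std. Error values from the coefficients table
t_liberal5 = t_ratio(1.367, 0.160)   # about 8.54
t_educ = t_ratio(-0.234, 0.185)      # about -1.26
print(abs(t_liberal5) >= 1.96, abs(t_educ) >= 1.96)  # True False
```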
The tolerance levels indicate no cause for concern over collinearity among the independent variables.
The constant (or yintercept) indicates the value of ‘a’ in the regression equation.
One can write the regression equation using the information provided in the output detailing the a and b values. The regression equation here is :
RawMJ3 = .774 + 1.367(liberal5) − .473(age) − .234(educ) + .435(interest).
This equation can be used to predict attitudes toward recreational marijuana for different combinations of values on the independent variables, e.g. those who are very conservative, in the third age category (35-44), with a middle level of education (college) and a fair amount of interest in politics. Such prediction of individual cases is rarely of concern in theoretically based (nomothetic) social science research, which focuses more upon estimating the relationships between independent and dependent variables than on understanding individual cases.
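For instance, plugging in the codings from the recode value labels above (very conservative liberal5 = 0, third age category = .4, college educ = .5, fair amount of interest = .66) gives a predicted score of about .75 on the index:

```python
# Constant and b values from the regression output above;
# IV values follow the recode value labels in the syntax section
a, b_lib, b_age, b_educ, b_int = 0.774, 1.367, -0.473, -0.234, 0.435
liberal5, age, educ, interest = 0.0, 0.4, 0.5, 0.66

pred = a + b_lib * liberal5 + b_age * age + b_educ * educ + b_int * interest
print(round(pred, 2))  # 0.75
```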
INSTRUCTIONS
Multiple Regression
 Select an available public opinion dataset of interest and review the questionnaire.
 Hypothesize a relationship between a dependent variable and at least two independent variables. The variables can be either interval or ordinal, or a nominal variable coded as a dichotomy.
 For example, as in the example shown above, partisan identification and political ideology may both affect support for recreational marijuana.
 Based on a Frequency run, decide how to recode each variable (if necessary) and declare missing values.
 You may wish to create a correlation matrix with your variables to ensure that your independent variables are related to your dependent variable (and to ensure that your independent variables are not so closely related to one another that multicollinearity will present a problem).
 Create and run the appropriate syntax in SPSS to run a regression analysis with two independent variables.
 Based on the multiple regression output, determine whether the overall equation is significant (Sig.<.05) and if so whether the independent variables have significant effects on the dependent variable.
 Perhaps add a third independent variable to your regression. In the example used here you might try one of the following variables:
*Additional IVs*.
*Demographics*.
missing values dem_agegrp_iwdate (-9 thru -1).
missing values inc_incgroup_pre (-9 thru -1).
missing values dem_edugroup (-9, -2).
Example #2: Multiple Regression working with subsets of cases.
Hypotheses:
Age and interest in politics may have a greater influence on attitudes toward recreational marijuana among men than among women.
Syntax:
*Create female indicator*.
recode gender (1=0) (2=1) into female.

*Rerun the regression within gender groupings*.
temporary.
select if female = 0.
regression variables = RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.

temporary.
select if female = 1.
regression variables = RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.
Syntax Legend
 These commands must be used in conjunction with the recodes used in the prior example.
 The Temporary and Select If commands are used to select subsets of cases. In this case, subsetting allows us to run the same regression analysis separately for women and men. Respondents' gender is distinguished using the dichotomous variable Female, created to clarify the direction of coding.
 As in the prior example, the regressions again estimate the relative and joint effects of ideology, age, education, and interest in politics on attitudes toward recreational marijuana. However, in this case, separate regressions are run for male and female respondents.
 Although the syntax requests an Anova table, Ns and a list of variables entered, these have been omitted from the output presented below.
SPSS Output
The output appears in two sections. In the first portion Female = 0, so only male respondents are included (N = 475). The second portion selects cases in which Female = 1, so only females are included (N = 476).
Female = 0
DV = RawMJ3
Model Summary  
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate 
1  .459^{a}  .210  .204  .98805 
a. Predictors: (Constant), interest, liberal5, age, educ 
Coefficients
Model  Unstandardized Coefficients  Standardized Coefficients  t  Sig.  
B  Std. Error  Beta  Tol  
1  (Constant)  .995  .206  4.837  .000  
liberal5  1.214  .152  .333  7.962  .000  .958  
age  -.730  .141  -.215  -5.167  .000  .969  
educ  -.200  .169  -.051  -1.188  .235  .913  
interest  .832  .169  .210  4.909  .000  .920 
Female = 1
DV= RawMJ3
Model Summary  
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate 
1  .424^{a}  .180  .173  1.06900 
a. Predictors: (Constant), interest, liberal5, educ, age 
Coefficients
Model  Unstandardized Coefficients  Standardized Coefficients  t  Sig.  
B  Std. Error  Beta  Tol  
1  (Constant)  .774  .209  3.703  .000  
liberal5  1.367  .160  .365  8.546  .000  .953  
age  -.473  .156  -.133  -3.037  .003  .908  
educ  -.234  .185  -.055  -1.269  .205  .943  
interest  .435  .187  .102  2.324  .021  .903 
Interpretation of Output:
The equation for males accounts for about 20 percent of the variation in the dependent variable. The signs on all the coefficients are as before. Again, ideology, age, and political interest are significant, with education insignificant. The Beta coefficients again show ideology is more important as a predictor of views on recreational marijuana than age or political interest.
The equation for females accounts for a bit less of the variance, approximately 17 percent. The signs on the coefficients remain the same, however the size of the significant coefficients differ somewhat from those of males, as do the levels of significance on age and interest.
Comparing the Beta coefficients across the two gender groups suggests that ideology may have a slightly larger effect on attitudes toward marijuana among females than among males, and that age and political interest have less of an influence among females. One can similarly compare the constants.
Optional Technical Details:
Checking the b values and their associated standard errors suggests that these differences are likely due to chance, since the confidence intervals overlap when one considers each coefficient within ±1.96 of its standard error. Standard errors are the standard deviation of the sampling distribution for the variable; they are calculated by dividing the variable's standard deviation by the square root of N, so increasing sample size decreases standard errors. In any case, by this rigorous standard the gender differences in the coefficients for age and interest approach but do not quite reach significance. The calculations for interest follow.
Confidence interval for males: .832 ± (1.96 × .169) = .501 thru 1.163.
Confidence interval for females: .435 ± (1.96 × .187) = .068 thru .802.
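The overlap check can be scripted; the b values and standard errors come from the two coefficients tables above:

```python
def ci_95(b, se):
    """95% confidence interval: b plus or minus 1.96 standard errors."""
    return (b - 1.96 * se, b + 1.96 * se)

def intervals_overlap(ci1, ci2):
    """True when the two intervals share any values."""
    return ci1[0] <= ci2[1] and ci2[0] <= ci1[1]

male_interest = ci_95(0.832, 0.169)    # about (.501, 1.163)
female_interest = ci_95(0.435, 0.187)  # about (.068, .802)
print(intervals_overlap(male_interest, female_interest))  # True
```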
Example #3 Multiple Regression Using World Data (Aggregate Data)
REGRESSION variables = IncomeShareTop10 CivilLiberties TransparencyIndex
 /statistics coeff r tol
 /descriptives = n
 /dependent = IncomeShareTop10
 /method = enter.
Comments on Aggregate Data Syntax
The regression command lists both dependent and independent variables.
The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance. Anova has been omitted but can be reinserted.
The descriptives subcommand asks for the number of cases used in the regression.
The dependent variable is declared with the dependent subcommand.
The method subcommand indicates that all the independent variables should be entered together.
Example # 3 Output
Correlations (Ns)  
Income share held by highest 10%  Freedom House score  Transparency Index  
N  Income share held by highest 10%  123  123  123 
Freedom House score  123  123  123  
Transparency Index  123  123  123 
Model Summary  
Model  R  R Square  Adjusted R Square  Std. Error of the Estimate 
1  .365^{a}  .133  .119  6.29900 
a. Predictors: (Constant), Transparency Index, Freedom House score 
Coefficients
Model  Unstandardized Coefficients  Standardized Coefficients  Sig.  
B  Std. Error  Beta  Tolerance  
1  (Constant)  35.660  1.504  .000  
Freedom House score  .054  .055  .118  .334  .487  
Transparency Index  -1.429  .396  -.440  .000  .487 
Interpretation of output
Adjusted R-square indicates explained variance of approximately 12%.
The b values indicate the direction and number of units (as coded) of change in the dependent variable due to a one-unit change in each independent variable. The Freedom House rating of civil liberties in a country is weakly and positively related to income inequality (b = .054). Greater transparency of a nation's government is related to a less concentrated income distribution (b = −1.429). Since the independent variables do not use the same measurement scale, the b values cannot be directly compared.
The Beta coefficients indicate the relative influence of the variables in comparable (standard deviation) units. The transparency rating has roughly four times the influence of the freedom rating on the DV.
The tolerance scores indicate that the independent variables are likely correlated but not to such an extent that they measure the same thing.
The influence of the freedom score is no greater than one would expect due to chance. In contrast, the transparency rating is statistically significant.
The constant (or yintercept) indicates the value of ‘a’ in the regression equation.
With this information one can write the regression equation:
Income inequality = 35.660 + .054(freedom) − 1.429(transparency)
QUESTIONS FOR REFLECTION
 What is the difference between Pearson’s r analysis and multiple regression?
 Why do the values of the coefficients differ based on the combination of the independent variables that are included in the analysis?
 How can we visualize the results of a multiple regression equation?
DISCUSSION
 Multiple regression is distinct from Pearson’s correlation insofar as it allows us to determine the relative effects of an independent variable upon a given dependent variable while controlling for the effect of all the other variables in the equation. In contrast, correlation analysis only allows us to compare the uncontrolled relationships between two variables.
 There may be some change in the value of the coefficients when different combinations of variables are included in the regression because the analysis controls for the effects of all the other variables included in the equation.
 A three dimensional scatterplot can be created using:
graph /scatterplot(xyz) = IV1 with DV with IV2.
or
graph /scatterplot(xyz) = CivilLiberties with IncomeShareTop10 with TransparencyIndex.
or
graph /scatterplot(xyz) = liberal5 with RawMJ3 with age.
These graphs can be rotated by double-clicking on the image and then clicking on the icon in the top row, seventh from the left.