
Poli 101

LAB 17

Multiple Regression

PURPOSE

  • To learn how to perform and interpret multiple regression analysis.
  • To understand the meaning and use of Beta.
  • To learn about different regression procedures.

MAIN POINTS

Multiple Regression

  • Regression analysis can be performed with more than one independent variable. Regression involving two or more independent variables is called multiple regression.
  • Hence, the multiple regression equation takes the form:
    • y = a + b1x1 + b2x2 + b3x3 + … + bnxn
    • Dependent Variable = Constant + (Coefficient1 x Independent Variable1) + (Coefficient2 x Independent Variable2) + (Coefficient3 x Independent Variable3) + … etc.
  • y is the predicted value of the dependent variable given the values of the independent variables (x1, x2, x3…etc.).
  • What is unique about multiple regression is that for each of the independent variables, the analysis controls for the effect of the other independent variables. This means that the effect of any independent variable is estimated apart from the effect of all the other independent variables. In this way, it accomplishes the same sort of controlled comparisons as control tables.
  • Interpretation of the unstandardized coefficients (b) is much as it is in bivariate regression, i.e., a unit of change in the independent variable affects the dependent variable by a factor indicated by the regression coefficient (b). Only in this case it does so while controlling for the effects of the other independent variables.
  • The R2 value for multiple regression is similar to the r2 in bivariate regression. It indicates the proportion of variation in the dependent variable explained collectively by the set of independent variables. R2 increases with each independent variable added to the regression model, even when the added variables have no effect on the dependent variable. Therefore we use an Adjusted R2 that corrects for this artificial inflation of R2 in multiple regression models.
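The estimation behind these main points can be sketched in plain Python (no statistics packages). The sketch below is illustrative only, with invented data values: it fits y = a + b1x1 + b2x2 by ordinary least squares via the normal equations, then computes R2 and adjusted R2.

```python
# A plain-Python sketch of multiple regression by ordinary least squares:
# fit y = a + b1*x1 + b2*x2 via the normal equations, then compute R-square
# and adjusted R-square. Data values are invented for illustration.

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(y, xs):
    """OLS fit of y on the predictor columns in xs.
    Returns ([a, b1, b2, ...], R-square, adjusted R-square)."""
    n, k = len(y), len(xs)
    X = [[1.0] + [col[i] for col in xs] for i in range(n)]   # design matrix
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k + 1)]
           for r in range(k + 1)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k + 1)]
    coefs = solve(XtX, Xty)
    yhat = [sum(c * v for c, v in zip(coefs, row)) for row in X]
    ybar = sum(y) / n
    ss_total = sum((v - ybar) ** 2 for v in y)
    ss_resid = sum((v - h) ** 2 for v, h in zip(y, yhat))
    r2 = 1 - ss_resid / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # corrects R2 inflation
    return coefs, r2, adj_r2

x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]          # invented IV 1
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]          # invented IV 2
y  = [4.0, 3.0, 11.0, 10.0, 18.0, 17.0]      # DV: exactly 1 + 2*x1 + 3*x2
coefs, r2, adj_r2 = multiple_regression(y, [x1, x2])
print([round(c, 3) for c in coefs], round(r2, 3), round(adj_r2, 3))
```

Because y here is constructed to equal 1 + 2x1 + 3x2 exactly, the recovered coefficients are a = 1, b1 = 2, b2 = 3 and R2 is 1; real survey data would of course leave residual variance.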

b & Beta

  • The relative size of two b values is not necessarily a good indication of which independent variable is a better predictor of the dependent variable since the magnitude of b depends on its particular units of measurement. We can often make better comparisons among independent variables by standardizing the variables. This is done by converting the values of an independent variable into units of standard deviation from its mean.
  • Beta is the standardized version of b. It indicates the effect that one standard deviation unit change in the independent variable has on the dependent variable (also measured in standard deviation units).  The use of Beta coefficients facilitates comparisons among independent variables since they are all expressed in standardized units.
  • Values of b and Beta are both calculated in a multiple regression analysis. Values of b are used in formulating a multiple regression equation. However, b values do not have a common benchmark for comparison since the b values depend on how the variables are coded.
  • Betas from the same multiple regression analysis can be readily compared to one another.  The higher the Beta value, the stronger the relationship the respective independent variable has with the dependent variable.
  • Comparing Betas from equations based on different samples can be misleading, however, since the variance of the standard errors for the samples may differ substantially. In such cases it is best to report the unstandardized b value.
  • Values of b allow us to understand the theoretical importance of an independent variable.  When variables are measured in concrete units like dollars, years, or percentages, b is relatively easy to interpret because it expresses the potential effects of an independent variables on the dependent variable, both in their original units of measurement.  The meaning of Beta is not intuitively clear and cannot be interpreted concretely, but when independent variables are measured in different units only Betas allow us to compare directly the effects of different independent variables on the dependent variable.
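The b-to-Beta conversion described above can be sketched with invented data. In the bivariate case the resulting Beta equals Pearson's r, which makes a handy check:

```python
import statistics

def beta_from_b(b, x, y):
    """Standardize an unstandardized coefficient: Beta = b * (sd of x / sd of y)."""
    return b * statistics.stdev(x) / statistics.stdev(y)

# Invented bivariate data. With one predictor, Beta equals Pearson's r.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
mx, my = statistics.mean(x), statistics.mean(y)
cov = sum((a - mx) * (c - my) for a, c in zip(x, y)) / (len(x) - 1)
b = cov / statistics.variance(x)                        # unstandardized slope
r = cov / (statistics.stdev(x) * statistics.stdev(y))   # Pearson's r
print(round(beta_from_b(b, x, y), 4), round(r, 4))
```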

Multicollinearity

  • When some of the independent variables are very closely related to one another, multiple regression does not allow us to separate out their independent effects on the dependent variable. This is referred to as Multicollinearity (several things being on the same regression line).
  • Multicollinearity usually occurs when two or more independent variables measuring the same concept are entered into the regression equation.  It often results in strong, but statistically insignificant, regression coefficients (due to large standard errors). We look for multicollinearity either by examining the correlations among our independent variables or, more rigorously, by requesting tolerance measures (tol) as part of our regression analysis, using the statistics subcommand.
  • Tolerance levels indicate the extent to which an independent variable is related to the other independent variables in the model. Values range from zero (.00) to one (1.0). A tolerance of 1.0 means a variable is unrelated to the other independent variables. A tolerance of .00 means an independent variable is completely predictable from the other independent variables.
  • Multicollinearity only becomes a problem as tolerance approaches zero. As a general rule, a tolerance score of .20 or less indicates that collinearity is a problem.  When this is found to be the case, either eliminate one of the variables involved, or combine them into an index.
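In the special case of exactly two independent variables, each IV's tolerance reduces to 1 - r2, where r is the Pearson correlation between the two IVs. A plain-Python sketch with invented data:

```python
import statistics

def tolerance_two_ivs(x1, x2):
    """With exactly two IVs, each one's tolerance is 1 - r**2, where r is
    the Pearson correlation between the two IVs."""
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (len(x1) - 1)
    r = cov / (statistics.stdev(x1) * statistics.stdev(x2))
    return 1 - r ** 2

iv = [1, 2, 3, 4, 5]
near_copy = [1.1, 2.0, 3.1, 3.9, 5.0]   # nearly measures the same concept
unrelated = [3, 1, 4, 1, 5]             # little relation to iv
print(round(tolerance_two_ivs(iv, near_copy), 3))   # near .00: collinearity problem
print(round(tolerance_two_ivs(iv, unrelated), 3))   # near 1.0: no problem
```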

Example #1 Multiple Regression Using Public Opinion Data

Dataset:

PPIC October 2016

  • Dependent Variable: Index of Support for Recreational Marijuana (RawMJ3)
    • Index of Support for Recreational Marijuana (7 categories; Alpha = .777)
    • Indicators:
      q21 recoded as MJPropD
      q36 recoded as MJLegalD
      q36a recoded as MJTry
  • Independent Variables:
    • Political Ideology
    • Age
    • Education
    • Interest
  • Hypotheses Arrow Diagrams
    • H1: Liberal Ideology → Support Recreational Marijuana
    • H2: Young → Support Recreational Marijuana
    • H3: Educated → Support Recreational Marijuana
    • H4: Interested → Support Recreational Marijuana
    • Syntax
      *Weighting the data*.
      weight by weight.
      *Recoding MJ Index Items*.
      recode q21 (1=1) (2=0) into MJPropD.
      value labels MJPropD 1 'yes' 0 'no'.
      recode q36 (1=1) (2=0) into MJLegalD.
      value labels MJLegalD 1 'yes' 0 'no'.
      recode q36a (1=1) (2=.5) (3=.0) into MJTry.
      value labels MJTry 1 'recent' .5 'not recent' 0 'no'.
      *Constructing an Index with alpha = .777*.
      compute RawMJ3 = (MJPropD + MJLegalD + MJTry).
      *Creating IV Indicator of Ideology*.
      recode q37 (1=1) (2=.75) (3= .5 ) (4=.25) (5= 0) into liberal5. 
      value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.
      
      *Creating additional IVs-from Lab 7 or 11 or 15*.
      recode d1a (1=0) (2= .2) (3= .4) (4=.6) (5=.8) (6=1) into age.
      value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.
      
      recode d6 (1=0) (2=.25) (3=.5) (4=.75) (5=1) into educ.
      value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.
      
      recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
      value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
        
      
      corr RawMJ3 liberal5 age educ interest.
      regression variables=RawMJ3 liberal5 age educ interest
        /statistics anova coeff r tol
        /descriptives = n
        /dependent = RawMJ3
        /method = enter.

Syntax Legend

  • The relevant variables are recoded into new variable names and missing values are declared.
  • A correlation matrix is run to examine the relationships between the DV and the IVs, as well as among the IVs.
  • The regression procedure identifies the variables to be used in the equation.
  • The statistics subcommand asks for the output to include anova, basic regression and correlation coefficients as well as the tolerances (tol), a collinearity diagnostic measure.
  • The descriptives subcommand asks for output to indicate the number of cases on which the regression is calculated.
  • The dependent subcommand indicates that the RawMJ3 is the dependent variable.
  • The method subcommand says to enter the variables into the equation.
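The index used here is a simple sum of three recoded items, and its reliability is summarized by Cronbach's alpha (.777 for the actual PPIC items). Below is a plain-Python sketch of the alpha formula, using invented item scores rather than the PPIC data:

```python
import statistics

def cronbach_alpha(items):
    """items: a list of equal-length lists of item scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of the item total)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_var / statistics.variance(totals))

# Invented 0/1 and 0/.5/1 scores mimicking MJPropD, MJLegalD, and MJTry.
mjpropd  = [1, 1, 0, 0, 1, 0, 1, 1]
mjlegald = [1, 1, 0, 0, 1, 0, 0, 1]
mjtry    = [1, .5, 0, 0, .5, 0, .5, 1]
alpha = cronbach_alpha([mjpropd, mjlegald, mjtry])
print(round(alpha, 3))
```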

SPSS Output

Correlation Procedure

Correlations
            RawMJ3  liberal5     age    educ  interest
RawMJ3       1.000
liberal5      .361     1.000
age          -.209     -.132   1.000
educ          .120      .146    .043   1.000
interest      .122      .079    .147    .316     1.000

Regression Procedure

Correlations (note: this table is inaptly named; it actually reports Ns)
            RawMJ3  liberal5   age  educ  interest
N RawMJ3       951       951   951   951       951
  liberal5     951       951   951   951       951
  age          951       951   951   951       951
  educ         951       951   951   951       951
  interest     951       951   951   951       951
Variables Entered/Removeda
Model Variables Entered Variables Removed Method
1 interest, liberal5, educ, ageb . Enter
a. Dependent Variable: RawMJ3
b. All requested variables entered.
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .441a .195 .191 1.03824
a. Predictors: (Constant), interest, liberal5, educ, age
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 246.285 4 29.421 61.571 .000b
Residual 1019.642 946 1.078
Total 1265.928 950
a. Dependent Variable: RawMJ3
b. Predictors: (Constant), interest, liberal5, age, educ

(Regression) Coefficientsa

Model          B      Std. Error    Beta        t     Sig.   Tolerance
1 (Constant)   .654      .123                5.325    .000
  liberal5    1.283      .111       .344    11.525    .000     .958
  age         -.473      .156      -.133    -3.032    .000     .945
  educ         .218      .126       .052     1.733    .083     .929
  interest     .707      .126       .171     5.606    .000     .917

a. Dependent Variable: RawMJ3

Interpretation of output

The correlation procedure results show some moderate relationships between DV & IVs, and that the IVs are conceptually distinct from the DV. There are no strong relationships among the IVs, suggesting little concern that two or more IVs are measuring essentially the same thing. These results and the high tolerance scores (>.2) in the regression analysis suggest that collinearity is not likely to pose a problem.

The regression procedure produces an inaptly named table entitled correlations. It actually shows the number of cases (N) on which the regression is based.

The model summary table reports an R-square value indicating explained variance of approximately 19%.

The ANOVA table is used to assess the significance of the overall model. In this case the Sig. value is .000, indicating a very small chance that the results are due to sampling error. As with bivariate regression, the ratio of explained (regression) variance to total variance is how we calculate R-square (246.3/1265.9 = .1945 ≈ .19).
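That arithmetic can be checked directly from the ANOVA table:

```python
# R-square = explained (regression) sum of squares / total sum of squares,
# using the values reported in the ANOVA table.
ss_regression = 246.285
ss_total = 1265.928
r_square = ss_regression / ss_total
print(round(r_square, 4))
```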

In the coefficients table the b values indicate the direction and amount of change in the dependent variable associated with a one-unit change in each independent variable. In this example, the indicators for all the IVs are measured at the ordinal level, albeit with differing numbers of categories: ideology (liberal5) and education have 5, age has 6, and interest has 4. The regression results show that attitudes toward recreational marijuana depend to a significant degree on all of these independent variables except education. Moreover, the b values for ideology (liberal5) and interest are positive, so higher values on these predictors are associated with more support for recreational marijuana. Hence a one-unit increase in ideology (as measured by liberal5) produces a bit over a one-category increase in support on the RawMJ3 index. By comparison, a one-unit increase in interest produces seven-tenths of a unit increase on the DV, whereas a one-unit increase in education produces about two-tenths of a unit change in RawMJ3. The b value for age is negative, so higher levels of this independent variable are associated with lower values on the dependent variable. Since the independent variables are not measured on the same scale, however, the b values cannot be directly compared.

The Beta values indicate the relative influence of the variables in comparable (standard deviation) units. We can see from the Beta value for liberal5 that ideology has a greater impact on the RawMJ3 index than any of the other predictors. Interest comes in second, with age third. Education is, of course, insignificant and hence not appreciably different from zero.

The significance of the individual independent variables is assessed with a version of the t-test. The t-ratio (or score) is calculated by dividing the b value by the standard error of b. As is usual in significance testing, a t-ratio reaches the .05 level of statistical significance at an absolute value (ignoring + or -) of 1.96. The t-ratios for ideology, age, and interest easily exceed this value, so we can be confident that their respective relationships with the DV are not due to chance. Education's t-ratio of 1.73 falls somewhat short of the 1.96 benchmark, signifying that the relationship could well be due to chance; it is regarded as marginally significant at best, or more likely insignificant.
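The t-ratios can be reproduced from the b values and standard errors in the coefficients table:

```python
# t-ratio = b / (standard error of b); significant at the .05 level
# when |t| > 1.96. The b and SE values are taken from the table above.
rows = {
    "liberal5": (1.283, .111),
    "age":      (-.473, .156),
    "educ":     (.218, .126),
    "interest": (.707, .126),
}
for name, (b, se) in rows.items():
    t = b / se
    print(f"{name:9s} t = {t:7.2f}   significant: {abs(t) > 1.96}")
```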

The tolerance levels all exceed .9 and are well over the .2 threshold. Thus there is no cause for concern over collinearity due to a correlation between the independent variables.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

One can write the regression equation using the information provided in the output detailing the a and b values. The regression equation here is :

RawMJ3 = .65 + 1.28(liberal5) – .47(age) + .22(educ) + .71(interest)

This equation can be used to predict attitudes toward recreational marijuana for different combinations of values on the independent variables, e.g., those who are very conservative, in the third age category (35-44), with a middle level of education (college) and a fair amount of interest in politics. Such prediction is rarely of concern in theoretically based (nomothetic) social science research, which focuses more upon estimating the relationships between independent and dependent variables than on understanding individual cases.
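Using the a and b values from the coefficients table, the prediction just described can be computed directly (a plain-Python sketch; the arguments are the recoded category scores for the profile mentioned above):

```python
# Predicted RawMJ3 score from the fitted equation; a and b values taken
# from the coefficients table. Profile: very conservative (liberal5 = 0),
# third age category (age = .4), college education (educ = .5),
# fair amount of interest (interest = .66).
def predict_rawmj3(liberal5, age, educ, interest):
    return .654 + 1.283 * liberal5 - .473 * age + .218 * educ + .707 * interest

print(round(predict_rawmj3(0, .4, .5, .66), 2))
```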

INSTRUCTIONS

Multiple Regression

  1. Select an available public opinion dataset of interest and review the questionnaire.
  2. Hypothesize a relationship between a dependent variable and at least two independent variables. The variables can be either interval or ordinal, or a nominal variable coded as a dichotomy.
    • For example, as in the example shown above, political ideology and political interest may both affect attitudes toward recreational marijuana.
  3. Based on a Frequency run, decide how to recode each variable (if necessary) and declare missing values.
  4. You may wish to create a correlation matrix with your variables to ensure that your independent variables are related to your dependent variable (and to ensure that your independent variables are not so closely related to one another that multi-collinearity will present a problem).
  5. Create and run the appropriate syntax in SPSS to run a regression analysis with two independent variables.
  6. Based on the multiple regression output, determine whether the overall equation is significant (Sig.<.05) and if so whether the independent variables have significant effects on the dependent variable.
  7. Perhaps add a third independent variable to your regression and examine how the coefficients change.

Example #2: Multiple Regression working with subsets of cases.

Hypotheses:

Age and interest in politics may have a greater influence on attitudes toward recreational marijuana among men than among women.

Syntax:

*Create female indicator.
recode gender (1=0) (2=1) into female.

*Re-run the regression within gender groupings*.
temporary.
select if female = 0.
regression variables=RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.

temporary.
select if female = 1.
regression variables=RawMJ3 liberal5 age educ interest
 /statistics anova coeff r tol
 /descriptives = n
 /dependent = RawMJ3
 /method = enter.

Syntax Legend

  • These commands must be used in conjunction with the recodes used in the prior example.
  • The Temporary and Select if commands are used to select subsets of cases. In this case, subsetting allows us to run the same regression analysis separately for women and men. Respondents’ gender is distinguished using the dichotomous variable Female, created to clarify the direction of coding.
  • As in the prior example, the regressions again estimate the relative and joint effects of ideology, age, education, and interest in politics on attitudes toward recreational marijuana. However, in this case, separate regressions are run for male and female respondents.
  • Although the syntax requests an Anova table, Ns and a list of variables entered, these have been omitted from the output presented below.

SPSS Output

The output appears in two sections. In the first portion, Female = 0, thereby selecting only male respondents (N = 476). The second portion selects cases in which Female = 1, so only females are included (N = 475).

Female = 0
DV = RawMJ3

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .459a .210 .204 .98805
a. Predictors: (Constant), interest, liberal5, age, educ


Coefficients

Model          B      Std. Error    Beta        t     Sig.    Tol
1 (Constant)   .995      .206                4.837    .000
  liberal5    1.214      .152       .333     7.962    .000    .958
  age         -.730      .141      -.215    -5.167    .000    .969
  educ         .200      .169       .051     1.188    .235    .913
  interest     .832      .169       .210     4.909    .000    .920

Female = 1
DV= RawMJ3

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .424a .180 .173 1.06900
a. Predictors: (Constant), interest, liberal5, educ, age


Coefficients

Model          B      Std. Error    Beta        t     Sig.    Tol
1 (Constant)   .774      .209                3.703    .000
  liberal5    1.367      .160       .365     8.546    .000    .953
  age         -.473      .156      -.133    -3.037    .003    .908
  educ         .234      .185       .055     1.269    .205    .943
  interest     .435      .187       .102     2.324    .021    .903


Interpretation of Output:

The equation for males accounts for about 20 percent of the variation in the dependent variable. The signs on all the coefficients are as before. Again ideology, age and political interest are significant with education insignificant. The Beta coefficients again show ideology is more important as a predictor of views on recreational marijuana than age or political interest.
The equation for females accounts for a bit less of the variance, approximately 17 percent. The signs on the coefficients remain the same, however the size of the significant coefficients differ somewhat from those of males, as do the levels of significance on age and interest.

Comparing the Beta coefficients across the two gender groups suggests that ideology may have a slightly larger effect on attitudes toward marijuana among females than among males, and that age and political interest have less of an influence among females. One can similarly compare the constants, which show a greater base rate of support among males.

Optional Technical Details:
Checking the b values and their associated standard errors suggests that these differences are likely due to chance, since the confidence intervals overlap when each coefficient is considered in the context of +/- 1.96 times its standard error. Standard errors are the standard deviation of the sampling distribution for the variable; they are calculated by dividing the variable’s standard deviation by the square root of N. So increasing sample size decreases standard errors, and decreasing sample size, as we do here by selecting each gender separately, increases them. In any case, by this rigorous standard the gender differences in the coefficients for age and interest approach, but do not quite reach, significance. The calculations for interest follow.
Confidence interval for males: .832 +/- (1.96 x .169) = .501 thru 1.163.
Confidence interval for females: .435 +/- (1.96 x .187) = .068 thru .802.
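The confidence-interval comparison can be reproduced in plain Python:

```python
# 95% confidence interval for a coefficient: b +/- 1.96 * (standard error of b).
# Values are the interest coefficients from the two gender-specific regressions.
def ci(b, se):
    return (b - 1.96 * se, b + 1.96 * se)

male_ci = ci(.832, .169)     # interest coefficient, males
female_ci = ci(.435, .187)   # interest coefficient, females

# The intervals overlap if each lower bound lies below the other interval's
# upper bound, which is the informal test of a chance difference used above.
overlap = male_ci[0] <= female_ci[1] and female_ci[0] <= male_ci[1]
print(male_ci, female_ci, overlap)
```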

Example # 3 Multiple Regression with worlddata (Aggregate) Data

REGRESSION variables = IncomeShareTop10 CivilLiberties TransparencyIndex
   /statistics coeff r tol/descriptives =n
   /dependent= IncomeShareTop10 /method=enter.

Comments on Aggregate Data Syntax

The regression command lists both dependent and independent variables.

The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance. Anova has been omitted but can be reinserted.

The descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the dependent subcommand.

The method subcommand indicates that all the independent variables should be entered together.

Example # 3 Output

Correlations
                                      Income share     Freedom House   Transparency
                                      held by top 10%      score          Index
N  Income share held by highest 10%        123               123            123
   Freedom House score                     123               123            123
   Transparency Index                      123               123            123
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .365a .133 .119 6.29900
a. Predictors: (Constant), Transparency Index, Freedom House score

Coefficients

Model                     B      Std. Error    Beta     Sig.   Tolerance
1 (Constant)           35.660      1.504                .000
  Freedom House score     .054       .055       .118    .334     .487
  Transparency Index    -1.429       .396      -.440    .000     .487

Interpretation of output
Adjusted R-square indicates explained variance of approximately 12%.

The b values indicate the direction and number of units (as coded) of change in the dependent variable due to a one-unit change in each independent variable. The Freedom House rating of civil liberties in a country is positively related to income inequality (b = .054), and greater transparency of a nation’s government is related to a less concentrated distribution of income (b = -1.429). Since the independent variables do not use the same measurement scale, the b values cannot be directly compared.

The Beta coefficients indicate the relative influence of the variables in comparable (standard deviation) units.  The transparency rating has roughly four times the influence of the freedom rating on the DV.

The tolerance scores indicate that the independent variables are likely correlated but not to such an extent that they measure the same thing.

The influence of the freedom score is no greater than one would expect due to chance. In contrast, the transparency rating is statistically significant.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

With this information one can write the regression equation:

Income inequality = 35.66 + .054(freedom) – 1.429(transparency)

QUESTIONS FOR REFLECTION

  • What is the difference between Pearson’s r analysis and multiple regression?
  • Why do the values of the coefficients differ based on the combination of the independent variables that are included in the analysis?
  • How can we visualize the results of a multiple regression equation?

DISCUSSION

  • Multiple regression is distinct from Pearson’s correlation insofar as it allows us to determine the relative effects of an independent variable upon a given dependent variable while controlling for the effect of all the other variables in the equation.  In contrast, correlation analysis only allows us to compare the uncontrolled relationships between two variables.
  • There may be some change in the value of the coefficients when different combinations of variables are included in the regression because the analysis controls for the effects of all the other variables included in the equation.
  • A three dimensional scatterplot can be created using:
graph
   /scatterplot(xyz)=IV1 with DV with IV2.
or
 graph
   /scatterplot(xyz)= CivilLiberties IncomeShareTop10 TransparencyIndex.
or
graph
   /scatterplot(xyz)=liberal5 with RawMJ3 with age.

These graphs can be rotated by double-clicking on the image and then clicking on the icon in the top row that is seventh from the left.