Lab 17

POL242 LAB 17

Multiple Regression

PURPOSE

  • To learn how to perform and interpret multiple regression analysis.
  • To understand the meaning and use of Beta.
  • To learn about different regression procedures.

MAIN POINTS

Multiple Regression

  • Regression analysis can be performed with more than one independent variable. Regression involving two or more independent variables is called multiple regression.
  • Hence, the multiple regression equation takes the form:
    • Y = a + b1X1 + b2X2 + b3X3 + … + bnXn
    • Dependent Variable = Constant + (Coefficient1 X Independent Variable1) + (Coefficient2 X Independent Variable2) + (Coefficient3 X Independent Variable3) + … etc.
  • Y is the predicted value of the dependent variable given the values of the independent variables (X1, X2, X3…etc.).
  • What is unique about multiple regression is that for each of the independent variables, the analysis controls for the effect of the other independent variables. This means that the effect of any independent variable is estimated apart from the effect of all the other independent variables. In this way, it accomplishes the same sort of controlled comparisons as control tables.
  • Interpretation of the unstandardized coefficients (b) is much as it is in bivariate regression, i.e., a one-unit change in the independent variable changes the dependent variable by the amount indicated by the regression coefficient (b). In multiple regression, however, it does so while controlling for the effects of the other independent variables.
  • The R2 value for multiple regression is similar to the r2 in bivariate regression: it indicates the proportion of variation in the dependent variable explained collectively by the set of independent variables taken together. R2 increases with each independent variable added to the regression model, even when the added variables have no effect on the dependent variable. Therefore we use an Adjusted R2 that corrects for this artificial inflation of R2 in multiple regression models, as shown below.
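
The standard adjustment, and the one reflected in the output below, is: Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of cases and k is the number of independent variables. For example, with the figures from Example #1 below (R2 = .180, n = 831, k = 2) this works out to 1 - (.820)(830/828) ≈ .178, matching the adjusted value reported in the output.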

b & Beta

  • The relative size of two b values is not necessarily a good indication of which independent variable is a better predictor of the dependent variable, since the magnitude of b depends on its particular units of measurement. We can often make better comparisons among independent variables by standardizing the variables. This is done by converting the values of an independent variable into units of standard deviation from its mean (see Linneman Ch 11 for details).
  • Beta is the standardized version of b. It indicates the effect that one standard deviation unit change in the independent variable has on the dependent variable (also measured in standard deviation units).  The use of Beta coefficients facilitates comparisons among independent variables since they are all expressed in standardized units.
  • Values of b and Beta are both calculated in a multiple regression analysis. Values of b are used in formulating a multiple regression equation. However, b values do not have a common benchmark for comparison since the b values depend on how the variables are coded.
  • Betas from the same multiple regression analysis can be readily compared to one another.  The larger the Beta value (in absolute terms), the stronger the relationship the respective independent variable has with the dependent variable.
  • Comparing Betas from equations based on different samples can be misleading, however, since the variances of the variables may differ substantially from one sample to another. In such cases it is best to report the unstandardized b values.
  • Values of b allow us to understand the theoretical importance of an independent variable.  When variables are measured in concrete units like dollars, years, or percentages, b is relatively easy to interpret because it expresses the potential effect of an independent variable on the dependent variable in their original units of measurement.  The meaning of Beta is less intuitive and cannot be interpreted concretely, but when independent variables are measured in different units only Betas allow us to compare directly the effects of different independent variables on the dependent variable (a quick check on this is sketched below).
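
One way to see what Beta is (a minimal sketch, using the variables created in Example #1 below): Beta is simply b rescaled by the standard deviations of the two variables (Beta = b X SDx / SDy). Equivalently, saving z-score versions of the variables and re-running the regression on them reproduces the Betas as the unstandardized coefficients. DESCRIPTIVES with the /save subcommand creates the z-score versions, which SPSS names by prefixing Z to the original names.

*create standardized (z-score) versions of the variables from Example #1*.
descriptives variables = rawegal confeel finances
   /save.
*the b values from this run equal the Betas from the original run*.
regression variables = Zrawegal Zconfeel Zfinances
   /statistics coeff r
   /dependent = Zrawegal
   /method = enter.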

Multicollinearity

  • When some of the independent variables are very closely related to one another, multiple regression does not allow us to separate out their independent effects on the dependent variable. This is referred to as Multicollinearity (several things being on the same regression line). See Archer & Berdahl p 336 and Linneman pp 537-41 for discussions.
  • Multicollinearity usually occurs when two or more independent variables measuring the same concept are entered into the regression equation.  It often results in strong, but statistically insignificant, regression coefficients (due to large standard errors). We look for multicollinearity either by examining the correlations among our independent variables or, more rigorously, by requesting tolerance measures (tol) as part of our regression analysis, using the statistics subcommand (both checks are sketched briefly after this list).
  • Tolerance indicates the extent to which an independent variable is related to the other independent variables in the model. Its values range from zero (.00) to one (1.0): a tolerance of 1.0 means a variable is unrelated to the other independent variables, while a tolerance near .00 means it is almost completely predictable from them.
  • Multicollinearity only becomes a problem as tolerance approaches zero. As a general rule, a tolerance score of .20 or less indicates that collinearity is a problem.  When this is found to be the case, eliminate one of the variables involved, or combine them into an index.
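
A minimal sketch of the first check, using the variable names created in Example #1 below; tolerance itself is requested with the tol keyword on the statistics subcommand of the regression, as in the examples that follow.

*check how strongly the independent variables are correlated with each other*.
correlations /variables = confeel finances.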

Example #1 Multiple Regression Using Public Opinion Data

  • Data Set
    • CES 2011
  • Dependent Variable
    • Egal (Alpha =.67)
      • Indicators: PES11_41; mbs11_k2; mbs11_b3
    • Independent Variables
      • IV1: ConFeel (cps11_18) – Feeling toward Conservatives
      • IV2: Finance (cps11_66) – Personal Finance Improving
    • Hypotheses
      • X1 & X2 –> Y.
      • In this case, feeling toward Conservatives and Personal Financial Situation –> Egalitarian Attitudes
    • Syntax
weight by WGTSamp.

*Preparing indicators of Attitudes re Inequality*.
*declare missing values on pes11_41*.
missing values pes11_41 (8,9).
*reverse scoring on pes11_41 and make it range from 0-1*.
recode pes11_41 (1=1) (2=.75) (3=.5) (4= .25)
    (5=0) into undogap.
value labels undogap 0 'muchless' .25 'someless' .5 'asnow'
    .75 'somemore' 1 'muchmore'.

*rescale mbs11_k2 from 0-10 to 0-1 and reverse its scoring*.
missing values mbs11_k2 (-99).
compute govact = (((mbs11_k2 * -1) +10)/10).
value labels govact 0'not act' 1 'gov act'.

*recode and re-label mbs11_b3*.
recode mbs11_b3 (1=1) (2=0) into goveqch.
value labels goveqch 1 'decent living' 0 'leave alone'.

*create an indexed variable (alpha=.66).
compute rawegal = undogap + govact + goveqch.

*interval measure of partisan feeling from Lab 7*.
recode cps11_18 (0=0) (else = copy) into ConFeel.
missing values Confeel (996, 998, 999).

*create finance measures (from Lab 7)*.
missing values cps11_66 (8,9).
recode cps11_66 (1=1) (3=0) (5=.5) into finances.
variable labels finances 'personal finances'.
value labels finances 0 'worse' .5 'same' 1 'better'.

*Multiple Regression Analysis*.
Regression variables = rawegal confeel finances
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = rawegal
   /method = enter.

 

 

Syntax Legend

  • The relevant variables are recoded into new variable names and missing values are declared.
  • The regression procedure identifies the variables to be used in the equation.
  • The statistics subcommand asks for the output to include the anova table, the regression coefficients (coeff), the model summary (r), and the tolerances (tol), a collinearity diagnostic measure.
  • The descriptives subcommand asks for output indicating the number of cases on which the regression is calculated.
  • The dependent subcommand indicates that rawegal is the dependent variable.
  • The method subcommand says to enter the variables into the equation.

 

SPSS Output

 

Number of Cases (for Correlations & Regression)
                        rawegal   ConFeel   personal finances
N   rawegal                 831       831                 831
    ConFeel                 831       831                 831
    personal finances       831       831                 831

 

 

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .424a .180 .178 .62484
a. Predictors: (Constant), personal finances, ConFeel

 

                                     Coefficients

Model                      B    Std. Error          t    Sig.   Tolerance
(Constant)             2.749          .050     55.451    .000
ConFeel                -.008          .001    -11.033    .000        .969
personal finances      -.419          .074     -5.681    .000        .969

 

Interpretation of output

R-square indicates explained variance of approximately 18%.

The ANOVA table has not been included here. It is used to assess the significance of the overall model.

The b values indicate the direction and amount of change in the dependent variable associated with a single unit’s change in each independent variable.  In this example, the indicator for ConFeel is measured at the interval level, with scores ranging from 0 through 100, whereas the personal finances indicator is an ordinal measure with three values. The regression results show that egalitarian attitudes depend to a significant degree on both independent variables. Moreover, the b values are in each instance negative, so higher levels of each independent variable are associated with lower values on the dependent variable. Since the independent variables are not measured on the same scale, however, the b values cannot be directly compared.

The Beta values (reported in the Standardized Coefficients column of the full SPSS output) indicate the relative influence of the variables in comparable (standard deviation) units.  We can see from the Beta values that feelings toward the Conservatives have a greater impact on egalitarianism than do perceptions of personal finances.

The tolerance levels indicate that there is no problem of collinearity due to a correlation between the independent variables.

The significance of the individual independent variables is indicated by a version of the T-test. The T-ratio (or score) is calculated by dividing the b value by its standard error. Because of rounding in the displayed coefficients this is less evident for ConFeel than for personal finances: the actual b value for ConFeel is -.008104 and its standard error is .000735 (each can be seen by double-clicking the relevant coefficient in the SPSS output), giving a T-ratio of -.008104/.000735 ≈ -11.03, the value shown in the output. As is usual in significance testing, a T-ratio of 1.96 or more in absolute value reaches the .05 level of statistical significance. Both variables easily exceed this value, and therefore we can be confident that their respective relationships with egalitarian attitudes are not due to chance.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

One can write the regression equation using the information provided in the output detailing the a and b values. The regression equation here is: rawegal = 2.749 - .008(ConFeel) - .419(personal finances). This equation can be used to predict egalitarian attitude levels for different combinations of values on the independent variables, e.g. for those who rate the Conservatives at 75 and report improving personal finances (as worked through below). Such prediction is rarely the focus of theoretically driven social science research, which concentrates more on estimating the relationships between independent and dependent variables.
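
A minimal sketch of such a prediction (predegal is just an illustrative scratch variable, not part of the lab): plugging ConFeel = 75 and finances = 1 into the equation gives 2.749 - .008(75) - .419(1) = 2.749 - .600 - .419 ≈ 1.73 on the 0-3 rawegal scale.

*illustrative prediction for ConFeel = 75 and improving finances (finances = 1)*.
compute predegal = 2.749 - (.008 * 75) - (.419 * 1).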

INSTRUCTIONS

Multiple Regression

  1. Select an available public opinion dataset of interest and review the questionnaire.
  2. Hypothesize a relationship between a dependent variable and at least two independent variables. The variables can be interval, ordinal, or nominal coded as a dichotomy.
    • For example, as in the example shown above, feelings toward the Conservatives and personal finances may both affect egalitarian attitudes.
  3. Based on a Frequency run, decide how to recode each variable (if necessary) and declare missing values.
  4. You may wish to create a correlation matrix with your variables to ensure that your independent variables are related to your dependent variable (and to ensure that your independent variables are not so closely related to one another that multicollinearity will present a problem).
  5. Create and run the appropriate syntax in SPSS to run a regression analysis with two independent variables.
  6. Based on the multiple regression output, determine whether the overall equation is significant (Sig.<.05) and if so whether the independent variables have significant effects on the dependent variable.
  7. Perhaps add a third independent variable to your regression. In the example used here you might try age, income, or the economic performance measure introduced in Lab 7 (a sketch of the syntax follows below).
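
A minimal sketch of step 7, assuming a third recoded variable has already been created (here a placeholder named age_yrs; the name and its recoding are hypothetical, not part of the example above):

*add a hypothetical third independent variable to the model*.
Regression variables = rawegal confeel finances age_yrs
   /statistics anova coeff r tol
   /descriptives = n
   /dependent = rawegal
   /method = enter.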

 

Example #2: Multiple Regression working with subsets of cases.

 

Hypotheses:

Partisan attitudes and personal finances have a greater influence on egalitarian attitudes among men than among women.

Syntax:

*Create gender indicator from lab6*.
recode rgender11 (1=0) (5=1) into female.

*Re-run the regression within gender groupings*.
Temporary.
Select if female =1.
Regression variables = rawegal confeel finances
   /statistics coeff r tol
   /descriptives = n
   /dependent = rawegal
   /method = enter.

Temporary.
Select if female =0.
Regression variables = rawegal confeel finances
   /statistics coeff r tol
   /descriptives = n
   /dependent = rawegal
   /method = enter.

Syntax Legend

  • These commands must be used in conjunction with the recodes used in the prior example.
  • The Temporary and Select if commands are used to select subsets of cases. In this case, subsetting allows us to run the same regression analysis separately for women and men. As in previous lab examples, respondents’ gender is distinguished using the dichotomous variable Female.
  • As in the prior example, the regressions again estimate the relative and joint effects of partisan feelings and perceived personal finances on egalitarian attitudes. However, in this case, separate regressions are run for female and male respondents. The anova keyword on the statistics subcommand has not been included in this example.
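
An alternative to repeating Temporary/Select if pairs (a sketch only, not required for this lab) is SPSS's SPLIT FILE, which runs the same procedure separately for each value of a grouping variable:

*run the regression separately by gender using split file*.
sort cases by female.
split file separate by female.
Regression variables = rawegal confeel finances
   /statistics coeff r tol
   /descriptives = n
   /dependent = rawegal
   /method = enter.
split file off.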

SPSS Output

Female = 1.

Correlations
                        rawegal   ConFeel   personal finances
N   rawegal                 456       456                 456
    ConFeel                 456       456                 456
    personal finances       456       456                 456

 

 

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .400a .160 .156 .57220
a. Predictors: (Constant), personal finances, ConFeel

 

 

Model                      B    Std. Error      Beta    Sig.   Tolerance
1   (Constant)         2.755          .062              .000
    ConFeel            -.007          .001     -.318    .000        .973
    personal finances  -.418          .093     -.196    .000        .973

 

Female = 0

 

 

Correlations
                        rawegal   ConFeel   personal finances
N   rawegal                 375       375                 375
    ConFeel                 375       375                 375
    personal finances       375       375                 375

 

 

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .449a .202 .198 .67308
a. Predictors: (Constant), personal finances, ConFeel

 

Model                      B    Std. Error      Beta    Sig.   Tolerance
1   (Constant)         2.718          .079              .000
    ConFeel            -.010          .001     -.392    .000        .965
    personal finances  -.389          .116     -.159    .001        .965

 

Interpretation of Output:

The output appears in two sections. In the first portion Female=1 thereby selecting only female respondents (N=456). The second portion selects for cases in which Female=0, so only males are included (N=375).

The equation for females accounts for approximately sixteen percent of the variation in egalitarianism. Moreover both IVs are significant and negative with feelings toward Conservatives being the stronger of the two predictors.

The equation for males accounts for approximately twenty percent of the variation in the dependent variable. Both IVs are again negative and significant. The Beta coefficients again show that partisan feeling is a more important predictor of egalitarianism than personal finances.

Comparing the Beta coefficients for ConFeel across the two groups suggests that partisan feeling has a greater impact on egalitarianism among males than among females. The difference in the influence of personal finances seems less marked, and checking the b values and their associated standard errors suggests that this difference may well be due to chance, since the estimates overlap once each is considered within +/- 1.96 of its standard error.
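
A rough worked check (a sketch using approximate 95% confidence bounds, b +/- 1.96 X SE): for personal finances the interval among women is -.418 +/- 1.96(.093), or roughly -.60 to -.24, and among men -.389 +/- 1.96(.116), or roughly -.62 to -.16. The two intervals overlap almost entirely, so the apparent gender difference for this variable could easily be due to chance.
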
Example #3: Multiple Regression with worlddata (Aggregate) Data

REGRESSION variables = IncomeShareTop10 CivilLiberties TransparencyIndex
   /statistics coeff r tol/descriptives =n
   /dependent= IncomeShareTop10 /method=enter.

Comments on Aggregate Data Syntax

The regression command lists both dependent and independent variables.

The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance. Anova has again been omitted but can be reinserted.

The descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the dependent subcommand.

The method subcommand indicates that all the independent variables should be entered together.

Example # 3 Output

 

Correlations
                                          Income share held   Freedom House   Transparency
                                          by highest 10%      score           Index
N   Income share held by highest 10%            123                123             123
    Freedom House score                         123                123             123
    Transparency Index                          123                123             123

 

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .365a .133 .119 6.29900
a. Predictors: (Constant), Transparency Index, Freedom House score

 

Model                       B    Std. Error      Beta    Sig.   Tolerance
1   (Constant)         35.660         1.504              .000
    Freedom House        .054          .055      .118    .334        .487
    Transparency Index -1.429          .396     -.440    .000        .487

 

 

 

Interpretation of output
Adjusted R-square indicates explained variance of approximately 12%.

The b values indicate the direction and number of units (as coded) of change in the dependent variable associated with a one-unit change in each independent variable. The Freedom House rating of civil liberties in a country is positively related to income inequality (b = .054), and greater transparency in a nation's government is related to a smaller share of income held by the top 10% (b = -1.429). Since the independent variables do not use the same measurement scale, the b values cannot be directly compared.

The Beta coefficients indicate the relative influence of the variables in comparable (standard deviation) units.  The transparency rating has roughly four times the influence of the freedom rating on the DV.

The tolerance scores indicate that the independent variables are likely correlated but not to such an extent that they measure the same thing.

The influence of the freedom score is no greater than one would expect due to chance. In contrast, the transparency rating is statistically significant.

The constant (or y-intercept) indicates the value of  ‘a’ in the regression equation.

With this information one can write the regression equation:

Income inequality = 35.66 + .054(freedom) - 1.429(transparency)
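
For instance, holding the freedom score constant, the equation implies that each one-point increase on the transparency index is associated with a predicted drop of roughly 1.4 points on the income-share measure, i.e., a somewhat smaller share of income held by the highest 10%.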

QUESTIONS FOR REFLECTION

  • What is the difference between Pearson’s r analysis and multiple regression?
  • Why do the values of the coefficients differ based on the combination of the independent variables that are included in the analysis?
  • How can we visualize the results of a multiple regression equation?

DISCUSSION

  • Multiple regression is distinct from Pearson’s correlation insofar as it allows us to determine the relative effects of an independent variable upon a given dependent variable while controlling for the effect of all the other variables in the equation.  In contrast, correlation analysis only allows us to compare the uncontrolled relationships between two variables.
  • There may be some change in the value of the coefficients when different combinations of variables are included in the regression because the analysis controls for the effects of all the other variables included in the equation.
  • A three dimensional scatterplot can be created using:
graph
   /scatterplot(xyz)=IV1 with DV with IV2.

 

or

 

graph
   /scatterplot(xyz)=ConFeel with rawegal with finances.