
Poli 101 LAB MANUAL 16

Bivariate Regression

PURPOSE

  • To introduce regression analysis.
  • To learn how to perform a regression and interpret its results.

Part I–MAIN POINTS

  • Regression is a technique that presents the relationship between two (or more) variables in the form of a simple linear function. The regression model finds the best-fitting equation by calculating the line that minimizes the sum of the squared deviations of the data points from it.

In bivariate analysis, regression takes the form:

    • y = a + bx, where:
    • y is the dependent variable;
    • x is the independent variable;
    • b is the unstandardized regression coefficient;
    • a is the intercept or constant.

Translating the equation into words, we have:

Value of the Dependent Variable = Intercept (Constant) + Regression Coefficient times the Value of the Independent Variable

  • The regression equation allows us to predict the approximate value of the dependent variable given any value of the independent variable.
  • We interpret regression results in terms of the implications for the dependent variable (y) of a unit change (increase or decrease) in the independent variable (x).
  • For instance, assume a regression equation is Income = 10,000 + (5,000 × Education). Beginning from a constant of 10,000 (the predicted income when education is zero), every one-unit increase in education is associated with a 5,000-unit increase in predicted income.
    • Significance is assessed in two ways in regression: first for the equation as a whole, and second for each individual independent variable. If the significance level is greater than 0.05 for either measure, we cannot infer that the relationship described by the equation, or the effect reflected in the individual regression coefficient, is different from zero in the population.
    • Also of particular interest in the regression output is the r-square value. Recall from our discussion of Pearson’s correlation coefficient (r) that r² estimates the proportion of variance in the dependent variable that is explained by the independent variable. The higher the value of r², the better the equation fits the data. However, with only one independent variable, the r-square value will often be relatively low even when the independent variable is significant.
    • The third important statistic in regression is the b value, which measures the effect of the independent variable on the dependent variable in terms of unit change.
    • The b value is an unstandardized coefficient, meaning that it is expressed in the original units of the variables: units of the dependent variable per one-unit change in the independent variable. There is also a standardized version of b, called beta, which allows us to interpret regression in standard deviation units. In this instance, a change of one standard deviation in the independent variable changes the predicted value of the dependent variable by beta standard deviations.
  • Similar to correlation, regression should technically be used only when both dependent and independent variables are measured at the interval level, though researchers very often use ordinal variables with many categories in regression as well. This is particularly true for the dependent variable, where having more than a few possible categories provides greater variation to explain. So in our own work it is usually desirable to use an index as the dependent variable.
  • As to independent variables, there is generally a bit more latitude with the level of measurement, insofar as researchers commonly use dichotomies (coded as zero or one) created from nominal data as independent variables. This will be the subject of Lab 18.
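
The main points above can be checked by hand outside SPSS. The following is a minimal Python sketch, not part of the lab's SPSS workflow: the five data points are hypothetical toy values (not PPIC data) chosen so that the least-squares line comes out as Income = 10,000 + 5,000 × Education, and the function name fit_line is an illustrative assumption. It computes a, b, r-square, and beta from sums of squares and shows that, with one independent variable, beta equals Pearson's r.

```python
from math import sqrt

# Hypothetical toy data (NOT from the PPIC survey): education units and income.
education = [0, 1, 2, 3, 4]
income = [11000, 13000, 21000, 25000, 30000]

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_square, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)               # variation in x
    syy = sum((yi - mean_y) ** 2 for yi in y)               # total variation in y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                  # unstandardized coefficient (slope)
    a = mean_y - b * mean_x        # intercept (constant)
    r_square = sxy ** 2 / (sxx * syy)   # explained share of the variation in y
    beta = b * sqrt(sxx / syy)     # b rescaled into standard deviation units
    return a, b, r_square, beta

a, b, r_square, beta = fit_line(education, income)
r = sqrt(r_square) if b >= 0 else -sqrt(r_square)

print(f"Income = {a:.0f} + {b:.0f} x Education")
print(f"r-square = {r_square:.4f}, r = {r:.4f}, beta = {beta:.4f}")
print(f"Predicted income at Education = 3: {a + b * 3:.0f}")
```

Because the toy points lie very close to a line, r-square here is far higher than anything typical of survey data; the sketch is meant only to show how the pieces of the equation are obtained.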

EXAMPLE 

Calculating Regression

  • Dataset:
    • PPIC October 2016
  • Dependent Variable:
    • Index of Support for Recreational Marijuana (RawMJ3; 7 categories; Alpha = .777)
      • Indicators:
        • q21 recoded as MJPropD
        • q36 recoded as MJLegalD
        • q36a recoded as MJTry
  • Independent Variables:
    • Partisan Identification
    • Political Ideology
  • Hypotheses Arrow Diagrams:
    • H1: Democratic Party ID (5 categories coded 0-1) → Support for Recreational Marijuana (7 categories coded 0-3)
    • H2: Liberal Ideology (5 categories coded 0-1) → Support for Recreational Marijuana (7 categories coded 0-3)
  • Syntax
*Weighting the Data*.
weight by weight.
*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=.0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.

*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).

*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.
*Democrat5 (adapted from lab 7)*.
if (q40c = 1) and (q40e =1) Democrat5 =0.
if (q40c = 1) and (q40e =2) Democrat5 =.25.
if (q40c = 3) Democrat5 =.5.
if (q40c =2) and (q40d =2) Democrat5 = .75.
if (q40c =2) and (q40d=1) Democrat5 =1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
recode q37 (1,2=1) (3=.5) (4,5= 0) into liberal.
value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
recode q37 (1=1) (2=.75) (3= .5 ) (4=.25) (5= 0) into liberal5. 
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.
 
regression variables=RawMJ3 Democrat5
  /dependent =  RawMJ3
  /method = enter.
regression variables=RawMJ3 liberal5
  /dependent = RawMJ3
  /method = enter.
  • Syntax Legend
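    • The data are weighted, the index items and independent variables are recoded (any value not listed in a recode is left system-missing in the new variable), and the DV index is constructed.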
    • Two separate regressions are programmed to run.
    • The regression commands’ first line specifies the included variables.
    • The second line specifies the dependent variable. Note the raw (unrecoded) index is used.
    • The third line says to enter the other variable specified as a predictor.

Output for First Regression

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .209ᵃ   .044       .043                1.12629
a. Predictors: (Constant), Democrat5

ANOVAᵃ
Model           Sum of Squares   df    Mean Square   F        Sig.
1  Regression   55.490           1     55.490        43.744   .000ᵇ
   Residual     1218.702         961   1.269
   Total        1274.191         962
a. Dependent Variable: RawMJ3
b. Predictors: (Constant), Democrat5

Coefficientsᵃ
                 Unstandardized Coefficients   Standardized Coefficients
Model            B        Std. Error           Beta                        t        Sig.
1  (Constant)    1.108    .072                                             15.355   .000
   Democrat5     .734     .111                 .209                        6.614    .000
a. Dependent Variable: RawMJ3
  • Interpretation

    • The first table summarizes the regression model. The important interpretative elements here are the values of r and r-square. Note that these are both capitalized in the output but not in writing about them. The value of r can be rounded from .209 to .21, as we generally use two digits in reporting r. We can summarize this in an X→Y diagram as Democratic ID .21 → RawMJ3.
    • The value of r² is .044, which means that variation in Democratic identification explains just over 4% of the variation in attitudes on recreational marijuana as measured by our index.
    • The second table reports information used in calculating the regression model. The most relevant bit for us is the overall significance of the model. At less than one in a thousand (reported as .000), this is well below .05, indicating that the results of the regression model as a whole are very unlikely to be due to sampling error.
    • At a somewhat more technical level, the second table’s sum of squares column contains information on the total variance of the DV (1274.2) as well as the unexplained or residual variance (1218.7) not accounted for by the model. Using the Total Variance – Unexplained Variance calculation explained in class, the Explained Variance = 55.5 (1274.2 – 1218.7 = 55.5). Moreover, the ratio of Explained Variance to Total Variance gives us the r² value of .044 (55.5/1274.2 = .044).
    • The third table provides the information needed to derive the regression equation. It contains  the constant and the b coefficient. Both are significant indicating that neither is likely due to sampling error. The equation using this information can be written in linear form as:
      y = a + b[x]. Hence: RawMJ3 = 1.108 + .734 (Democrat5)
    • The constant (1.108) tells us that when Democrat5 is scored as zero (strong Republicans), the predicted value of RawMJ3 is approximately 1.1 on an index running from 0 to 3.
    • The regression coefficient is positive, indicating that the relationship between the variables is positive. Thus as Democratic identification increases, support for recreational marijuana increases. Moreover, the equation tells us that for every one-unit increase in Democrat5, RawMJ3 increases by .734 units. The units referred to here are those in which each variable is measured; because Democrat5 is coded from 0 to 1, a one-unit increase is the full move from strong Republican to strong Democrat.
    • In a bivariate regression, the test of the regression coefficient and the test of the overall model always agree, because with a single independent variable F equals t squared (here 6.614² ≈ 43.74).
    • The magnitude of the beta value can be overlooked for now. It will become relevant when we have two or more independent variables using multivariate regression. Notice that the regression equation above incorporates the b, not the beta.
    • The interpretation of the second equation is up to you.
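
The arithmetic behind the interpretation of the first regression can be re-checked from the printed tables. The following Python sketch is an illustration outside SPSS; every figure in it is copied from the output for the first regression, while the variable and function names are assumptions made for the example. Because the printed figures are rounded, the recomputed F and t agree with the output only up to rounding. The last lines show a hypothetical recode of Democrat5 onto a 0-4 scale to illustrate what "units" means for b.

```python
# Figures copied from the SPSS output for the first regression (RawMJ3 on Democrat5).
ss_regression, ss_residual, ss_total = 55.490, 1218.702, 1274.191
df_regression, df_residual = 1, 961
constant, b, se_b = 1.108, 0.734, 0.111

# r-square is explained variation divided by total variation.
r_square = ss_regression / ss_total
# F is the regression mean square divided by the residual mean square.
f_ratio = (ss_regression / df_regression) / (ss_residual / df_residual)
# t for the coefficient is b divided by its standard error.
t_b = b / se_b

def predict(democrat5):
    """Predicted RawMJ3 from the equation RawMJ3 = 1.108 + .734(Democrat5)."""
    return constant + b * democrat5

print(f"r-square = {r_square:.3f}")
print(f"F = {f_ratio:.2f}, t = {t_b:.2f}, t squared = {t_b ** 2:.2f}")
for code, label in [(0, "strRep"), (0.25, "Rep"), (0.5, "Indep"), (0.75, "Dem"), (1, "strDem")]:
    print(f"{label:7s} predicted RawMJ3 = {predict(code):.3f}")

# Hypothetical recode (an assumption, not in the lab): Democrat5 scored 0-4 instead of 0-1.
# One unit is then a quarter of the old unit, so b shrinks to a quarter of its size,
# while the predicted values for each group stay the same.
b_rescaled = b / 4
print(f"b on a 0-4 coding = {b_rescaled:.4f}")
```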

Output for Second Regression

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .361ᵃ   .130       .129                1.07650
a. Predictors: (Constant), liberal5

ANOVAᵃ
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   169.650          1     169.650       146.395   .000ᵇ
   Residual     1134.606         979   1.159
   Total        1304.256         980
a. Dependent Variable: RawMJ3
b. Predictors: (Constant), liberal5

Coefficientsᵃ
                 Unstandardized Coefficients   Standardized Coefficients
Model            B        Std. Error           Beta                        t        Sig.
1  (Constant)    .832     .067                                             12.476   .000
   liberal5      1.351    .112                 .361                        12.099   .000
a. Dependent Variable: RawMJ3

INSTRUCTIONS

  1. In regression analysis, as with every other method of explanatory analysis, we begin by hypothesizing a relationship between an independent and a dependent variable. For example, continuing to work with the PPIC data, we may hypothesize that income or political interest affects attitudes about marijuana.
  2. It is also essential to recode both variables and identify their missing values, based on a frequency analysis of each.
  3. Run the appropriate regression syntax in SPSS.
  4. In viewing your output, consider first the ANOVA table to see whether the equation as a whole meets the standard of statistical significance.
  5. Next find the unstandardized coefficient (the column labeled “B”). This is the slope of the line and should be interpreted as the predicted effect on the dependent variable of a one unit increase in the value of the independent variable. Be sure to note the direction of the relationship.
  6. Then check to see whether we can be confident that the results are not due to chance by checking the significance of that coefficient.
  7. Finally, assess the magnitude of the r-square to determine the percent of variance in the DV explained by variation in the IV. In bivariate regression, the multiple R reported in the output equals the absolute value of the correlation coefficient between the two variables.
  8. Repeat the analysis using another independent variable from the data set such as Finances.
  9. Write out the regression equation for the relationship and interpret the meaning of your results in terms of the effect of the independent variable on the dependent variable with reference to both b and r-square.
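
Steps 4 through 7 amount to a short checklist applied to numbers read off the output. As a rough illustration only, the Python sketch below encodes that checklist; the helper name summarize_regression and its arguments are hypothetical, and the figures passed in are those already interpreted above for the first regression (SPSS prints very small significance levels as .000, entered here as 0.000).

```python
def summarize_regression(iv, dv, constant, b, sig_model, sig_b, r_square, alpha=0.05):
    """Hypothetical helper: apply steps 4-7 to figures read off the SPSS output."""
    direction = "positive" if b > 0 else "negative" if b < 0 else "flat"
    return [
        # Step 4: ANOVA table, significance of the equation as a whole.
        f"Model significant at {alpha}: {'yes' if sig_model < alpha else 'no'}",
        # Step 5: unstandardized coefficient (B column) and direction of the relationship.
        f"{dv} = {constant} + {b}({iv}); the relationship is {direction}",
        # Step 6: significance of the coefficient itself.
        f"Coefficient significant at {alpha}: {'yes' if sig_b < alpha else 'no'}",
        # Step 7: r-square as the percent of variance explained.
        f"r-square = {r_square}: about {r_square * 100:.1f}% of the variance in {dv} is explained",
    ]

summary = summarize_regression("Democrat5", "RawMJ3", 1.108, 0.734, 0.000, 0.000, 0.044)
for line in summary:
    print(line)
```

The same call can be repeated with the figures from your own output once you have run the regression in SPSS.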

QUESTIONS FOR REFLECTION

  • Do the magnitudes of the regression coefficient b and the constant depend on how your variables are coded?
  • What is the regression equation for your results and what is the meaning of each of the components?
  • How do we visualize regression results?

DISCUSSION

  • The values of the b coefficient and the constant do depend on how the variables are coded. For example, if a variable is recoded into fewer categories, both the b value and r-square will be affected. Generally speaking, when we use regression it is preferable to use variables with as much of their original variation as possible.
  • It is up to you to calculate the effect of ideology or income and decide whether they are more or less important than partisan identification.
  • Regression results can be visualized in two ways.
    • The first way is to run a Graph command in SPSS. The relevant syntax is:
      GRAPH /scatterplot = IV with DV.
      Note that the independent variable appears first in this procedure.
    • An alternative approach is to add a scatterplot subcommand to an existing regression procedure, after the /method = enter subcommand (the period that ends the command then moves to the end of the new subcommand). It takes the form:
      /scatterplot = (DV IV).
      Note that the DV precedes the IV for this.

In either case a regression line can be added by double-clicking on the scatterplot which opens a Chart Editor page in SPSS. Immediately above the scatterplot, the fifth symbol from the left adds a fit line. (Hover your cursor over the symbols until you find the right one.)

Especially when working with categorical variables, interested students may wish to create a “jittered scatter plot” by asking SPSS to add a small random number to each data value. To do so, compute new “jittered” variables and use them to visualize your results. For instance:

COMPUTE democrat5j = democrat5 + RV.UNIFORM(-0.15, +0.15).
COMPUTE RawMJ3j = RawMJ3 + RV.UNIFORM(-0.15, +0.15).

Note that the amount of jitter can be adjusted by altering the -/+ figures.
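
To get a feel for how much jitter is sensible, it helps to compare the jitter width with the spacing between categories. The Python sketch below is an illustration outside SPSS: random.uniform plays the role of RV.UNIFORM, and the function names and the fixed seed are assumptions made for the example. It shows that with ±0.15 of jitter, points from adjacent Democrat5 categories (0.25 apart) can overlap slightly, while adjacent RawMJ3 categories (0.5 apart) cannot. The jittered values are for plotting only.

```python
import random

def jitter(values, width, rng):
    """Add uniform noise between -width and +width to each value (for plotting only)."""
    return [v + rng.uniform(-width, width) for v in values]

def bands_overlap(spacing, width):
    """True if jittered points from two adjacent categories can overlap."""
    return 2 * width > spacing

rng = random.Random(16)  # fixed seed so the sketch is repeatable (arbitrary choice)

democrat5 = [0, 0.25, 0.5, 0.75, 1]          # Democrat5 codes, 0.25 apart
democrat5_j = jitter(democrat5, 0.15, rng)   # analogous to the COMPUTE lines above

for original, jittered in zip(democrat5, democrat5_j):
    print(f"{original:.2f} -> {jittered:.3f}")

print(bands_overlap(0.25, 0.15))  # Democrat5 with +/-0.15: True
print(bands_overlap(0.50, 0.15))  # RawMJ3 with +/-0.15: False
print(bands_overlap(0.25, 0.10))  # Democrat5 with +/-0.10: False
```

If the overlap between neighboring Democrat5 categories is distracting in the plot, a smaller figure such as ±0.10 keeps the categories visually separate.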