Poli 101 LAB 15
Pearson’s Correlation
PURPOSE
 To learn the meaning, use, and interpretation of Pearson’s Correlation
 To discover how to calculate the amount of variation in the dependent variable that can be explained.
MAIN POINTS
 Pearson’s Correlation Coefficient (r) is designed to measure the strength of a relationship between two interval variables. Pearson’s r can also be used to measure the strength of relationships between indicators measured at the ordinal level, as long as they are coded appropriately, such that they run from low to high values with more or less equal, incremental categories.
 As in the case of Kendall’s Tau, the sign of r indicates the direction of the relationship.
 A positive sign means that as the first variable increases in value so too does the second variable.
 A negative sign indicates that as the first increases in value, the second variable decreases (or vice versa).
 The table below provides rough standards for how to evaluate the strength of the relationship for absolute values of r.
 The closer the value of r is to the absolute value of 1, the stronger the relationship between the two variables. When r is close to zero, the relationship is very weak.
 When evaluating intermediate values of r, consider whether you are using public opinion data (like a PPIC or election study data set) or aggregate data (like worlddata.sav). Both are available for download under the Data menu on our course website. Relationships between public opinion variables, especially with ordinal level data, tend to register lower r coefficients than aggregate level data. As a result, two sets of standards are provided below: one for public opinion data and one for aggregate data.
 A very useful aspect of Pearson’s r is that it allows us to measure the amount of explanatory power that the independent variable has regarding variation in the dependent variable. More specifically, the value of r^{2} indicates the proportion of variation in the dependent variable that is explained by the variation in the independent variable. For example, if r= .35, then r^{2}=(.35)^{2}=.1225. In other words, the variation in the independent variable explains 12.25% of the variation in the dependent variable.
 It is possible to compare r^{2} values to one another to determine which relationship has the greatest explanatory power. As always, care should be taken to ensure that missing values are properly handled, as their inclusion can substantially reduce correlations.
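The r^{2} arithmetic above is easy to check by hand or with a few lines of Python (a quick sketch, outside the lab’s SPSS workflow):

```python
def explained_variation(r: float) -> float:
    """Proportion of variation in the DV explained by the IV: r squared."""
    return r ** 2

# The example from the text: r = .35 explains 12.25% of the variation.
print(round(explained_variation(0.35), 4))   # 0.1225
print(f"{explained_variation(0.35):.2%}")    # 12.25%
```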
STANDARDS (FOR PUBLIC OPINION DATA)
MAGNITUDE OF ASSOCIATION  QUALIFICATION  COMMENTS 
0.00  No Relationship  Knowing the independent variable does not at all explain variation in the dependent variable. 
.00 to .15  Not Useful  Not Acceptable 
.15 to .20  Very Weak  Minimally acceptable 
.20 to .25  Moderately Strong  Acceptable 
.25 to .30  Fairly Strong  Good Work 
.30 to .40  Strong  Great Work 
.40 to .60  Very Strong/Worrisomely Strong  Either an excellent relationship OR the two variables are measuring the same thing 
.60 to .99  Redundant (?)  Proceed with caution: are the two variables measuring the same thing? 
1.00  Perfect Relationship.  If we know the independent variable, we can predict the dependent variable with absolute success. 
STANDARDS (FOR AGGREGATE DATA)
MAGNITUDE OF ASSOCIATION  QUALIFICATION  COMMENTS 
0.00  No Relationship  Knowing the independent variable does not at all explain variation in the dependent variable. 
.00 to .30  Not Useful, very weak  Not Acceptable 
.30 to .50  Weak  Minimally acceptable 
.50 to .70  Fairly Strong  Acceptable 
.70 to .85  Strong  Good Work 
.85 to .90  Very Strong/Worrisomely Strong  Either an excellent relationship OR the two variables are measuring the same thing 
.90 to .99  Redundant (?)  Proceed with caution: are the two variables measuring the same thing? 
1.00  Perfect Relationship.  If we know the independent variable, we can predict the dependent variable with absolute success. 
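As a compact restatement of the two tables, here is a hypothetical Python helper (the names are mine, not the lab’s) that maps the absolute value of r to the qualitative labels; boundary values are assigned to the higher category:

```python
# Upper bounds and labels follow the two standards tables above.
PUBLIC_OPINION = [(0.15, "Not Useful"), (0.20, "Very Weak"),
                  (0.25, "Moderately Strong"), (0.30, "Fairly Strong"),
                  (0.40, "Strong"), (0.60, "Very Strong/Worrisomely Strong"),
                  (1.00, "Redundant (?)")]

AGGREGATE = [(0.30, "Not Useful, very weak"), (0.50, "Weak"),
             (0.70, "Fairly Strong"), (0.85, "Strong"),
             (0.90, "Very Strong/Worrisomely Strong"),
             (1.00, "Redundant (?)")]

def qualify(r: float, standards) -> str:
    """Return the qualitative label for a correlation coefficient r."""
    magnitude = abs(r)
    if magnitude == 1.0:
        return "Perfect Relationship"
    for upper, label in standards:
        if magnitude < upper:
            return label
    return "Perfect Relationship"

print(qualify(0.209, PUBLIC_OPINION))   # Moderately Strong
print(qualify(0.361, PUBLIC_OPINION))   # Strong
```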
EXAMPLE
 Dataset:
 PPIC October 2016
 Dependent Variable: Index of Support for Recreational Marijuana (RawMJ3)
 Index of Support for Recreational Marijuana (7 categories; Alpha =.777)
 Indicators: q21 recoded as MJPropD
 q36 recoded as MJLegalD
 q36a recoded as MJTry.
 Independent Variables:
 Partisan Identification
 Political Ideology
 Hypotheses Arrow Diagrams:
 H1: Democratic Party ID → Support for Recreational Marijuana (7 categories coded 0–3)
 H2: Liberal Ideology → Support for Recreational Marijuana (7 categories coded 0–3)
 Syntax
*Weighting the Data*.
weight by weight.
*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.
*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).
*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.
*Democrat5 (adapted from Lab 7)*.
if (q40c = 1) and (q40e = 1) Democrat5 = 0.
if (q40c = 1) and (q40e = 2) Democrat5 = .25.
if (q40c = 3) Democrat5 = .5.
if (q40c = 2) and (q40d = 2) Democrat5 = .75.
if (q40c = 2) and (q40d = 1) Democrat5 = 1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
recode q37 (1,2=1) (3=.5) (4,5=0) into liberal.
value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
recode q37 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into liberal5.
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'.
correlations /variables=RawMJ3 Democrat5 liberal5.
correlations /variables=RawMJ3 Democrat Democrat5 liberal liberal5.
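For intuition about what the correlations command computes, here is a minimal Python sketch of Pearson’s r (toy numbers for illustration, not PPIC data):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# A perfectly linear positive relationship yields r = 1;
# a perfectly linear negative one yields r = -1.
print(round(pearson_r([0, .25, .5, .75, 1], [0, .75, 1.5, 2.25, 3]), 3))   # 1.0
print(round(pearson_r([1, 2, 3, 4], [4, 3, 2, 1]), 3))                     # -1.0
```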
 Syntax Legend
 Missing values and recodes are specified as usual
 Each correlations command lists the raw (unrecoded) DV followed by the relevant IVs
 Note that two versions of both IVs (Democrat and liberal) are created. The first version of each IV has three values as in recent labs. The second version of each IV has five values in order to increase their variation.
First Correlation Output
RawMJ3  Democrat5  liberal5  
RawMJ3  1  .209  .361  
Democrat5  .209  1  .372  
liberal5  .361  .372  1 
Interpretation
 The correlation matrix is symmetrical, meaning that each variable selected for the correlation analysis appears in both the rows and the columns. This leads to redundant figures above and below the diagonal.
 The first correlation output contains only the DV and the five value version of each IV.
 Looking at the first column, which represents our dependent variable, we can see the Pearson’s r for each independent variable’s relationship with the DV.
 The signs of the coefficients are positive, meaning that the more Democratic or liberal a respondent is, the more supportive he or she is of recreational marijuana.
 Since this is public opinion data, we conclude that the first relationship just barely exceeds the moderately strong criterion, whereas the second is clearly in the strong range.
 Since r^{2} = (.209)^{2} = .044, we know that Democratic identification explains 4.4% of the variation in support for recreational marijuana in this data set. By comparison, liberal ideology, with a correlation of .361, accounts for 13% of the variation in the DV.
 The second correlations command produces a somewhat bigger table. It enables us to compare the three- and five-value versions of both IVs.
Second Correlation Output
RawMJ3  Democrat  Democrat5  liberal  liberal5  
RawMJ3  1  .181  .209  .322  .361  
Democrat  .181  1  .949  .321  .331  
Democrat5  .209  .949  1  .347  .372  
liberal  .322  .321  .347  1  .943  
liberal5  .361  .331  .372  .943  1 
 The first column shows that the five-value version of each IV is a somewhat stronger predictor of the DV than the corresponding three-value version.
 We also can see that the three and five values versions of each IV are so strongly related to each other as to be virtually indistinguishable.
 Both of the tables displayed above are edited from what is produced by SPSS. The full output is shown below.
 In the full output we can see the Pearson’s r value at the top of each cell. Immediately below this is the significance level. And at the bottom of each cell is the sample size (N). For instance, we can see that for the cell [RawMJ3 X Democrat] Pearson’s r = .181, significance = .000, n = 969.
Full Output

RawMJ3  Democrat  Democrat5  liberal  liberal5  
RawMJ3  Pearson Corr  1  .181  .209  .322  .361  
  Sig. (2-tailed)  —  .000  .000  .000  .000  
  N  994  969  963  981  981  
Democrat  Pearson Corr  .181  1  .949  .321  .331  
  Sig. (2-tailed)  .000  —  .000  .000  .000  
  N  969  1614  1600  1580  1580  
Democrat5  Pearson Corr  .209  .949  1  .347  .372  
  Sig. (2-tailed)  .000  .000  —  .000  .000  
  N  963  1600  1600  1567  1567  
liberal  Pearson Corr  .322  .321  .347  1  .943  
  Sig. (2-tailed)  .000  .000  .000  —  .000  
  N  981  1580  1567  1647  1647  
liberal5  Pearson Corr  .361  .331  .372  .943  1  
  Sig. (2-tailed)  .000  .000  .000  .000  —  
  N  981  1580  1567  1647  1647 
 Try to rerun this analysis with additional IVs. The syntax for some possible variables is included below, so the new variables need only be added to the correlations command.
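The varying N values in the full output reflect pairwise deletion (the SPSS CORRELATIONS default for missing data): each cell uses only the cases that have valid values on both variables in that pair. A small Python sketch with hypothetical responses:

```python
def valid_pairs(x, y):
    """Keep only cases with non-missing values on both variables."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

# Hypothetical responses; None marks a missing value.
raw_mj3   = [1.0, None, 2.5, 3.0, 0.5]
democrat5 = [0.75, 0.5, None, 1.0, 0.0]

print(len(valid_pairs(raw_mj3, democrat5)))   # 3 cases enter this cell
```

Listwise deletion, by contrast, would drop any case missing on any variable in the matrix, giving one common N for every cell.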
Additional Syntax
*Additional IVs*.
*Demographics*.
recode d1a (1=0) (2=.2) (3=.4) (4=.6) (5=.8) (6=1) into age.
value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.
recode d6 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into educ.
value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.
recode d10 (1=0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+'.
recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
recode q39 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into vote.
value labels vote 0 'never' .25 'seldom' .5 'part time' .75 'nearly' 1 'always'.
INSTRUCTIONS
Stage 1
 Select any of the available datasets for the purpose of this exercise.
 Hypothesize a relationship between two interval or ordinal variables in the data set. Although the correlation will only describe a relationship, please think about which is the dependent variable and which is the independent variable.
 For example, an individual’s attitude toward recreational marijuana use [dependent variable] depends on his or her partisanship [independent variable].
 Run the frequency distribution for each of the variables. Based on the frequency output, decide how to recode each variable (if necessary) and identify the missing values.
 Use SPSS and select the chosen dataset.
 Under the Analyze menu select “Correlate” or enter the appropriate syntax using the example from above.
 Enter the dependent & independent variables in the appropriate boxes or in your syntax. Remember to also enter the missing values.
 Add any recodes (if necessary) to your syntax and hit Run.
 Judge whether the relationship meets the standards based on the magnitude of Pearson’s r value and also based on whether the significance is below .05.
 Calculate the explanatory power of the independent variable over the dependent variable, that is, calculate rsquare (r^{2}).
 Repeat the analysis until you find a set of variables with a relationship that has a correlation value that meets the standards above.
 Compose a few sentences explaining your analysis and results.
Stage II
 Hypothesize at least two other independent variables that may explain variation in the dependent variable. Examples are given below.
 Run frequency distributions for each variable to determine recodes.
 Edit your syntax to include the additional variables.
 Note that the syntax to run the correlation is:
correlations /variables=DependentVar IndepVar
/statistics=all.
 Add other variables to the correlation line to create a matrix showing the correlations for the combination of the variables entered.
 Your syntax for the correlation should now be:
correlations /variables=DependentVar IndepVar1 IndepVar2 IndepVar3
/statistics=all.
 Rerun your edited syntax.
 Repeat this stage until you have found at least three independent variables that have acceptable correlations with the dependent variable.
QUESTIONS FOR REFLECTION
 How much more variation is explained by your 1^{st} highest ranked independent variable compared to your 2^{nd} highest ranked independent variable?
 How does Pearson’s r differ from the other measures of association?
 Does the value of Pearson’s r depend on which of the variables is the dependent variable and which is the independent variable?
 How can you use correlation analysis to find relationships?
 What are some good practical uses of correlational analysis?
 Why are the standards for public opinion data different from the standards for aggregate data?
DISCUSSION
 To determine how much more variation is explained by one independent variable compared to another, take the difference of the r^{2} values, not the difference of the r values. That is, calculate r_{A}^{2} – r_{B}^{2}.
 Unlike the other measures of association, Pearson’s r allows us to calculate the explanatory power of a relationship. To see the tau-b coefficients for the same matrix of variables, use this syntax:
nonpar corr /variables=RawMJ3 Democrat5 liberal5 /print=kendall.
 Like other measures of association, Pearson’s r only measures correlation, which is distinct from causation. Pearson’s r is also symmetric: its value will be the same whichever variable you identify as dependent or independent.
 To search broadly for strong relationships, simply plug all the variables you think may be related to one another into a correlation matrix. Then look at the Pearson’s r in each of the cells to find out which variables are related to one another. Take care to declare all missing values.
 One practical use of a correlation matrix is to find variables that would make suitable indexes. By finding variables that have a high Pearson’s r in the matrix, you will have an idea of which variables will be suitable for an index. For example, you may find that variables A & B, B & C, and A & C are all strongly correlated with one another. Once you know that these three variables are strongly correlated with one another, you can try including all three in a reliability run to see whether they make a good index.
 Another good use for a Pearson’s correlation matrix is to find good independent variables to explain your dependent variable. This will be useful when we get to Multivariate Regression.
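Applying the first discussion point to the coefficients reported earlier, here is a quick Python check (a sketch, not SPSS output) of how much more variation liberal5 explains than Democrat5:

```python
def explained(r: float) -> float:
    """r squared: proportion of DV variation explained."""
    return r ** 2

# Coefficients from the first correlation output above.
r_liberal5, r_democrat5 = 0.361, 0.209

gap = explained(r_liberal5) - explained(r_democrat5)
print(f"{explained(r_liberal5):.1%} - {explained(r_democrat5):.1%} = {gap:.1%}")
# 13.0% - 4.4% = 8.7%
```

Note that the gap in r (.361 − .209 = .152) is not the right quantity; the difference in explanatory power is the difference of the squared values.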
Advanced Analysis
Rather than using the multi-indicator index regarding recreational marijuana, one could use the following syntax (from New Lab 8) to focus on the intended vote by including the measure of issue importance (q22) in the analysis.
if (q21 =1) and (q22 =1) StrMJ = 1.
if (q21 =1) and (q22 =2) StrMJ = .86.
if (q21 =1) and (q22 =3) StrMJ = .71.
if (q21 =1) and (q22 =4) StrMJ = .57.
if (q21 =2) and (q22 =4) StrMJ = .43.
if (q21 =2) and (q22 =3) StrMJ = .29.
if (q21 =2) and (q22 =2) StrMJ = .14.
if (q21 =2) and (q22 =1) StrMJ = 0.
correlations /variables=RawMJ3 StrMJ Democrat Democrat5 liberal liberal5.
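The eight IF statements above are equivalent to a lookup table; here is a hypothetical Python restatement of the same mapping (not part of the lab’s SPSS workflow):

```python
# (q21, q22) response pair -> StrMJ score, exactly as in the IF statements.
STR_MJ = {
    (1, 1): 1.0, (1, 2): 0.86, (1, 3): 0.71, (1, 4): 0.57,
    (2, 4): 0.43, (2, 3): 0.29, (2, 2): 0.14, (2, 1): 0.0,
}

def str_mj(q21, q22):
    """Return the strength-of-support score, or None for unmapped responses."""
    return STR_MJ.get((q21, q22))

print(str_mj(1, 2))   # 0.86
print(str_mj(9, 9))   # None (left missing, like a case no IF statement matches)
```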