POL242 LAB 15

Pearson’s Correlation

PURPOSE

  • To learn the meaning, use, and interpretation of Pearson’s Correlation
  • To discover how to calculate the amount of variation in the dependent variable that can be explained by the independent variable

MAIN POINTS

  • Pearson’s Correlation Coefficient (r) is designed to measure the strength of a relationship between two interval variables (the formula is given below for reference).  Pearson’s r can also be used to measure the strength of relationships between indicators measured at the ordinal level, as long as they are coded appropriately, such that they run from low to high values with more or less equal, incremental categories.
  • As in the case of Kendall’s Tau, the sign of r indicates the direction of the relationship.
    • A positive sign means that as the first variable increases in value so too does the second variable.
    • A negative sign indicates that as the first increases in value, the second variable decreases (or vice versa).
  • The tables below provide rough standards for evaluating the strength of the relationship based on the absolute value of r.
    • The closer the absolute value of r is to 1, the stronger the relationship between the two variables.  When r is close to zero, the relationship is very weak.
    • When evaluating intermediate values of r, consider whether you are using public opinion data (like an election study data set) or aggregate data (like worlddata.sav, available for download under the Data menu on our course website).  Relationships between public opinion variables, especially with ordinal-level data, tend to register lower coefficients than aggregate-level data.  As a result, we provide two sets of standards below: one for public opinion data and one for aggregate data.
  • A very useful aspect of Pearson’s r is that it allows us to measure the amount of explanatory power that the independent variable has over the dependent variable.  The value of r² indicates the proportion of variation in the dependent variable that is explained by the variation in the independent variable.  For example, if r = .35, then r² = (.35)² = .1225.  In other words, the variation in the independent variable explains 12.25% of the variation in the dependent variable.
  • It is possible to compare r² values to one another to determine which relationship has the greatest explanatory power.  As always, care should be taken to ensure that missing values are properly handled, as their inclusion can substantially reduce correlations.
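
For reference, Pearson’s r between two variables x and y is the covariance of the two variables divided by the product of their standard deviations (SPSS computes it for you, so you will not need to calculate it by hand):

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
\]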

STANDARDS (FOR PUBLIC OPINION DATA)

MAGNITUDE OF ASSOCIATION   QUALIFICATION                    COMMENTS
0.00                       No Relationship                  Knowing the independent variable does not reduce the number of errors in predicting the dependent variable at all.
.00 to .15                 Not Useful                       Not Acceptable
.15 to .20                 Very Weak                        Minimally acceptable
.20 to .25                 Moderately Strong                Acceptable
.25 to .30                 Fairly Strong                    Good Work
.30 to .40                 Strong                           Great Work
.40 to .60                 Very Strong/Worrisomely Strong   Either an excellent relationship OR the two variables are measuring the same thing
.60 to .99                 Redundant (?)                    Proceed with caution: are the two variables measuring the same thing?
1.00                       Perfect Relationship             If we know the independent variable, we can predict the dependent variable with absolute success.

 

STANDARDS (FOR AGGREGATE DATA)

MAGNITUDE OF ASSOCIATION   QUALIFICATION                    COMMENTS
0.00                       No Relationship                  Knowing the independent variable does not reduce the number of errors in predicting the dependent variable at all.
.00 to .30                 Not Useful, Very Weak            Not Acceptable
.30 to .50                 Weak                             Minimally acceptable
.50 to .70                 Fairly Strong                    Acceptable
.70 to .85                 Strong                           Good Work
.85 to .90                 Very Strong/Worrisomely Strong   Either an excellent relationship OR the two variables are measuring the same thing
.90 to .99                 Redundant (?)                    Proceed with caution: are the two variables measuring the same thing?
1.00                       Perfect Relationship             If we know the independent variable, we can predict the dependent variable with absolute success.

EXAMPLE

  • Dataset:
    • CES2011
  • Dependent Variable:
      • Egal (Alpha .67)
        • Indicators: PES11_41; mbs11_k2; mbs11_b3.
  • Independent Variables:
      • Conserv, Lang, Age, Female, Educ, Inc, Finances, Econ

Hypothesis Arrow Diagram:

    • Conserv –> Egal
    • Inc –> Egal
    • Finances –> Egal
    • Econ –> Egal
  • Syntax
weight by WGTSamp.
*Preparing indicators of Attitudes re Inequality*.
*declare missing values on pes11_41*.
missing values pes11_41 (8,9).

*reverse scoring on pes11_41 and make it range from 0-1*.
recode PES11_41 (1=1) (2=.75) (3=.5) (4= .25)
  (5=0) into undogap.
value labels undogap 0 'muchless' .25 'someless' .5 'asnow'
   .75 'somemore' 1 'muchmore'.

*rescale mbs11_k2 from 0-10 to 0-1 and reverse its scoring*.
missing values mbs11_k2 (-99).

compute govact = (((mbs11_k2 * -1) +10)/10).
value labels govact 0 'not act' 1 'gov act'.

*recode and re-label mbs11_b3*.
recode mbs11_b3 (1=1) (2=0) into goveqch.
value labels goveqch 1 'decent living' 0 'leave alone'.

*create an indexed variable (alpha=.66)*.
compute rawegal = undogap + govact + goveqch.
fre var = rawegal.

*Preparing IV indicator-party identification*.
recode cps11_71 (2=1) (1=2) (4=3) (3=4) into PID4.
value labels PID4 1 'Cons' 2 'Lib' 3 'BQ' 4 'NDP'.

*dichotomize PID IV*.
recode PID4 (1=1) (2,3,4 = 0) into PID2.
value labels PID2 1 'Conserv' 0 'Other'.

*interval measure of partisan feeling from Lab 7*.
fre var cps11_18.
recode cps11_18 (0=0) (else = copy) into ConFeel.
missing values ConFeel (996, 998, 999).
fre var ConFeel.

*create finance measures (from Lab 7)*.
missing values cps11_66 (8,9).
recode cps11_66 (1=1) (3=0) (5=.5) into finances.
variable labels finances 'personal finances'.
value labels finances 0 'worse' .5 'same' 1 'better'.

missing values cps11_39 (8,9).
recode cps11_39 (1=1) (3=0) (5=.5) into Econ.
value labels Econ 0 'worse' .5 'same' 1 'better'.

*Correlational Analysis*.
Corr variables = rawegal ConFeel PID2 finances Econ.
  • Output
Correlations
                         rawegal   ConFeel     PID2   finances     Econ
rawegal     Correlation     1.00     -.384    -.400      -.241    -.209
            Sig.              —       .000     .000       .000     .000
            N                851       832      628        850      843
ConFeel     Correlation    -.384      1.00     .615       .182     .218
            Sig.            .000        —      .000       .000     .000
            N                832      3252     2301       3229     3191
PID2        Correlation    -.400      .615     1.00       .167     .245
            Sig.            .000      .000       —        .000     .000
            N                628      2301     2364       2350     2326
personal    Correlation    -.241      .182     .167       1.00     .291
finances    Sig.            .000      .000     .000         —      .000
            N                850      3229     2350       3436     3361
Econ        Correlation    -.209      .218     .245       .291     1.00
            Sig.            .000      .000     .000       .000       —
            N                843      3191     2326       3361     3382

 

  • Interpretation
    • The correlation matrix is symmetrical, meaning that each variable selected for the correlation analysis appears in both the rows and the columns.  This leads to redundant figures above and below the diagonal.
    • Looking at the first column, which represents our dependent variable, we can see the Pearson’s r value at the top of each cell.  In the middle of the cell is the significance level.  At the bottom of each cell is the sample size (N).  For instance, we can see that for the cell [rawegal X personal finances], Pearson’s r = -.241, significance = .000, n = 850 (on why the Ns differ across cells, see the note on missing data after this list).
    • The sign of the coefficient is negative, meaning that the better one’s personal finances, the less likely one is to endorse egalitarian attitudes.
    • Since this is public opinion data, we conclude that there is a moderately strong inverse relationship between personal finances and egalitarian attitudes.
    • Since r² = (-.241)² = .058, we know that personal finances explains 5.8% of the variation in egalitarian attitudes in this data set.
    • The second column (or row) of the table shows that the correlation between feelings toward the Conservatives (ConFeel) and identifying as a Conservative (PID2) is .615, which is high enough to suggest that these two variables are actually measuring the same concept.  When you use your own variables, you should use theory and your knowledge of the world to decide whether the relationship is too high, or just an excellent predictor.
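
A note on missing data: the Ns in the matrix above differ from cell to cell because, by default, SPSS drops missing cases pairwise (each coefficient uses every case that is valid on that particular pair of variables).  If you want every coefficient computed on the same set of cases, you can request listwise deletion instead.  A minimal sketch, using the variables from this example:

corr variables = rawegal ConFeel PID2 finances Econ
  /missing = listwise.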

 

INSTRUCTIONS

Stage I

  1. Select any of the available datasets for the purpose of this exercise.
  2. Hypothesize a relationship between two interval or ordinal variables in the data set.  Although the correlation will only describe a relationship, please think about which is the dependent variable and which is the independent variable.
    • For example, an individual’s attitude toward egalitarianism [dependent variable] depends on his or her partisanship [independent variable].
  3. Run the Frequency distribution for each of the variables.  Based on the Frequency output, decide how to recode each variable (if necessary) and identify the missing values.
  4. Use SPSS and select the chosen dataset.
  5. Select “Correlation” or enter the appropriate syntax using the example from above (a minimal sketch of the syntax also appears after this list).
  6. Enter the dependent & independent variables in the entry boxes for ‘Step 1’.  Remember to also enter the missing values.
  7. Enter any recodes (if necessary) in ‘Step 3’ and hit Run.
  8. Judge whether the relationship meets the standards based on the magnitude of Pearson’s r value and also based on whether the significance is below .05.
  9. Calculate the explanatory power of the independent variable over the dependent variable.  That is, calculate r-squared (r²).
  10. Repeat the analysis until you find a set of variables with a relationship that has a correlation value that meets the standards above.
  11. Compose a few sentences explaining your analysis and results.
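
For orientation, here is a minimal sketch of the Stage I syntax.  The variable names DepVar and IndepVar and the missing-value codes (8, 9) are placeholders; substitute the names, codes, and recodes you identified from your own frequency output.

*declare the missing values found in the frequency output (placeholder codes)*.
missing values DepVar IndepVar (8,9).

*recode the independent variable, if necessary, so it runs from low to high (placeholder recode)*.
recode IndepVar (1=1) (2=.5) (3=0) into IndepVarR.
value labels IndepVarR 0 'low' .5 'middle' 1 'high'.

*request the Pearson correlation; square the printed coefficient by hand to get r-squared*.
corr variables = DepVar IndepVarR.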

Stage II

  • Hypothesize at least two other independent variables that may explain variation in the dependent variable. Examples are given below.
  • Run frequency distributions for each variable to determine recodes.
  • Edit your syntax to include the additional variables.
  • Note that the syntax to run the correlation is:
    • correlation DependentVar IndepVar
      /statistics=all.
  • Add other variables to the correlation line to create a matrix showing the correlations for the combination of the variables entered.
  • Your syntax for the correlation should now be:
    • correlation DependentVar IndepVar1 IndepVar2 IndepVar3 
      /statistics=all.
  • Rerun your edited syntax.
  • Repeat this stage until you have found at least three independent variables that have acceptable correlations with the dependent variable.

Additional Syntax

*Some additional IV indicators (from Lab 1)*.
*create language group indicator*.
recode cps_intlang11 (1=0) (5=1) into french.

*Create Age indicator*.
missing values cps11_78 (9998, 9999).
compute age = (2011- cps11_78 ).

*Create gender indicator*.
recode rgender11 (1=0) (5=1) into female.

*Create education measure*.
recode cps11_79 (1=1) (else = copy) into educ.
value labels educ 1 ' no sch' 2 ' some elem' 3 'compl elem' 4 'some sec'
  5 'compl sec' 6 'some tech cc' 7 'compl tech cc' 8 'some univ'
  9 ' bach' 10 ' mast' 11 'prof doc' 98 'dk' 99 'ref'.
missing values educ (98,99).
recode educ (1 thru 4 = 1) (5=2) (6,7,8=3) (9=4) (10,11=5) into educ5.
value labels educ5 1 '<sec' 2 'sec' 3 'some post sec' 4 'post sec' 5 'grad sch'.

*Create Income measure*.
missing values cps11_92 (998, 999).
missing values cps11_93 (98,99).
numeric income.
if (cps11_93 =1) or (cps11_92 lt 30) income = 1.
if (cps11_93 =2) or ((cps11_92 ge 30) and (cps11_92 lt 60)) income = 2.
if (cps11_93 =3) or ((cps11_92 ge 60) and (cps11_92 lt 90)) income = 3.
if (cps11_93 =4) or ((cps11_92 ge 90) and (cps11_92 lt 110)) income = 4.
if (cps11_93 =5) or (cps11_92 ge 110) income = 5.
value labels income 1 '<$30k' 2 '$30k-$59k' 3 '$60k-$89k' 4 '$90k-$109k'
   5 '$110k+'.

QUESTIONS FOR REFLECTION

  • How much more variation is explained by your highest-ranked independent variable compared to your second-ranked independent variable?
  • How does Pearson’s r differ from the other measures of association?
  • Does the value of Pearson’s r depend on which of the variables is the dependent variable and the independent variable?
  • How can you use correlation analysis to find relationships?
  • What are some good practical uses of correlational analysis?
  • Why are the standards for public opinion data different from the standards for aggregate data?

DISCUSSION

  • To determine how much more variation is explained by one independent variable compared to another, take the difference of the r-squared values, not the difference in r values.  That is, calculate rA² – rB².  For example, in the output above ConFeel explains (-.384)² = .147 of the variation in rawegal while personal finances explains (-.241)² = .058, so ConFeel explains about 8.9 percentage points more of the variation.
  • Unlike the other measures of association, Pearson’s r allows us to calculate the explanatory power of a relationship.  To see the tau-b coefficients for the same matrix of variables, use this syntax:
    nonpar corr
        /variables=rawegal ConFeel PID2 finances Econ
        /print kendall.
  • Like other measures of association, Pearson’s r only measures correlation, which is distinct from causation.  So the Pearson’s r value will be the same whichever variable you identify as dependent or independent.
  • To blindly find strong relationships, simply enter all the variables you think may be related to one another into a correlation matrix.  Then look at the Pearson’s r in each of the cells to find out which variables are related to one another.
  • One practical use of a correlation matrix is to find variables that would make suitable indexes.  By finding variables that have a high Pearson’s r in the matrix, you will have an idea of which variables will be suitable for an index.  For example, you may find that variables A & B, B & C, and A & C are all strongly correlated with one another.  Once you know that these three variables are strongly correlated with one another, you can try including all three in a reliability run to see whether they make a good index (a sketch of the syntax appears at the end of this section).
  • Another good use for a Pearson’s correlation matrix is to find good independent variables to explain your dependent variable.  This will be useful when we get to Multivariate Regression.
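
Following up on the index-building point above, here is a minimal sketch of a reliability run using the three egalitarianism indicators created in this lab (substitute your own strongly correlated variables):

*check whether the correlated items hang together as an index*.
reliability
  /variables = undogap govact goveqch
  /model = alpha.

If the resulting alpha is acceptable, combine the items with a compute statement, as was done for rawegal in the example above.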