
Poli 101 LAB 15

Pearson’s Correlation

PURPOSE

  • To learn the meaning, use, and interpretation of Pearson’s Correlation
  • To discover how to calculate the amount of variation in the dependent variable that can be explained by the independent variable.

MAIN POINTS

  • Pearson’s Correlation Coefficient (r) is designed to measure the strength of a relationship between two interval variables.  Pearson’s r can also be used to measure the strength of relationships between indicators measured at the ordinal level as long as they are coded appropriately, such that they run from low to high values with more or less equal & incremental categories.
  • As in the case of Kendall’s Tau, the sign of r indicates the direction of the relationship.
    • A positive sign means that as the first variable increases in value so too does the second variable.
    • A negative sign indicates that as the first increases in value, the second variable decreases (or vice versa).
  • The table below provides rough standards for how to evaluate the strength of the relationship for absolute values of r.
    • The closer the value of r is to an absolute value of 1, the stronger the relationship between the two variables. When r is close to zero, the relationship is very weak.
    • When evaluating intermediate values of r, consider whether you are using public opinion data (like a PPIC or election study data set) or aggregate data (like worlddata.sav). Both are available for download under the Data menu on our course website. Relationships between public opinion variables, especially with ordinal level data, tend to register lower coefficients than aggregate level data. As a result, two sets of standards are provided: one (blue table) for public opinion data and one (green table) for aggregate data.
  • A very useful aspect of Pearson’s r is that it allows us to measure the amount of explanatory power that the independent variable has regarding variation in the dependent variable.  More specifically, the value of r2 indicates the proportion of variation in the dependent variable that is explained by the variation in the independent variable.  For example, if r= .35, then r2=(.35)2=.1225.  In other words, the variation in the independent variable explains 12.25% of the variation in the dependent variable.
  • It is possible to compare r2 values to one another to determine which relationship has the greatest explanatory power. As always care should be taken to ensure that missing values are properly handled, as their inclusion can substantially reduce correlations.
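Though this lab works in SPSS, the arithmetic behind r and r2 is easy to sketch outside it. The following Python snippet is an illustrative aside (not part of the lab syntax): it computes Pearson’s r from its definition and squares it to get the proportion of explained variation.

```python
import math

def pearson_r(x, y):
    """Pearson's r: the covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# As in the text: if r = .35, then r2 = .1225, i.e. the IV explains
# 12.25% of the variation in the DV.
r = 0.35
print(round(r ** 2, 4))  # 0.1225
```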

STANDARDS (FOR PUBLIC OPINION DATA)

ASSOCIATION MAGNITUDE   QUALIFICATION                    COMMENTS
0.00                    No Relationship                  Knowing the independent variable does not at all explain variation in the dependent variable.
.00 to .15              Not Useful                       Not Acceptable
.15 to .20              Very Weak                        Minimally acceptable
.20 to .25              Moderately Strong                Acceptable
.25 to .30              Fairly Strong                    Good Work
.30 to .40              Strong                           Great Work
.40 to .60              Very Strong/Worrisomely Strong   Either an excellent relationship OR the two variables are measuring the same thing
.60 to .99              Redundant (?)                    Proceed with caution: are the two variables measuring the same thing?
1.00                    Perfect Relationship             If we know the independent variable, we can predict the dependent variable with absolute success.
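As a quick self-check, the public-opinion cutoffs can be encoded as a small lookup function. This Python sketch is illustrative only; treating each cutoff as the start of the next bin (>=) is my assumption, since the table does not say which side an exact boundary value falls on.

```python
# Rough public-opinion standards from the table above.
# Boundary handling (>= lower cutoff) is an assumption, not in the table.
PUBLIC_OPINION_STANDARDS = [
    (1.00, "Perfect Relationship"),
    (0.60, "Redundant (?)"),
    (0.40, "Very Strong/Worrisomely Strong"),
    (0.30, "Strong"),
    (0.25, "Fairly Strong"),
    (0.20, "Moderately Strong"),
    (0.15, "Very Weak"),
]

def rate_public_opinion_r(r):
    """Classify the absolute value of r against the standards."""
    magnitude = abs(r)
    if magnitude == 0:
        return "No Relationship"
    for cutoff, label in PUBLIC_OPINION_STANDARDS:
        if magnitude >= cutoff:
            return label
    return "Not Useful"

print(rate_public_opinion_r(0.209))  # Moderately Strong
print(rate_public_opinion_r(0.361))  # Strong
```

The sign of r is dropped (abs) because these standards apply to the magnitude of the relationship, not its direction.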

STANDARDS (FOR AGGREGATE DATA)

ASSOCIATION MAGNITUDE   QUALIFICATION                    COMMENTS
0.00                    No Relationship                  Knowing the independent variable does not at all explain variation in the dependent variable.
.00 to .30              Not Useful, very weak            Not Acceptable
.30 to .50              Weak                             Minimally acceptable
.50 to .70              Fairly Strong                    Acceptable
.70 to .85              Strong                           Good Work
.80 to .90              Very Strong/Worrisomely Strong   Either an excellent relationship OR the two variables are measuring the same thing
.90 to .99              Redundant (?)                    Proceed with caution: are the two variables measuring the same thing?
1.00                    Perfect Relationship             If we know the independent variable, we can predict the dependent variable with absolute success.

EXAMPLE

  • Dataset:
    • PPIC October 2016
  • Dependent Variable: Index of Support for Recreational Marijuana (RawMJ3)
    • 7 categories; Alpha = .777
    • Indicators:
      • q21 recoded as MJPropD
      • q36 recoded as MJLegalD
      • q36a recoded as MJTry
  • Independent Variables:
    • Partisan Identification
    • Political Ideology
  • Hypotheses Arrow Diagrams:
    • H1: Democratic Party ID → Support for Recreational Marijuana (5 categories coded 0-3)
    • H2: Liberal Ideology → Support for Recreational Marijuana (7 categories coded 0-3)
  • Syntax
*Weighting the Data*.
weight by weight.
*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=.0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.

*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).

*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.
*Democrat5 (adapted from lab 7)*.
if (q40c = 1) and (q40e =1) Democrat5 =0.
if (q40c = 1) and (q40e =2) Democrat5 =.25.
if (q40c = 3) Democrat5 =.5.
if (q40c =2) and (q40d =2) Democrat5 = .75.
if (q40c =2) and (q40d=1) Democrat5 =1.
value labels Democrat5 0 'strRep' .25 'Rep' .5 'Indep' .75 'Dem' 1 'strDem'.
recode q37 (1,2=1) (3=.5) (4,5= 0) into liberal.
value labels liberal 1 'liberal' .5 'middle' 0 'conserv'.
recode q37 (1=1) (2=.75) (3= .5 ) (4=.25) (5= 0) into liberal5. 
value labels liberal5 1 'vlib' .75 'liberal' .5 'middle' .25 'conserv' 0 'vcons'. 
correlations variables=RawMJ3 Democrat5 liberal5.
correlations variables=RawMJ3 Democrat Democrat5 liberal liberal5.
  • Syntax Legend
    • Missing values and recodes are specified as usual
    • Each correlations command lists the raw (un-recoded) DV followed by the relevant IVs
    • Note that two versions of both IVs (Democrat and liberal) are created.  The first version of each IV has three values as in recent labs. The second version of each IV has five values in order to increase their variation.

First Correlation Output

            RawMJ3   Democrat5   liberal5
RawMJ3      1        .209        .361
Democrat5   .209     1           .372
liberal5    .361     .372        1
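To connect this matrix to its interpretation, here is a short Python sketch (illustrative, not SPSS) that checks the matrix’s symmetry and squares the first-column coefficients to get explained variation:

```python
# Coefficients copied from the first correlation output above.
names = ["RawMJ3", "Democrat5", "liberal5"]
R = [
    [1.000, 0.209, 0.361],  # RawMJ3
    [0.209, 1.000, 0.372],  # Democrat5
    [0.361, 0.372, 1.000],  # liberal5
]

# The matrix is symmetrical: r(x, y) == r(y, x), so every cell above
# the diagonal is mirrored below it.
assert all(R[i][j] == R[j][i] for i in range(3) for j in range(3))

# The first column holds each IV's correlation with the DV; squaring
# it gives the proportion of variation in RawMJ3 that the IV explains.
for name, row in zip(names[1:], R[1:]):
    r = row[0]
    print(f"{name}: r = {r:.3f}, r2 = {r * r:.3f}")
```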

Interpretation

  • The correlation matrix is symmetrical, meaning that each variable selected for the correlation analysis appears in both the rows and the columns.  This leads to redundant figures above and below the diagonal.
  • The first correlation output contains only the DV and the five value version of each IV.
  • Looking at its first column, which represents our dependent variable, we can read the Pearson’s r for each independent variable’s relationship with the DV.
  • The signs of the coefficients are positive, meaning that the more Democratic or liberal a respondent is, the more supportive he or she is of recreational marijuana.
  • Since this is public opinion data, we conclude that the first relationship just barely exceeds the moderately strong criterion, whereas the second is clearly in the strong range.
  • Since r-square = (.209)2 = .044, we know that Democratic identification explains 4.4% of the variation in support for recreational marijuana in this data set. By comparison, liberal ideology, with a correlation of .361, accounts for 13% of the variation in the DV.
  • The second correlation command produces a somewhat bigger table. It enables us to compare three and five values versions of both IVs.
  • Second Correlation Output

    Correlations
                RawMJ3   Democrat   Democrat5   liberal3   liberal5
    RawMJ3      1        .181       .209        .322       .361
    Democrat    .181     1          .949        .321       .331
    Democrat5   .209     .949       1           .347       .372
    liberal3    .322     .321       .347        1          .943
    liberal5    .361     .331       .372        .943       1
  • The first column shows that the five-value version of each IV is a somewhat stronger predictor of the DV than the corresponding three-value version.
  • We also can see that the three- and five-value versions of each IV are so strongly related to each other as to be virtually indistinguishable.
  • Both of the tables displayed above are edited from what is produced by SPSS. The full output is shown below.
  • In the full output we can see the Pearson’s r value at the top of each cell.  Immediately below this is the significance level, and at the bottom of each cell is the sample size (N).  For instance, we can see that for the [RawMJ3 X Democrat] cell, Pearson’s r = .181, significance = .000, n = 969.


Full Output
 

  • Correlations

                                RawMJ3   Democrat   Democrat5   liberal3   liberal5
    RawMJ3     Pearson Corr     1        .181       .209        .322       .361
               Sig. (2-tailed)           .000       .000        .000       .000
               N                994      969        963         981        981
    Democrat   Pearson Corr     .181     1          .949        .321       .331
               Sig. (2-tailed)  .000                .000        .000       .000
               N                969      1614       1600        1580       1580
    Democrat5  Pearson Corr     .209     .949       1           .347       .372
               Sig. (2-tailed)  .000     .000                   .000       .000
               N                963      1600       1600        1567       1567
    liberal3   Pearson Corr     .322     .321       .347        1          .943
               Sig. (2-tailed)  .000     .000       .000                   .000
               N                981      1580       1567        1647       1647
    liberal5   Pearson Corr     .361     .331       .372        .943       1
               Sig. (2-tailed)  .000     .000       .000        .000
               N                981      1580       1567        1647       1647
  • Try rerunning this analysis with additional IVs. The syntax for some possible variables is included below, so they need only be added to the correlations command.
  • Additional Syntax
    *Additional IVs*.
    *Demographics*.
    recode d1a (1=0) (2= .2) (3= .4) (4=.6) (5=.8) (6=1) into age.
    value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.
    
    recode d6 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into educ.
    value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.
    
    recode d10 (1 =0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
    value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+' .
    
    recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
    value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.
    
    recode q39 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into vote.
    value labels vote 0 'never' .25 'seldom' .5 'part time' .75 'nearly' 1 'always'.

INSTRUCTIONS

Stage 1

  1. Select any of the available datasets for the purpose of this exercise.
  2. Hypothesize a relationship between two interval or ordinal variables in the data set.  Although the correlation will only describe a relationship, please think about which is the dependent variable and which is the independent variable.
    • For example, an individual’s attitude toward recreational marijuana use [dependent variable] depends on his or her partisanship [independent variable].
  3. Run the Frequency distribution for each of the variables. Based on the Frequency output, decide how to recode each variable (if necessary) and identify the missing values.
  4. Use SPSS and select the chosen dataset.
  5. Under the Analyze menu select “Correlate” or enter the appropriate syntax using the example from above.
  6. Enter the dependent & independent variables in the appropriate boxes or in your syntax.  Remember to also enter the missing values.
  7. Add any recodes (if necessary) to your syntax and hit Run.
  8. Judge whether the relationship meets the standards based on the magnitude of Pearson’s r value and also based on whether the significance is below .05.
  9. Calculate the explanatory power of the independent variable over the dependent variable, that is, calculate r-square (r2).
  10. Repeat the analysis until you find a set of variables with a relationship that has a correlation value that meets the standards above.
  11. Compose a few sentences explaining your analysis and results.

Stage 2

  • Hypothesize at least two other independent variables that may explain variation in the dependent variable. Examples are given below.
  • Run frequency distributions for each variable to determine recodes.
  • Edit your syntax to include the additional variables.
  • Note that the syntax to run the correlation is:
    • correlations variables=DependentVar IndepVar
      /statistics=all.
  • Add other variables to the correlation line to create a matrix showing the correlations for the combination of the variables entered.
  • Your syntax for the correlation should now be:
    • correlations variables=DependentVar IndepVar1 IndepVar2 IndepVar3
      /statistics=all.
  • Rerun your edited syntax.
  • Repeat this stage until you have found at least three independent variables that have acceptable correlations with the dependent variable.

QUESTIONS FOR REFLECTION

  • How much more variation is explained by your highest-ranked independent variable compared to your second-ranked independent variable?
  • How does Pearson’s r differ from the other measures of association?
  • Does the value of Pearson’s r depend on which of the variables is the dependent variable and the independent variable?
  • How can you use correlation analysis to find relationships?
  • What are some good practical uses of correlational analysis?
  • Why are the standards for public opinion data different from the standards for aggregate data?

DISCUSSION

  • To determine how much more variation is explained by one independent variable compared to another, take the difference of the r-squared values, not the difference in r values.  That is, calculate rA2 – rB2.
  • Unlike the other measures of association, Pearson’s r allows us to calculate the explanatory power of a relationship. To see the tau-b coefficients for the same matrix of variables use this syntax:
    nonpar corr
        /variables= rawMJ3 Democrat5 liberal5
        /print kendall.
  • Like other measures of association, Pearson’s r only measures correlation, which is distinct from causation.  So the Pearson’s r value will be the same whichever variable you identify as dependent or independent.
  • To blindly find strong relationships, simply enter all the variables you think may be related to one another into a correlation matrix.  Then look at the Pearson’s r in each of the cells to find out which variables are related to one another. Take care to remove or declare all missing values.
  • One practical use of a correlation matrix is to find variables that would make suitable indexes.  By finding variables that have high Pearson’s r in the matrix you will have an idea of which variables will be suitable for an index.  For example, you may find that variables A&B, B&C, and A&C are all strongly correlated to one another.  Once you know that all these three variables are strongly correlated to one another, you can try including all 3 in a reliability-run to see whether they make a good index.
  • Also, another good use for Pearson’s correlation matrix is to find good independent variables to explain your dependent variable.  This will be useful when we get to Multivariate Regression.

Advanced Analysis

Rather than using the multi-indicator index regarding recreational marijuana, one could use the following syntax (from New Lab 8) to focus on intended vote by including the measure of issue importance (q22) in the analysis.

if (q21 =1) and (q22 =1) StrMJ = 1.
if (q21 =1) and (q22 =2) StrMJ = .86.
if (q21 =1) and (q22 =3) StrMJ = .71.
if (q21 =1) and (q22 =4) StrMJ = .57.
if (q21 =2) and (q22 =4) StrMJ = .43.
if (q21 =2) and (q22 =3) StrMJ = .29.
if (q21 =2) and (q22 =2) StrMJ = .14.
if (q21 =2) and (q22 =1) StrMJ = 0.

correlations 
   /variables=RawMJ3 StrMJ Democrat Democrat5 liberal liberal5.
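The if-chain above maps each (q21, q22) answer pair onto an eight-point 0–1 intensity scale: strong supporters (q21 = 1, q22 = 1) score 1 and strong opponents (q21 = 2, q22 = 1) score 0. The same lookup can be written compactly in Python (an illustrative translation, not part of the SPSS syntax):

```python
# StrMJ: vote intention (q21) folded with issue importance (q22).
# The pair-to-score mapping is copied directly from the if-chain above.
STR_MJ = {
    (1, 1): 1.00, (1, 2): 0.86, (1, 3): 0.71, (1, 4): 0.57,
    (2, 4): 0.43, (2, 3): 0.29, (2, 2): 0.14, (2, 1): 0.00,
}

def str_mj(q21, q22):
    """Return the StrMJ score, or None (missing) for unmapped answers."""
    return STR_MJ.get((q21, q22))

print(str_mj(1, 1))  # 1.0
print(str_mj(2, 1))  # 0.0
```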