UCSC Lab 15 | Data Art

Poli 101 LAB 15

Pearson’s Correlation

PURPOSE

To learn the meaning, use, and interpretation of Pearson’s Correlation.
To discover how to calculate the amount of variation in the dependent variable that can be explained by the independent variable.

MAIN POINTS

Pearson’s Correlation Coefficient (r) is designed to measure the strength of a relationship between two interval variables. Pearson’s r can also be used to measure the strength of relationships between indicators measured at the ordinal level, as long as they are coded appropriately: they should run from low to high values with more or less equal, incremental categories.
As in the case of Kendall’s Tau, the sign of r indicates the direction of the relationship.
- A positive sign means that as the first variable increases in value so too does the second variable.
- A negative sign indicates that as the first increases in value, the second variable decreases (or vice versa).
The table below provides rough standards for how to evaluate the strength of the relationship for absolute values of r.
- The closer the absolute value of r is to 1, the stronger the relationship between the two variables. When r is close to zero, the relationship is very weak.
- When evaluating intermediate values of r, consider whether you are using public opinion data (like a PPIC or election study data set) or aggregate data (like worlddata.sav). Both are available for download under the Data menu on our course website. Relationships between public opinion variables, especially with ordinal-level data, tend to register lower r coefficients than relationships in aggregate-level data. As a result, two sets of standards are provided: one (blue table) for public opinion data and one (green table) for aggregate data.
A very useful aspect of Pearson’s r is that it allows us to measure the amount of explanatory power that the independent variable has regarding variation in the dependent variable. More specifically, the value of r² indicates the proportion of variation in the dependent variable that is explained by the variation in the independent variable. For example, if r= .35, then r²=(.35)²=.1225. In other words, the variation in the independent variable explains 12.25% of the variation in the dependent variable.
It is possible to compare r² values to one another to determine which relationship has the greatest explanatory power. As always, care should be taken to ensure that missing values are properly handled, because treating missing-value codes as real scores can substantially distort correlations.
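The warning about missing values can be illustrated with a minimal Python sketch (Python is used here only for illustration; the lab itself uses SPSS). The data are invented, the helper name `pearson_r` is hypothetical, and -9 stands in for a survey missing-value code that was never declared as missing.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from its definition: the covariation of x and y
    divided by the product of their individual spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Invented data: -9 is a missing-value code, not a real answer.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, -9, -9]

# Wrong: the -9 codes are treated as if they were valid low scores.
r_with_codes = pearson_r(x, y)

# Right: drop every case where either variable is missing.
valid = [(a, b) for a, b in zip(x, y) if a != -9 and b != -9]
x_valid = [a for a, _ in valid]
y_valid = [b for _, b in valid]
r_valid = pearson_r(x_valid, y_valid)

print(f"r with -9 codes left in:       {r_with_codes:.3f}")
print(f"r with missing cases excluded: {r_valid:.3f}")
print(f"r-squared (valid cases):       {r_valid ** 2:.3f}")
```

In this toy data the valid cases are strongly and positively related, yet leaving the two -9 codes in is enough to turn the coefficient negative.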

STANDARDS (FOR PUBLIC OPINION DATA)

MAGNITUDE OF ASSOCIATION	QUALIFICATION	COMMENTS
0.00	No Relationship	Knowing the independent variable does not at all explain variation in the dependent variable.
.00 to .15	Not Useful	Not Acceptable
.15 to .20	Very Weak	Minimally acceptable
.20 to .25	Moderately Strong	Acceptable
.25 to .30	Fairly Strong	Good Work
.30 to .40	Strong	Great Work
.40 to .60	Very Strong/Worrisomely Strong	Either an excellent relationship OR the two variables are measuring the same thing
.60 to .99	Redundant (?)	Proceed with caution: are the two variables measuring the same thing?
1.00	Perfect Relationship.	If we know the independent variable, we can predict the dependent variable with absolute success.
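
One way to apply the public opinion standards consistently is to look the value up in code. The following is a minimal Python sketch, not part of the SPSS workflow. The function name `qualify` is hypothetical, and because adjacent bands in the table share an endpoint, the sketch assumes each band includes its lower bound and excludes its upper bound.

```python
# Upper bounds and labels copied from the public opinion table above.
# Assumption: a band includes its lower bound and excludes its upper bound.
PUBLIC_OPINION_BANDS = [
    (0.15, "Not Useful"),
    (0.20, "Very Weak"),
    (0.25, "Moderately Strong"),
    (0.30, "Fairly Strong"),
    (0.40, "Strong"),
    (0.60, "Very Strong/Worrisomely Strong"),
    (1.00, "Redundant (?)"),
]

def qualify(r):
    """Return the table's label for a Pearson's r, ignoring its sign."""
    size = abs(r)
    if size == 0:
        return "No Relationship"
    if size >= 1:
        return "Perfect Relationship"
    for upper, label in PUBLIC_OPINION_BANDS:
        if size < upper:
            return label

for r in (-0.150, 0.35, 0.681):
    print(r, qualify(r))
```

The sign is dropped before the lookup because the standards are stated for absolute values of r; the direction of the relationship is read separately from the sign.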

STANDARDS (FOR AGGREGATE DATA)

MAGNITUDE OF ASSOCIATION	QUALIFICATION	COMMENTS
0.00	No Relationship	Knowing the independent variable does not at all explain variation in the dependent variable.
.00 to .30	Not Useful, very weak	Not Acceptable
.30 to .50	Weak	Minimally acceptable
.50 to .70	Fairly Strong	Acceptable
.70 to .85	Strong	Good Work
.80 to .90	Very Strong/Worrisomely Strong	Either an excellent relationship OR the two variables are measuring the same thing
.90 to .99	Redundant (?)	Proceed with caution: are the two variables measuring the same thing?
1.00	Perfect Relationship.	If we know the independent variable, we can predict the dependent variable with absolute success.

EXAMPLE

Dataset:
- ANES 2012
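Dependent Variable:
- Egalitarianism (RawEqIndex)
Independent Variables:
- Democratic party identification (pid)
- Feeling toward Democrats (ft_dem)
- Improved personal finances (finance_finpast_x)
- Improved national economy (econ_ecnext_x)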

Hypothesis Arrow Diagram:

Democratic ID –> Egal
Positive Feeling toward Democrats –> Egal
Improved Finances –> Egal
Improved Econ –> Egal

Syntax

*Identifying EconEq Index Items*.
weight by weight_full.
missing values cses_govtact (-9 thru -6).
recode cses_govtact (1=1) (2=.75) (3= .5) (4= .25) (5=0) into eceq1.

missing values ineqinc_ineqreduc (-9 thru -6).
recode ineqinc_ineqreduc (1=1) (2=0) (3= .5) into eceq3.

missing values guarpr_self (-9 thru -2).
recode guarpr_self (1=1) (2=.832) (3= .666) (4= .5) (5= .332) 
  (6= .166) (7=0) into eceq5.

*Conducting Reliability Analysis*.
reliability
   /variables= eceq1 eceq3 eceq5
   /scale (EcEq3) eceq1 eceq3 eceq5
   /summary = all.

*Constructing the Index*.
compute RawEqIndex = eceq1 + eceq3 + eceq5.
fre var RawEqIndex
   /statistics = mean median stddev skew kurtosis.

*Recoding the Index*.
recode RawEqIndex (.00 thru 1.00 =1) (1.01 thru 1.85 =2) 
   (1.86 thru 3 = 3) into IEcEq3.
fre var IEcEq3
   /statistics mean median stddev skew kurtosis.

*Creating Independent Variables*.
missing values pid_self (-9 thru 0, 5).
missing values pid_x (-2).
recode pid_self (1=1) (3 = .5) (2=0) into pid.
value labels pid 1 'Dem' .5 'Ind' 0 'Rep'.

*partisan feeling thermometers*.
missing values ft_dem ft_rep (-2, -8, -9).
fre var ft_dem ft_rep.

*Personal finance-past & future*.
missing values finance_finpast_x (-9 thru -1).
missing values finance_finnext_x (-9 thru -1).
fre var =finance_finpast_x finance_finnext_x
   /statistics mean median stddev.

*Economy-past & future*.
missing values econ_ecpast_x (-9 thru -1).
missing values econ_ecnext_x (-9 thru -1).

*Correlational Analysis*.
corr variables = raweqindex ft_dem pid finance_finpast_x econ_ecnext_x.

Output

Correlations
		RawEqI	DemFeel	DemPid	Personal finances	US Econ
RawEqIndex	Correlation	1	.511	.465	-.150	-.250
	Sig.		.000	.000	.000	.000
	N	5047	5014	4777	5001	4920
DemFeel	Correlation	.511	1	.681	-.270	-.396
	Sig.	.000		.000	.000	.000
	N	5014	5856	5529	5800	5695
Dem PID	Correlation	.465	.681	1	-.211	-.278
	Sig.	.000	.000		.000	.000
	N	4777	5529	5559	5508	5416
Personal finances	Correlation	-.150	-.270	-.211	1	.299
	Sig.	.000	.000	.000		.000
	N	5001	5800	5508	5845	5683
US Econ	Correlation	-.250	-.396	-.278	.299	1
	Sig.	.000	.000	.000	.000
	N	4920	5695	5416	5683	5738

Interpretation
- The correlation matrix is symmetrical, meaning that each variable selected for the correlation analysis appears in both the rows and the columns. This leads to redundant figures above and below the diagonal.
- Looking at the first column, which represents our dependent variable, we can see the Pearson’s r value at the top of each cell. In the middle of the cell is the significance level. At the bottom of each cell is the sample size (N). For instance, we can see that for cell [RawEqIndex X personal finances] Pearson’s r= -.150, significance=.000, n=5001.
- The sign of the coefficient is negative, meaning that the better one’s personal finances the less likely they are to endorse egalitarian attitudes.
- Since this is public opinion data, we conclude that there is a very weak inverse relationship between personal finances and egalitarian attitudes.
- Since r-square = (-.150)²=.023, we know that personal finances explains 2.3% of the variation in egalitarian attitudes in this data set.
- The second column (or row) of the table shows that the correlation between feelings toward the Democrats and identifying as a Democrat is .681, which is high enough to suggest that these two variables are actually measuring the same concept. When you use your own variables, you should use theory and your knowledge of the world to decide whether the relationship is too high or whether the independent variable is simply an excellent predictor.
- Try to rerun this analysis using the indicators for anticipated future finances and economic performance or feelings toward the Republicans as IVs. These variables are included in the above syntax so they need only be included on the correlations command.
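
The r-square step in the interpretation above can be repeated for every independent variable at once. This is a minimal Python sketch for checking the arithmetic by hand, not SPSS output; the r values are copied from the first column of the Output table, and the dictionary names are hypothetical.

```python
# Pearson's r between RawEqIndex and each independent variable,
# copied from the first column of the Output table.
r_with_dv = {
    "DemFeel": 0.511,
    "Dem PID": 0.465,
    "Personal finances": -0.150,
    "US Econ": -0.250,
}

# r-squared is the proportion of variation in the DV explained by each IV.
explained = {name: r ** 2 for name, r in r_with_dv.items()}

# List the IVs from most to least explanatory power.
ranked = sorted(explained.items(), key=lambda item: item[1], reverse=True)
for name, r2 in ranked:
    print(f"{name}: r = {r_with_dv[name]:+.3f}, r-squared = {r2:.4f}")
```

Squaring removes the sign, so a negative correlation such as US Econ can still rank above a weaker one such as Personal finances.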

INSTRUCTIONS

Stage I

Select any of the available datasets for the purpose of this exercise.
Hypothesize a relationship between two interval or ordinal variables in the data set. Although the correlation will only describe a relationship, please think about which is the dependent variable and which is the independent variable.
- For example, an individual’s attitude toward egalitarianism [dependent variable] depends on his or her partisanship [independent variable].
Run the Frequency distribution for each of the variables. Based on the Frequency output, decide how to recode each variable (if necessary) and identify the missing values.
Use SPSS and select the chosen dataset.
Select “Correlation” or enter the appropriate syntax using the example from above.
Enter the dependent & independent variables in the entry boxes for ‘Step 1’. Remember to also enter the missing values.
Enter any recodes (if necessary) in ‘Step 3’ and hit Run.
Judge whether the relationship meets the standards based on the magnitude of Pearson’s r value and also based on whether the significance is below .05.
Calculate the explanatory power of the independent variable over the dependent variable. That is, calculate r-square (r²).
Repeat the analysis until you find a set of variables with a relationship that has a correlation value that meets the standards above.
Compose a few sentences explaining your analysis and results.

Stage II

Hypothesize at least two other independent variables that may explain variation in the dependent variable. Examples are given below.
Run frequency distributions for each variable to determine recodes.
Edit your syntax to include the additional variables.
Note that the syntax to run the correlation is:
- correlation DependentVar IndepVar
  /statistics=all.
Add other variables to the correlation line to create a matrix showing the correlations for the combination of the variables entered.
Your syntax for the correlation should now be:
- correlation DependentVar IndepVar1 IndepVar2 IndepVar3
  /statistics=all.
Rerun your edited syntax.
Repeat this stage until you have found at least three independent variables that have acceptable correlations with the dependent variable.

Additional Syntax

*Additional IVs*.
*Demographics*.
missing values dem_age_r_x (-9 thru -1).
missing values dem_agegrp_iwdate (-9 thru -1).
fre var dem_age_r_x dem_agegrp_iwdate
   /statistics = mean median stddev.

missing values inc_totinc (-9 thru -1).
missing values inc_incgroup_pre (-9 thru -1).
fre var inc_totinc inc_incgroup_pre
   /statistics mean median stddev.

missing values dem_edugroup (-9, -2).
fre var dem_edugroup.

freq var gender_respondent.

*Others*.
missing values health_2010hcr_x (-9, -8).
fre var pid_self, pid_x health_2010hcr_x.

missing values libcpre_self (-9 thru -2).
fre var libcpre_self.

fre var egal_equal egal_toofar egal_bigprob egal_worryless egal_notbigprob
   egal_fewerprobs.
*Identifying Egalitarian Index Items*.
missing values egal_equal to egal_fewerprobs (-9 thru -6).
fre var egal_equal to egal_fewerprobs.

QUESTIONS FOR REFLECTION

How much more variation is explained by your 1st highest ranked independent variable compared to your 2nd highest ranked independent variable?
How does Pearson’s r differ from the other measures of association?
Does the value of Pearson’s r depend on which of the variables is the dependent variable and the independent variable?
How can you use correlation analysis to find relationships?
What are some good practical uses of correlational analysis?
Why are the standards for public opinion data different from the standards for aggregate data?

DISCUSSION

To determine how much more variation is explained by one independent variable compared to another, take the difference of the r-squared values, not the difference in r values. That is, calculate r_A² – r_B².
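As a worked instance of r_A² – r_B², the following minimal Python sketch uses the two strongest correlates of RawEqIndex from the example Output. The function name `extra_variation_explained` is hypothetical and the code is only an arithmetic check, not part of the SPSS workflow.

```python
def extra_variation_explained(r_a, r_b):
    """Difference in explained variation: r_A squared minus r_B squared."""
    return r_a ** 2 - r_b ** 2

# The two strongest correlates of RawEqIndex in the example Output.
r_dem_feel = 0.511
r_dem_pid = 0.465

gain = extra_variation_explained(r_dem_feel, r_dem_pid)
print(f"DemFeel explains {gain * 100:.1f} percentage points more "
      f"of the variation in RawEqIndex than Dem PID")
```

Here the difference is .261121 – .216225 = .044896, or roughly 4.5 percentage points of the variation in the dependent variable.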
Unlike the other measures of association, Pearson’s r allows us to calculate the explanatory power of a relationship through r². To see the tau-b coefficients for the same matrix of variables, use this syntax:
```
nonpar corr
    /variables= raweqindex ft_dem pid finance_finpast_x econ_ecnext_x
    /print kendall.
```
Like other measures of association, Pearson’s r measures only correlation, which is distinct from causation. The calculation treats the two variables symmetrically, so the Pearson’s r value will be the same whichever variable you identify as dependent or independent.
To blindly find strong relationships, simply plug in all the variables you think may be related to one another into a correlation matrix. Then look at the Pearson’s r in each of the cells to find out which variables are related to one another.
One practical use of a correlation matrix is to find variables that would make suitable indexes. By finding variables that have a high Pearson’s r in the matrix, you will have an idea of which variables will be suitable for an index. For example, you may find that variables A&B, B&C, and A&C are all strongly correlated with one another. Once you know that all three variables are strongly correlated with one another, you can try including all three in a reliability run to see whether they make a good index.
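The search for trios such as A, B, and C can be sketched in Python. This is a minimal illustration rather than a substitute for the reliability run: the correlations are copied from the example Output, while the function names and the .40 cutoff are hypothetical choices made for this sketch.

```python
from itertools import combinations

# Correlations copied from the example Output (one entry per pair,
# since the matrix is symmetrical).
corr = {
    ("RawEqIndex", "DemFeel"): 0.511,
    ("RawEqIndex", "Dem PID"): 0.465,
    ("RawEqIndex", "Personal finances"): -0.150,
    ("RawEqIndex", "US Econ"): -0.250,
    ("DemFeel", "Dem PID"): 0.681,
    ("DemFeel", "Personal finances"): -0.270,
    ("DemFeel", "US Econ"): -0.396,
    ("Dem PID", "Personal finances"): -0.211,
    ("Dem PID", "US Econ"): -0.278,
    ("Personal finances", "US Econ"): 0.299,
}

def r_between(a, b):
    """Look up r for a pair regardless of the order the names are given in."""
    return corr.get((a, b), corr.get((b, a)))

def index_candidates(variables, threshold):
    """Return every trio whose three pairwise |r| values all reach the threshold."""
    trios = []
    for trio in combinations(variables, 3):
        pairs = combinations(trio, 2)
        if all(abs(r_between(a, b)) >= threshold for a, b in pairs):
            trios.append(trio)
    return trios

variables = ["RawEqIndex", "DemFeel", "Dem PID", "Personal finances", "US Econ"]
THRESHOLD = 0.40  # hypothetical cutoff chosen for this illustration
candidates = index_candidates(variables, THRESHOLD)
print(candidates)
```

A trio that passes this screen is only a candidate. Whether the items belong together in one index is still a question for theory and for the reliability analysis.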
Another good use of a Pearson’s correlation matrix is to find good independent variables to explain your dependent variable. This will be useful when we get to Multivariate Regression.