# UCSC Lab 15

**Poli 101 LAB 15**

**Pearson’s Correlation**

**PURPOSE**

- To learn the meaning, use and interpretation of Pearson’s Correlation
- To discover how to calculate the amount of variation in the dependent variable that can be explained.

**MAIN POINTS**

- Pearson’s Correlation Coefficient (*r*) is designed to measure the strength of a relationship between two interval variables. Pearson’s *r* can also be used to measure the strength of relationships between indicators measured at the ordinal level as long as they are coded appropriately, such that they run from low to high values with more or less equal and incremental categories.
- As in the case of Kendall’s Tau, the sign of *r* indicates the direction of the relationship.
  - A positive sign means that as the first variable increases in value so too does the second variable.
  - A negative sign indicates that as the first increases in value, the second variable decreases (or vice versa).

- The tables under STANDARDS below provide rough standards for how to evaluate the strength of the relationship for absolute values of *r*.
  - The closer the absolute value of *r* is to 1, the stronger the relationship between the two variables. When *r* is close to zero, the relationship is very weak.
  - When evaluating intermediate values of *r*, consider whether you are using public opinion data (like a PPIC or election study data set) or aggregate data (like worlddata.sav). Both are available for download under the Data menu on our course website. Relationships between public opinion variables, especially with ordinal level data, tend to register lower *r* coefficients than aggregate level data. As a result, two sets of standards are provided: one for public opinion data and one for aggregate data.
- A very useful aspect of Pearson’s *r* is that it allows us to measure the amount of explanatory power that the independent variable has regarding variation in the dependent variable. More specifically, the value of $r^2$ indicates the proportion of variation in the dependent variable that is explained by the variation in the independent variable. For example, if $r = .35$, then $r^2 = (.35)^2 = .1225$. In other words, the variation in the independent variable explains 12.25% of the variation in the dependent variable.
- It is possible to compare $r^2$ values to one another to determine which relationship has the greatest explanatory power. As always, care should be taken to ensure that missing values are properly handled, as their inclusion can substantially reduce correlations.
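The points above about sign, $r^2$, and missing values can be illustrated with a small sketch. This is not part of the lab's SPSS workflow: the function name `pearson_r` and the toy data are hypothetical, chosen only to show how the coefficient behaves.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical toy data, not taken from any course dataset
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 5]
y_falling = [5, 4, 5, 4, 2]

r_pos = pearson_r(x, y_rising)    # positive: y tends to rise with x
r_neg = pearson_r(x, y_falling)   # negative: y tends to fall as x rises
r_squared = r_pos ** 2            # proportion of variation explained

# One extra case whose y value is a missing-data code (-9) left in by mistake
r_dirty = pearson_r(x + [5], y_rising + [-9])

print(round(r_pos, 3), round(r_neg, 3), round(r_squared, 3), round(r_dirty, 3))
```

In this toy case a single undeclared missing-value code is enough to turn a positive correlation into a negative one, which is why the missing values should be declared before running the correlation.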

**STANDARDS (FOR PUBLIC OPINION DATA)**

| MAGNITUDE OF ASSOCIATION | QUALIFICATION | COMMENTS |
|---|---|---|
| 0.00 | No Relationship | Knowing the independent variable does not at all explain variation in the dependent variable. |
| .00 to .15 | Not Useful | Not Acceptable |
| .15 to .20 | Very Weak | Minimally acceptable |
| .20 to .25 | Moderately Strong | Acceptable |
| .25 to .30 | Fairly Strong | Good Work |
| .30 to .40 | Strong | Great Work |
| .40 to .60 | Very Strong/Worrisomely Strong | Either an excellent relationship OR the two variables are measuring the same thing |
| .60 to .99 | Redundant (?) | Proceed with caution: are the two variables measuring the same thing? |
| 1.00 | Perfect Relationship | If we know the independent variable, we can predict the dependent variable with absolute success. |
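As a rough aid for applying the public opinion standards above, the lookup can be sketched in code. The function name `qualify_public_opinion` is hypothetical, and one assumption is made that the table itself does not state: a value that falls exactly on a boundary (such as .15) is placed in the higher band.

```python
def qualify_public_opinion(r):
    """Return the qualification label for r under the public opinion standards."""
    size = abs(r)  # the standards apply to absolute values of r
    if size > 1:
        raise ValueError("Pearson's r must lie between -1 and 1")
    if size == 0:
        return "No Relationship"
    if size == 1:
        return "Perfect Relationship"
    # (upper bound, label); assumption: boundary values go to the higher band
    bands = [
        (0.15, "Not Useful"),
        (0.20, "Very Weak"),
        (0.25, "Moderately Strong"),
        (0.30, "Fairly Strong"),
        (0.40, "Strong"),
        (0.60, "Very Strong/Worrisomely Strong"),
    ]
    for upper, label in bands:
        if size < upper:
            return label
    return "Redundant (?)"

print(qualify_public_opinion(-0.150))
print(qualify_public_opinion(0.35))
```

The label only summarizes magnitude; the sign of *r* still has to be read separately to get the direction of the relationship.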

**STANDARDS (FOR AGGREGATE DATA)**

| MAGNITUDE OF ASSOCIATION | QUALIFICATION | COMMENTS |
|---|---|---|
| 0.00 | No Relationship | Knowing the independent variable does not at all explain variation in the dependent variable. |
| .00 to .30 | Not Useful, very weak | Not Acceptable |
| .30 to .50 | Weak | Minimally acceptable |
| .50 to .70 | Fairly Strong | Acceptable |
| .70 to .85 | Strong | Good Work |
| .80 to .90 | Very Strong/Worrisomely Strong | Either an excellent relationship OR the two variables are measuring the same thing |
| .90 to .99 | Redundant (?) | Proceed with caution: are the two variables measuring the same thing? |
| 1.00 | Perfect Relationship | If we know the independent variable, we can predict the dependent variable with absolute success. |

**EXAMPLE**

*Dataset:*

- ANES 2012

*Dependent Variable:*

- Economic Equality EcEq (Alpha .70)
  - Indicators: EcEq1 (cses_govtact), EcEq2 (ineqinc_ineqreduc), EcEq3 (guarpr_self).

*Independent Variables:*

- Democrat Id (pid),
- Feeling toward Democrats (ft_dem),
- Improved Finances (finance_finpast_x),
- Improved Econ (econ_ecpast_x).

*Hypothesis Arrow Diagram:*

- Democratic Id –> Egal
- Positive Feeling toward Democrats –> Egal
- Improved Finances –> Egal
- Improved Econ –> Egal

*Syntax*

    *Identifying EconEq Index Items*.
    weight by weight_full.
    missing values cses_govtact (-9 thru -6).
    recode cses_govtact (1=1) (2=.75) (3=.5) (4=.25) (5=0) into eceq1.
    missing values ineqinc_ineqreduc (-9 thru -6).
    recode ineqinc_ineqreduc (1=1) (2=0) (3=.5) into eceq3.
    missing values guarpr_self (-9 thru -2).
    recode guarpr_self (1=1) (2=.832) (3=.666) (4=.5) (5=.332) (6=.166) (7=0) into eceq5.

    *Conducting Reliability Analysis*.
    reliability /variables= eceq1 eceq3 eceq5
      /scale (EcEq3) eceq1 eceq3 eceq5
      /summary = all.

    *Constructing the Index*.
    compute RawEqIndex = eceq1 + eceq3 + eceq5.
    fre var RawEqIndex /statistics = mean median stddev skew kurtosis.

    *Recoding the Index*.
    recode RawEqIndex (.00 thru 1.00 =1) (1.01 thru 1.85 =2) (1.86 thru 3 = 3) into IEcEq3.
    fre var IEcEq3 /statistics = mean median stddev skew kurtosis.

    *Creating Independent Variables*.
    missing values pid_self (-9 thru 0, 5).
    missing values pid_x (-2).
    recode pid_self (1=1) (3 = .5) (2=0) into pid.
    value labels pid 1 'Dem' .5 'Ind' 0 'Rep'.

    *partisan feeling thermometers*.
    missing values ft_dem ft_rep (-2, -8, -9).
    fre var ft_dem ft_rep.

    *Personal finance-past & future*.
    missing values finance_finpast_x (-9 thru -1).
    missing values finance_finnext_x (-9 thru -1).
    fre var = finance_finpast_x finance_finnext_x /statistics = mean median stddev.

    *Economy-past & future*.
    missing values econ_ecpast_x (-9 thru -1).
    missing values econ_ecnext_x (-9 thru -1).

    *Correlational Analysis*.
    corr variables = raweqindex ft_dem pid finance_finpast_x econ_ecnext_x.
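For readers who find the recode logic easier to follow outside SPSS, here is a minimal sketch of the same index construction. The recode values and cut points are copied from the syntax above; the function names and the sample respondents are hypothetical, and the treatment of missing data (any unlisted code makes the whole index missing) is an assumption meant to mirror the `compute` step.

```python
# Recode maps copied from the syntax above; any code not listed is treated as missing
RECODES = {
    "cses_govtact": {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25, 5: 0.0},
    "ineqinc_ineqreduc": {1: 1.0, 2: 0.0, 3: 0.5},
    "guarpr_self": {1: 1.0, 2: 0.832, 3: 0.666, 4: 0.5, 5: 0.332, 6: 0.166, 7: 0.0},
}

# (low, high, new code) bands copied from the "Recoding the Index" step
BANDS = [(0.00, 1.00, 1), (1.01, 1.85, 2), (1.86, 3.00, 3)]

def raw_eq_index(respondent):
    """Sum the three recoded items; return None if any item is missing."""
    total = 0.0
    for variable, mapping in RECODES.items():
        value = mapping.get(respondent.get(variable))
        if value is None:
            return None
        total += value
    return total

def recode_index(raw):
    """Collapse the raw index into three categories using the bands above."""
    if raw is None:
        return None
    for low, high, code in BANDS:
        if low <= raw <= high:
            return code
    return None  # value falls in a gap between bands

# Hypothetical respondents
complete = {"cses_govtact": 2, "ineqinc_ineqreduc": 3, "guarpr_self": 4}
refused = {"cses_govtact": -9, "ineqinc_ineqreduc": 1, "guarpr_self": 1}

print(raw_eq_index(complete), recode_index(raw_eq_index(complete)))
print(raw_eq_index(refused), recode_index(raw_eq_index(refused)))
```

Each item is rescaled to run from 0 to 1 before summing, so the three items contribute equally to the raw index, which runs from 0 to 3.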

*Output*

Correlations

| | | RawEqI | DemFeel | DemPid | Personal finances | US Econ |
|---|---|---|---|---|---|---|
| RawEqIndex | Correlation | 1 | .511 | .465 | -.150 | -.250 |
| | Sig. | | .000 | .000 | .000 | .000 |
| | N | 5047 | 5014 | 4777 | 5001 | 4920 |
| DemFeel | Correlation | .511 | 1 | .681 | -.270 | -.396 |
| | Sig. | .000 | | .000 | .000 | .000 |
| | N | 5014 | 5856 | 5529 | 5800 | 5695 |
| Dem PID | Correlation | .465 | .681 | 1 | -.211 | -.278 |
| | Sig. | .000 | .000 | | .000 | .000 |
| | N | 4777 | 5529 | 5559 | 5508 | 5416 |
| Personal finances | Correlation | -.150 | -.270 | -.211 | 1 | .299 |
| | Sig. | .000 | .000 | .000 | | .000 |
| | N | 5001 | 5800 | 5508 | 5845 | 5683 |
| US Econ | Correlation | -.250 | -.396 | -.278 | .299 | 1 |
| | Sig. | .000 | .000 | .000 | .000 | |
| | N | 4920 | 5695 | 5416 | 5683 | 5738 |

*Interpretation*

- The correlation matrix is symmetrical, meaning that each variable selected for the correlation analysis appears in *both* the rows and the columns. This leads to redundant figures above and below the diagonal.
- Looking at the first column, which represents our dependent variable, we can see the Pearson’s *r* value at the top of each cell. In the middle of the cell is the significance level. At the bottom of each cell is the sample size (N). For instance, we can see that for cell [RawEqIndex X personal finances] Pearson’s *r* = -.150, significance = .000, N = 5001.
  - The sign of the coefficient is negative, meaning that the better one’s personal finances, the less likely they are to endorse egalitarian attitudes.
  - Since this is public opinion data, we conclude that there is a very weak inverse relationship between personal finances and egalitarian attitudes.
  - Since $r^2 = (-.150)^2 = .0225 \approx .023$, we know that personal finances explains about 2.3% of the variation in egalitarian attitudes in this data set.
- The second column or row of the table shows that the correlation between feelings toward the Democrats and identifying as a Democrat is .681, which is high enough to suggest that these two variables are actually measuring the same concept. When you use your own variables, you should use theory and your knowledge of the world to decide whether the relationship is too high, or just an excellent predictor.
- Try to rerun this analysis using the indicators for anticipated future finances and economic performance or feelings toward the Republicans as IVs. These variables are included in the above syntax, so they need only be added to the correlations command.
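To extend the interpretation to the whole first column, the $r^2$ calculation can be repeated for each independent variable. The sketch below uses the coefficients from the output table above; the dictionary and variable names are hypothetical labels for illustration.

```python
# Correlations with RawEqIndex, read from the first column of the output table
r_with_dv = {
    "DemFeel": 0.511,
    "DemPid": 0.465,
    "Personal finances": -0.150,
    "US Econ": -0.250,
}

# Squaring removes the sign: r^2 measures explanatory power, not direction
explained = {name: r ** 2 for name, r in r_with_dv.items()}

# Rank the independent variables from most to least variation explained
ranking = sorted(explained, key=explained.get, reverse=True)

for name in ranking:
    print(f"{name}: r = {r_with_dv[name]:+.3f}, r^2 = {explained[name]:.4f}")
```

Note that US Econ ranks above Personal finances even though both coefficients are negative; for explanatory power only the magnitude matters.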

**INSTRUCTIONS**

**Stage 1**

- Select any of the available datasets for the purpose of this exercise.
- Hypothesize a relationship between two interval or ordinal variables in the data set. Although the correlation will only describe a relationship, please think about which is the dependent variable and which is the independent variable.
  - For example, an individual’s attitude toward egalitarianism
- Run the Frequency distribution for each of the variables. Based on the Frequency output, decide how to *recode* each variable (if necessary) and identify the *missing values*.
- Use SPSS and select the chosen dataset.
- Select “**Correlation**” or enter the appropriate syntax using the example from above.
- Enter the dependent & independent variables in the entry boxes for ‘Step 1’. Remember to also enter the missing values.
- Enter any recodes (if necessary) in ‘Step 3’ and hit Run.
- Judge whether the relationship meets the standards based on the magnitude of Pearson’s *r* value and also based on whether the significance is below .05.
- Calculate the explanatory power of the independent variable over the dependent variable. That is, calculate **r-square** ($r^2$).
- Repeat the analysis until you find a set of variables with a relationship that has a correlation value that meets the standards above.
- Compose a few sentences explaining your analysis and results.

**Stage 2**

- Hypothesize at least two *other* **independent** variables that may explain variation in the dependent variable. Examples are given below.
- Run frequency distributions for each variable to determine recodes.
- Edit your syntax to include the additional variables.
- Note that the syntax to run the correlation is:
  - correlation **DependentVar IndepVar** /statistics=all.
- Add other variables to the correlation line to create a matrix showing the correlations for the combination of the variables entered.
- Your syntax for the correlation should now be:
  - correlation **DependentVar IndepVar1 IndepVar2 IndepVar3** /statistics=all.
- Rerun your edited syntax. *Repeat* this stage until you have found at least *three* independent variables that have acceptable correlations with the dependent variable.
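What the correlation matrix contains can be sketched outside SPSS as follows. This is an illustration only: the function names and toy data are hypothetical, and missing values are represented as `None`. Each cell uses every case that has valid values on that particular pair of variables, which is consistent with the N differing from cell to cell in the example output.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def correlation_matrix(data):
    """Map each (row, column) pair of variable names to (r, N)."""
    matrix = {}
    for row in data:
        for col in data:
            # Keep only the cases with valid values on both variables
            pairs = [(a, b) for a, b in zip(data[row], data[col])
                     if a is not None and b is not None]
            xs = [a for a, _ in pairs]
            ys = [b for _, b in pairs]
            matrix[(row, col)] = (pearson_r(xs, ys), len(pairs))
    return matrix

# Hypothetical toy data; None marks a missing value
data = {
    "dv": [1, 2, 3, 4, 5, None],
    "iv1": [2, 4, 5, 4, 5, 3],
    "iv2": [5, None, 4, 2, 2, 1],
}

matrix = correlation_matrix(data)
for (row, col), (r, n) in matrix.items():
    print(f"{row} x {col}: r = {r:+.3f}, N = {n}")
```

The cell for (dv, iv1) matches the cell for (iv1, dv), so only one triangle of the matrix needs to be read.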

**Additional Syntax**

    *Additional IVs*.
    *Demographics*.
    missing values dem_age_r_x (-9 thru -1).
    missing values dem_agegrp_iwdate (-9 thru -1).
    fre var dem_age_r_x dem_agegrp_iwdate /statistics = mean median stddev.
    missing values inc_totinc (-9 thru -1).
    missing values inc_incgroup_pre (-9 thru -1).
    fre var inc_totinc inc_incgroup_pre /statistics = mean median stddev.
    missing values dem_edugroup (-9, -2).
    fre var dem_edugroup.
    fre var gender_respondent.

    *Others*.
    missing values health_2010hcr_x (-9, -8).
    fre var pid_self pid_x health_2010hcr_x.
    missing values libcpre_self (-9 thru -2).
    fre var libcpre_self.
    fre var egal_equal egal_toofar egal_bigprob egal_worryless egal_notbigprob egal_fewerprobs.

    *Identifying Egalitarian Index Items*.
    missing values egal_equal to egal_fewerprobs (-9 thru -6).
    fre var egal_equal to egal_fewerprobs.

**QUESTIONS FOR REFLECTION**

- How much more variation is explained by your 1st highest ranked independent variable compared to your 2nd highest ranked independent variable?
- How does Pearson’s r differ from the other measures of association?
- Does the value of Pearson’s r depend on which of the variables is the dependent variable and the independent variable?
- How can you use correlation analysis to find relationships?
- What are some good practical uses of correlational analysis?
- Why are the standards for public opinion data different from the standards for aggregate data?

**DISCUSSION**

- To determine how much more variation is explained by one independent variable compared to another, take the difference of the r-squared values, not the difference in r values. That is, calculate $r_A^2 - r_B^2$.
- Unlike the other measures of association, Pearson’s r allows us to calculate the explanatory power of a relationship. To see the tau-b coefficients for the same matrix of variables use this syntax:
  `nonpar corr /variables= raweqindex ft_dem pid finance_finpast_x econ_ecnext_x /print kendall.`

- Like other measures of association, Pearson’s r only measures correlation, which is distinct from causation. So the Pearson’s r value will be the same whichever variable you identify as dependent or independent.
- To blindly find strong relationships, simply plug in all the variables you think may be related to one another into a correlation matrix. Then look at the Pearson’s r in each of the cells to find out which variables are related to one another.
- One practical use of a correlation matrix is to find variables that would make suitable indexes. By finding variables that have high Pearson’s r in the matrix you will have an idea of which variables will be suitable for an index. For example, you may find that variables A&B, B&C, and A&C are all strongly correlated to one another. Once you know that all these three variables are strongly correlated to one another, you can try including all 3 in a reliability-run to see whether they make a good index.
- Another good use for a Pearson’s correlation matrix is to find good independent variables to explain your dependent variable. This will be useful when we get to Multivariate Regression.
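As a worked illustration of the first discussion point, take the two strongest correlations with RawEqIndex from the example output, with A standing for feeling toward Democrats ($r_A = .511$) and B for Democratic party identification ($r_B = .465$). The labels A and B are assigned here only for the illustration.

```latex
\begin{aligned}
r_A^2 &= (.511)^2 \approx .2611 \\
r_B^2 &= (.465)^2 \approx .2162 \\
r_A^2 - r_B^2 &\approx .2611 - .2162 = .0449
\end{aligned}
```

So, in this example, feeling toward Democrats explains roughly 4.5 percentage points more of the variation in the index than party identification does.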