Lab 13


Recoding’s Effect on Measures of Association.


  • To discover how recoding variables can affect the results.


  • We recode data in order to:
    • Reduce the number of categories in a variable to make the crosstab easier to interpret
    • Convert an interval variable such as age, education or income into an ordinal variable suitable for cross-tabulation by grouping categories.
    • Combine categories into new ones that have meaning in the context of your research.
    • Eliminate cells that have fewer than 5 cases
  • Every time we recode a variable, we change the distribution of values among the categories. Measures of association and tests of statistical significance are calculated from the distribution of the variables. So as we recode the categories within the given variables, the values of the measure of association and statistical significance may also change.
  • After recoding some variables we may be forced to apply a different Measure of Association for the recoded pair.  For example, if an independent variable is nominal with 4 categories and the dependent has 2, then we would apply Cramer’s V because the table is a 4X2.  But if we recode the 4 categories of the independent variable into 2 categories and leave the second variable unchanged then we will have a 2X2 table and we should apply Phi instead of Cramer’s V.
  • Recoding variables may produce unpredictable effects on the value of the Measure of Association.
  • Selecting the Appropriate Measure of Association
Crosstabulation     | Category Specifications | Symmetry Specification | Measure of Association       | Indication of Direction
Nominal X Nominal   | Only (2 X 2)            | Symmetrical            | Phi                          | Yes
Nominal X Nominal   | Greater than (2 X 2)    |                        | Cramer’s V                   | No
Nominal X Ordinal   | At least (2 X 3)        |                        | Cramer’s V                   | No
Ordinal X Ordinal   | Square (e.g., 3 X 3)    | Symmetrical            | Kendall’s Tau-b              | Yes
Ordinal X Ordinal   | Rectangle (e.g., 3 X 4) | Asymmetrical           | Kendall’s Tau-c              | Yes
Interval X Interval | (not taught yet)        |                        | Pearson’s R (not yet taught) | Yes

*Note that some of the specifications are flexible, but flexibility of these rules depends on the case in question.
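The selection rules in the table can be written out as a small decision function. The Python below is only a sketch for checking your choice, not part of the lab's SPSS workflow. The function name `choose_measure` and its arguments are hypothetical, and treating every 2 X 2 table that involves a nominal variable as a case for Phi is an assumption, since the table leaves that case to judgment.

```python
VALID_LEVELS = {"nominal", "ordinal", "interval"}


def choose_measure(level_a, level_b, n_rows, n_cols):
    """Return the measure of association suggested by the table above.

    level_a, level_b: measurement level of each variable.
    n_rows, n_cols: number of categories on each side of the crosstab
    (ignored when both variables are interval).
    """
    levels = {level_a, level_b}
    if not levels <= VALID_LEVELS:
        raise ValueError("levels must be 'nominal', 'ordinal' or 'interval'")
    if levels == {"interval"}:
        return "Pearson's r"
    if "interval" in levels:
        # An interval variable must first be grouped into ordinal categories.
        raise ValueError("recode the interval variable into categories first")
    if "nominal" in levels:
        # Assumption: any 2 X 2 table with a nominal variable takes Phi.
        return "Phi" if n_rows == 2 and n_cols == 2 else "Cramer's V"
    # Both variables are ordinal: square tables take tau-b, rectangular tau-c.
    return "Kendall's tau-b" if n_rows == n_cols else "Kendall's tau-c"


# The 4 X 2 nominal example from the text, before and after the recode.
print(choose_measure("nominal", "nominal", 4, 2))  # Cramer's V
print(choose_measure("nominal", "nominal", 2, 2))  # Phi
```

Because the specifications are flexible, treat the result as a starting point and confirm it against the case in question.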


  1. Select a dataset with which you are familiar.
  2. Hypothesize a relationship between two variables. Ensure that at least one of the variables has several categories.  This exercise will be easier if both of the variables you select have many categories.
    • For example, the education-level (independent) of an individual determines their income-level (dependent).
  3. Perform separate trial-runs of the frequency distribution for each of the variables. Based on the Frequency output, identify the missing values.  At this stage, only recode the variable if it is necessary to re-arrange the categories from lowest to highest.
  4. Perform a crosstab and apply the appropriate measure of association without having combined the categories.
  5. Make note of the value of the appropriate measures of association and statistical significance.
  6. Now recode the variables by combining the categories for at least one of the variables and rerun the crosstab.  For instance, recode a variable with categories such as Very High, High, Moderate, Low and Very Low and combine the categories into ‘High’, ‘Moderate’, and ‘Low’. Or take a nominal variable for party affiliation and recode it so that the Conservatives form one category and the other party groupings form a second.
  7. Make note of the value of the measure of association and statistical significance for the crosstab after the recode.
  8. Note that you can compare the values of the Measures of Association directly to one another only if they have the same specifications described in the table above.  But do not hesitate to recode the variables even if it changes the appropriate Measure of Association.
  9. Compare the values of the Measures of Association and significance before and after the recodes.  Also, make note of whether the same Measure of Association can be applied to the set of variables before and after the recode.
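The before-and-after comparison in steps 4 to 9 can also be sketched by hand. The Python below is a minimal illustration under stated assumptions: the counts are made up, the helper names (`chi_square`, `cramers_v`, `collapse_columns`) are hypothetical, and every row and column is assumed to contain at least one case. It computes Cramér's V as the square root of chi-square divided by n times the smaller of (rows - 1) and (columns - 1), first on the full table and then after collapsing four IV categories into two.

```python
import math


def chi_square(table):
    """Pearson chi-square for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))


def collapse_columns(table, groups):
    """Combine columns; groups lists the column indices merged into each new column."""
    return [[sum(row[j] for j in group) for group in groups] for row in table]


# Hypothetical counts: rows = DV (low, med, high), columns = a 4-category IV.
before = [
    [60, 30, 25, 20],
    [40, 40, 35, 30],
    [20, 50, 40, 50],
]
# Recode: keep the first IV category, merge the other three into one.
after = collapse_columns(before, [[0], [1, 2, 3]])

print("V before recode:", round(cramers_v(before), 3))
print("V after recode: ", round(cramers_v(after), 3))
```

Changing which columns are merged changes V, which is why step 9 asks you to record both the values and whether the same measure still applies.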


  • Dataset:
    • CES2011
  • Independent Variable:
    • Cps11_71.  Recoded into PID4 (from previous labs).
  • Dependent Variable:
    • Egalitarian index (from previous labs)
  • Hypothesis Arrow Diagram:
    • PID4 -> Egalitarian Attitudes


*Weighting the Data*.
weight by WGTSamp.

*Preparing indicators of Attitudes re Inequality*.
*declare missing values on pes11_41*.
missing values pes11_41 (8,9).

*reverse scoring on pes11_41 and make it range from 0-1*.
recode PES11_41 (1=1) (2=.75) (3=.5) (4= .25)
   (5=0) into undogap.
value labels undogap 0 'muchless' .25 'someless' .5 'asnow'
   .75 'somemore' 1 'muchmore'.

*rescale mbs11_k2 from 0-10 to 0-1 and reverse its scoring*.
missing values mbs11_k2 (-99).
compute govact = (((mbs11_k2 * -1) +10)/10).
value labels govact 0'not act' 1 'gov act'.

*recode and re-label mbs11_b3*.
recode mbs11_b3 (1=1) (2=0) into goveqch.
value labels goveqch 1 'decent living' 0 'leave alone'.

*create an indexed variable (alpha=.66)*.
compute rawegal = undogap + govact + goveqch.
fre var = rawegal.

*trichotomize new index using 33-33-33 split*.
recode rawegal (0 thru 2.10=0)(2.15 thru 2.50=.5)
   (2.55 thru 3= 1) into egal3.
value labels egal3 0 'low' .5 'med' 1 'hi'.

*trichotomize new index using 25-50-25 split*.
recode rawegal (0 thru 1.90=0)(1.95 thru 2.65=.5)
   (2.70 thru 3= 1) into egal325.
value labels egal325 0 'low' .5 'med' 1 'hi'.

*Preparing IV indicator-party identification*.
recode cps11_71 (2=1) (1=2) (4=3) (3=4) into PID4.
value labels PID4 1 'Cons' 2 'Lib' 3 'BQ' 4 'NDP'.

*dichotomize IV*.
recode PID4 (1=1) (2,3,4 = 0) into PID2.
value labels PID2 1 'Conserv' 0 'Other'.

*Create 4 crosstabs*.
crosstabs tables = egal3 egal325 by PID4 PID2
   /cells = column count
   /statistics = phi chisq.
  • Syntax Legend
    • This syntax builds on syntax in previous labs
    • Both DV and IV are recoded.
    • The DV is trichotomized in two ways. Both use the cumulative percent column from the frequency distribution.
    • One trichotomy uses the 33-33-33 split from previous labs.
    • A second trichotomy uses a 25-50-25 split to delineate extreme scores on the DV more clearly.
    • The IV also appears in two forms.
    • One is a four-category form used previously.
    • The second is dichotomized from the four-category form into Conservative and other.
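The legend says both trichotomies are read off the cumulative percent column. One way to picture that step is sketched below in Python. This is an illustration only: the function names are hypothetical, and the rule of cutting at the first score whose cumulative percent reaches each target is an assumption about how the cut points might be chosen, not a description of how the values in the syntax above were obtained.

```python
from collections import Counter


def cumulative_percent(values):
    """Return (score, cumulative percent) pairs, lowest score first."""
    counts = Counter(values)
    n = len(values)
    running = 0
    pairs = []
    for score in sorted(counts):
        running += counts[score]
        pairs.append((score, 100.0 * running / n))
    return pairs


def trichotomize(values, lower_pct, upper_pct):
    """Code each score 0, .5 or 1 using two cumulative-percent targets.

    Assumed rule: each cut point is the first score whose cumulative
    percent reaches the target.
    """
    pairs = cumulative_percent(values)
    low_cut = next(score for score, pct in pairs if pct >= lower_pct)
    high_cut = next(score for score, pct in pairs if pct >= upper_pct)
    return [0 if v <= low_cut else 0.5 if v <= high_cut else 1 for v in values]


# Hypothetical index scores, one per respondent.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
thirds = trichotomize(scores, 33.3, 66.6)     # roughly a 33-33-33 split
quarters = trichotomize(scores, 25.0, 75.0)   # roughly a 25-50-25 split
print(thirds)
print(quarters)
```

With real data many respondents share the same score, so the groups rarely come out at exactly the target percentages.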


Output Summary

Egal3 V=.277
w/ 6df
w/ 2df
Egal325 V=.295
w/ 6df
w/ 2df
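The degrees of freedom reported above follow from the size of each crosstab. As a worked check, assuming the standard chi-square rule with r rows and c columns, the three-category DV crossed with the four-category PID4 gives 6 df, and crossed with the two-category PID2 gives 2 df:

```latex
\[
df = (r - 1)(c - 1), \qquad
(3 - 1)(4 - 1) = 6 \quad \text{(egal3 or egal325 by PID4)}, \qquad
(3 - 1)(2 - 1) = 2 \quad \text{(egal3 or egal325 by PID2)}
\]
```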


  • Interpretation
    • Notice that recoding both the IV and DV increases the Measure of Association.
    • The IV recode has more dramatic effects than the DV recode.
    • Together the recodes increase Cramer’s V by .1.
    • A more typical result is either a small increase in the value, or a small drop in the Measure of Association.


  • Was there a difference in the value of the measure of association after you recoded your indicators?
  • Did you apply the same Measure of Association to the variables after the recoding as you did before the recoding?
  • How can recoding be used to mislead?
  • What should be the guiding factor on how you choose to recode an indicator?
  • Suppose you suspect there is a relationship between two variables, but the measure of association has a value that is slightly below .15.  Is the relationship still worth considering?


  • The effect of recoding on the value of the M.O.A. is unpredictable.
  • It is important not to combine categories arbitrarily in order to fabricate a specious relationship between two variables.  The IV recode shown above may be just such an instance.  It is acceptable to recode categories in order to illuminate “hidden” relationships, but it is not acceptable to force a relationship to exist between two variables by disregarding whether the new combination of categories makes sense.  For example, you may be able to force a relationship to exist between age and another variable by strategically selecting age groups.  It is acceptable to recode age groups only if the new categories follow a reasonable pattern (20-30, 30-40,…) or if the groupings are based on a premise that has meaning in your research, such as life-stage or generation.
  • The primary guide in recoding should be to group the data into categories that make sense in the context of your research.
  • If the measure of association has a value that is slightly less than .15, the relationship may still be worth considering.  Combining some of the categories in the variables may illuminate the strength of the association.  If the values of the measure of association remain low after the variables have been recoded in all conceivable ways (and all the missing values have been eliminated, there are no cells with dangerously low n’s, and the variables are not skewed), then we should conclude that the relationship is not worth pursuing any further.