UCSC LAB MANUAL: Lab 6

Crosstabulation with Nominal Variables

PURPOSE

  • To learn how to perform a crosstabulation and practice formulating hypotheses.
  • To learn how to interpret crosstabs where at least one variable is nominal.
  • To learn how to measure the strength of the relationship between two variables.
  • To learn how to apply the basic measures of association: phi and Cramer’s V.

MAIN POINTS

Crosstabulation

  • Crosstabulation brings together two variables and displays the relationship between them in a single table. Each column in the crosstab corresponds to a category of the independent variable, and each row corresponds to a category in the dependent variable. Hence the dependent variable goes on the left, and the independent variable goes on the top.
  • Each cell represents a unique combination of categories from each of the variables. For example, in the table below, the cell “G” represents all the respondents who selected Category I for the independent variable and Category III for the dependent variable.
  • The percentage in each cell is calculated by dividing the number of respondents in the cell by the total number of respondents for the column. Note: the cell-percentage values will be wrong if the missing values are not eliminated. Pay attention to the percentages in each cell rather than the number (n) of respondents in each cell.
  • To interpret crosstabs, compare the column percentages across the rows to see whether they differ. For instance, in the table below, compare the percentage values for cells A, B, and C, then compare D, E, and F, and finally compare G, H, and I. If the column percentages of cells A-B-C, and/or D-E-F, and/or G-H-I differ markedly from one another, then you have found a relationship.
                                      INDEPENDENT VARIABLE
                                      Category I    Category II    Category III
DEPENDENT VARIABLE    Category I      A             B              C
                      Category II     D             E              F
                      Category III    G             H              I
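
As a hedged illustration of the column-percentage rule described above, the following Python sketch divides each cell count by its column total. The counts are the ones reported in the SPSS output of Example #1 below; the function name `column_percentages` is only illustrative and is not part of SPSS, which does this calculation itself when column percentages are requested.

```python
def column_percentages(counts):
    """Return column percentages (one decimal place) for a table of counts.

    `counts` is a list of rows; each row holds one count per column
    (one column per category of the independent variable).
    """
    col_totals = [sum(col) for col in zip(*counts)]
    return [
        [round(100 * cell / total, 1) for cell, total in zip(row, col_totals)]
        for row in counts
    ]


# Rows: yes, no. Columns: Male, Female (counts from Example #1).
counts = [
    [406, 306],
    [248, 327],
]

percentages = column_percentages(counts)
for label, row in zip(["yes", "no"], percentages):
    print(label, row)
```

Reading across the "yes" row then compares the two columns directly, which is the comparison the bullet above asks for.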

Measures of Association: Nominal data–Phi and Cramer’s V

  • Measures of Association calculate the strength, and for ordinal variables the direction, of the relationship between two variables.
    • PHI is used to measure the strength of the association between two variables, each of which has only two categories. (It applies to 2 X 2 nominal tables only).
    • CRAMER’S V is used to measure the strength of the association between one nominal variable and either another nominal variable or an ordinal variable. Both of the variables can have more than 2 categories. (It applies to either nominal X nominal crosstabs or ordinal X nominal crosstabs, with no restriction on the number of categories.)
  • Interpreting the value of the Level of Association:
LEVEL OF ASSOCIATION    VERBAL DESCRIPTION      COMMENTS
0.00                    No Relationship         Knowing the independent variable does not help in predicting the dependent variable.
.00 to .15              Very Weak               Not generally acceptable
.15 to .20              Weak                    Minimally acceptable
.20 to .25              Moderate                Acceptable
.25 to .30              Moderately Strong       Desirable
.30 to .35              Strong                  Very Desirable
.35 to .40              Very Strong             Extremely Desirable
.40 to .50              Worrisomely Strong      Either an extremely good relationship or the two variables are measuring the same concept
.50 to .99              Redundant               The two variables are probably measuring the same concept.
1.00                    Perfect Relationship    If we know the independent variable, we can perfectly predict the dependent variable.
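
The guideline table above can be read as a simple lookup. The Python sketch below is one possible encoding, not an official rule: the function name `describe_association` is hypothetical, and because the table's bands share their end points (for example, .15 appears in two rows), the sketch assumes that a boundary value belongs to the higher band. It also takes the absolute value, since phi can be negative depending on how the variables are coded.

```python
# Upper bound of each band and its verbal description, taken from the table.
# Assumption: a value exactly on a boundary (e.g. .15) falls in the higher band.
BANDS = [
    (0.15, "Very Weak"),
    (0.20, "Weak"),
    (0.25, "Moderate"),
    (0.30, "Moderately Strong"),
    (0.35, "Strong"),
    (0.40, "Very Strong"),
    (0.50, "Worrisomely Strong"),
    (1.00, "Redundant"),
]


def describe_association(value):
    """Map a phi or Cramer's V value to the verbal description in the table."""
    magnitude = abs(value)  # strength only; the sign is ignored here
    if magnitude == 0:
        return "No Relationship"
    if magnitude >= 1.0:
        return "Perfect Relationship"
    for upper_bound, label in BANDS:
        if magnitude < upper_bound:
            return label
    return "Redundant"


for v in (0.138, 0.242, 0.109):
    print(v, describe_association(v))
```

The three values printed are the measures of association reported in the examples later in this lab.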

INSTRUCTIONS

Crosstabulating Nominal Data

  1. Select an available Dataset for this exercise, perhaps from among the 2018-19 PPIC data sets available on the DataArt website.
  2. Enter the Codebook for the chosen dataset.
  3. Hypothesize a relationship between two indicators measured at the nominal level of measurement.
    • For example, using the October 2016 PPIC data, one might suspect that intended vote on the marijuana referendum is related to gender or language, with men and anglophones being more supportive.
  4. Open the relevant data set with SPSS.
  5. Perform separate Frequency distributions for each of the variables. Based on the Frequency output, declare the appropriate missing values and recode each variable as needed.
  6. Conduct a Bivariate Crosstabulation relating your Dependent and Independent variables using the syntax structure demonstrated in the example shown below.
  7. Take care to enter the dependent variable first, followed by the independent variable. If the variables are placed appropriately, the DV will appear on the left of the crosstab and the IV will appear across the top (see diagram above).
  8. Specify the appropriate cell contents and summary statistics on the second and third lines of your syntax.
  9. When evaluating the measures of association, you should look at only Phi for 2 by 2 tables and Cramer’s V for other nominal tables.
  10. Determine whether there is a relationship between the variables based on the column-percentages in the crosstab. Then, looking at the value of the measure of association, use the above guidelines to interpret the strength of the relationship.
  11. Repeat the analysis until you find a set of variables with a relationship that has a moderate degree of association ( >.2).

EXAMPLES

Example #1: Using phi with two dichotomous variables

EXAMPLE

  • Dataset: PPIC Statewide Survey October 2016
  • Y Variable: Intended vote regarding the recreational marijuana initiative
  • Indicator for Y
    Q21. “Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute’ … If the election were held today, would you vote yes or no on Proposition 64?”
  • Possible Explanations (X1 gender, X2 lang)
    Gender of respondent
    Language groups
  • Indicators for X
    X1 Gender
    X2 Language of Interview
  • Arrow Diagrams :
    • X → Y
    • Gender → Voting Intention on Marijuana Initiative
    • Language Group → Voting Intention on Marijuana Initiative
  • Syntax:

    *Preparing the variables*.
    missing values q21 (8,9).
    recode gender (1=0) (2=1) into female.
    value labels female 0 'Male' 1 'Female'.
    
    *Running the Crosstabulation*.
    crosstabs
        /tables=q21 BY female, lang
        /cells=column count
        /statistics = phi.
  • Syntax Legend
    • Missing values and recodes: These are determined from a trial run of the Frequencies output.
    • Recoding the ambiguous dichotomy Gender into the clearer Female is recommended.
    • Crosstab command: This tells SPSS which variables to use in the table. Enter the dependent variable first, then the independent variable.
    •    /cells =: This tells SPSS to put column percentages and frequencies in each cell. Make sure to indent this continuation of the Crosstabs command.
    •    /statistics =: This line of syntax is included as part of the crosstab command in order to calculate the nominal Measures of Association (phi and Cramer’s V).
  • Output:
Crosstabulation of Initiative Vote Intention by Gender

Q21. Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute.’ If the election were held today, would you vote yes or no on Proposition 64?

                                Female
                                Male      Female    Total
yes      Count                  406       306       712
         % within Female        62.1%     48.3%     55.3%
no       Count                  248       327       575
         % within Female        37.9%     51.7%     44.7%
Total    Count                  654       633       1287
         % within Female        100.0%    100.0%    100.0%

Symmetric Measures

                                    Value    Approximate Significance
Nominal by Nominal    Phi           .138     .000
                      Cramer’s V    .138     .000
N of Valid Cases                    1287

Source: PPIC October 2016
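
To see where the Phi value in the output comes from, the Python sketch below recomputes it from the four cell counts in the crosstab above, using the standard relationship phi = sqrt(chi-square / N) for a 2 X 2 table. This is a hand check under stated assumptions, not the SPSS procedure itself: the function names are illustrative, no continuity correction is applied, and the sketch returns only the magnitude of phi, not its sign.

```python
import math


def chi_square(counts):
    """Pearson chi-square and N for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def phi_coefficient(counts):
    """Magnitude of phi for a 2 x 2 table: sqrt(chi-square / N)."""
    if len(counts) != 2 or any(len(row) != 2 for row in counts):
        raise ValueError("phi applies to 2 x 2 tables only")
    chi2, n = chi_square(counts)
    return math.sqrt(chi2 / n)


# Rows: yes, no. Columns: Male, Female (counts from the crosstab above).
q21_by_female = [
    [406, 306],
    [248, 327],
]

phi = phi_coefficient(q21_by_female)
print(round(phi, 3))  # 0.138, matching the Value column above
```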

Edited Version of Table:

Intended Vote on Marijuana Proposition by Gender

Q21. vote yes or no on Proposition 64?

             Female
             Male      Female
yes          62.1%     48.3%
no           37.9%     51.7%
Total        654       633

phi = .138
Edited versions of such tables are easier to absorb, so only the edited version of the second table is presented.

Intended Vote on Marijuana Initiative by Language of Interview

                   Language
                   English    Spanish
Intended Vote
Yes                59.1%      16.7%
No                 40.9%      83.3%
Total              1173       114

Phi = .242

  • Crosstab Legend:
    • The number at the top of each cell is the number of cases (n), and the number at the bottom of each cell is the column percentage. (You may find that the row total figures differ slightly from the figures you would get from individual Frequency analyses. This is because some of the people who responded to one variable did not respond to the second and hence are eliminated by the missing values statement. So you can expect that the number of missing cases will be slightly higher in the crosstab than it would be in the individual frequency analyses.) Among all of the figures in the output, the most important for your assessment will be the column percentage for each cell.
  • Measures of Association Legend:
    • For the present, the only aspect of the ‘Symmetric Measures’ output that you have to note is the ‘Value’ column.
    • While you can ignore the ‘Approximate Significance’ column for the time being, this will become important after we learn its meaning later in the course.
  • Interpretation of Crosstab:
    • In the first (Q21 * gender) crosstab, comparing the column percentages for the cells in the ‘Yes’ row, we can see a notable difference: they differ by 13.8 percentage points (62.1% versus 48.3%). A similar difference can also be observed in the ‘No’ row.
    • This indicates that male respondents are more likely than female respondents to say they intend to vote for the initiative.
    • And male respondents are less likely to say they intend to vote ‘No’ than are females.
    • Since the crosstab is a 2 X 2 table, we know that Phi is the appropriate measure of association. The value of Phi is .138, which means that the relationship between Gender and Vote Intention is Very Weak.
    • The value of Phi may be negative if the variables are coded in a particular way. The meaning of a negative measure of association will be discussed below. For the time being, recognize that the positive phi value here means that most of the cases are on the main diagonal of the table.
    • The Phi of .138 only very weakly supports the hypothesis (X1 → Y) that gender is related to vote intention on the marijuana initiative.
    • The second (Q21 * Language) table shows a greater percentage difference across the rows. The difference between respondents to English and Spanish interviews is 42.4 percentage points (59.1% versus 16.7%). The table produces a rather stronger relationship, as indicated by Phi = .242. This reflects an even greater concentration of cases on the main diagonal, relative to the off diagonal, than was evident in the first table. So the (X2 → Y) hypothesis that language group is related to intended vote is supported, as indicated by the moderate relationship between X2 and Y. Thus, language or ethnicity may be worth further investigation.

Example #2: Cramer’s V

  • Dataset
    • PPIC Statewide Survey October 2016
  • Dependent Variable:
    • Q21. “Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute’ … If the election were held today, would you vote yes or no on Proposition 64?”
  • Independent Variable:
    • Respondent Ethnicity
  • Arrow Diagram
    • X2 → Y
    • Hispanic Ethnicity → Less likely to support Initiative.
  • Syntax
    missing values d8com (9).
    recode d8com (3=1) (4=2) (else =3) into ethn.
    value labels ethn 1 'Hisp' 2 'White' 3 'other'.
    fre var ethn
     /statistics = mode median mean.
    crosstabs
        /tables=q21 BY ethn
        /cells=column count
        /statistics = phi.
    
  • Syntax Legend
    • Recode the indicator for the IV into a new variable name, with new value labels.
    • Crosstab command with the DV first, followed by the IV.
    •    /statistics = phi is a syntax subcommand which instructs SPSS to calculate the nominal measures of association. In this case, Cramer’s V is the one reported.
  • Output
    Intended Vote on Marijuana Initiative by Ethnicity

                              ethnicity
                              Hisp      White     other
    Intended Vote    yes      47.0%     56.9%     62.0%
                     no       53.0%     43.1%     38.0%
                     Total    345       671       271

    Cramer’s V = .109
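
    As a rough hand check of the Cramer’s V reported above, the Python sketch below applies the standard formula V = sqrt(chi-square / (N × (k − 1))), where k is the smaller of the number of rows and columns. An important caveat: the edited table gives only column percentages and column totals, so the cell counts used here are approximations reconstructed by multiplying each column total by its rounded ‘yes’ percentage. They are not taken from the SPSS output, and the function name `cramers_v` is only illustrative.

```python
import math


def cramers_v(counts):
    """Cramer's V for a table of counts: sqrt(chi2 / (N * (k - 1)))."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(counts), len(counts[0]))  # smaller table dimension
    return math.sqrt(chi2 / (n * (k - 1)))


# Column totals and 'yes' percentages as printed in the edited table.
col_totals = {"Hisp": 345, "White": 671, "other": 271}
pct_yes = {"Hisp": 47.0, "White": 56.9, "other": 62.0}

# ASSUMPTION: counts are reconstructed from rounded percentages, so they
# may differ slightly from the true cell counts in the data set.
yes_counts = [round(col_totals[g] * pct_yes[g] / 100) for g in col_totals]
no_counts = [col_totals[g] - yes for g, yes in zip(col_totals, yes_counts)]

v = cramers_v([yes_counts, no_counts])
print(yes_counts, no_counts)
print(round(v, 3))
```

    Because this table has only two rows, k − 1 equals 1 and the formula reduces to the one used for phi.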

    Interpretation of Results

    Start by looking at the column percentages and compare across the rows. In the ‘yes’ row, which represents the percentage of people intending to vote yes on the marijuana initiative, we see that 47% of self-identified Hispanics intend to vote in favor of the initiative, compared to 56.9% of self-identified whites and an even greater percentage of ‘others’ at 62%. This suggests that Hispanics are least supportive of legalizing recreational marijuana. We see the reverse pattern in the second (‘no’) row of the table, where Hispanics are more likely than any of the other ethnic groups to intend to vote against the initiative. These results are consistent with the hypothesis that Hispanics are less likely than other ethnic groups to signal support for the initiative.

    • Keep in mind we are comparing column percentages here, not row percentages. Hence it is, for example, incorrect to observe that 47% of those who intend to vote ‘yes’ are Hispanic. Note that the percentages do not sum to 100 across the rows; the column percentages, however, do.
    • Taken together, the differences in the column percentages show substantial ethnic differences in support for the marijuana initiative, with Hispanics being least supportive, others being most supportive, and Whites in the middle. Moreover, one might say that each of the groups seems to differ from the others, so recoding any of the three groups together doesn’t seem to make sense.
  • Interpretation of Cramer’s V:
    1. Since this crosstab involves a nominal independent variable with several categories, the appropriate measure of association to use in summarizing the relationship is Cramer’s V.
    2. Generally we would not use Phi because it is only appropriate for 2 X 2 tables. In this case, however, the magnitude of the two measures is the same, because the dependent variable has only two categories.
    3. The Cramer’s V value is .109, rounded to .11. Using the standards above, this relationship is very weak, and not generally acceptable, despite the percentage differences reviewed above.
    4. Based on the summary measure, we have found only very slight differences among ethnic groups in their expressed voting intentions.

QUESTIONS FOR REFLECTION

  1. Which measure of association should you use in a table like the one depicted here?
                                          INDEPENDENT VARIABLE
                                          Category I    Category II    Category III
    DEPENDENT VARIABLE    Category I      A             B              C
                          Category II     D             E              F
                          Category III    G             H              I
  2. Would the strength of the relationship be affected if you looked only at the results for some of the categories of the second IV?
  3. Must the entire crosstabulation be presented?

DISCUSSION

  1. If either or both of the variables are measured at the nominal level, Cramer’s V would be appropriate.
  2. We can readily compare two values of the same measure, but be cautious about comparing different measures of association to each other. E.g., you can compare two Phi values to one another, but be cautious about comparing a Phi value to a Cramer’s V value. Find out whether the value changes by including only two of the values on the ethnicity question.
  3. It is possible to present only the summary measures, but there is some loss of information in doing so.

    Intended Vote for MJ initiative by Selected Nominal Predictors

         phi/v
    Female – .138
    Span Interv – .242
    Ethnicity   .109
    Region   .101
    Coastal   .106