
UCSC LAB MANUAL: Lab 6

Crosstabulation with Nominal Variables

PURPOSE

To learn how to perform a crosstabulation and practice formulating hypotheses.
To learn how to interpret crosstabs where at least one variable is nominal.
To learn how to measure the strength of the relationship between two variables.
To learn how to apply the basic measures of association: phi and Cramer’s V.

MAIN POINTS

Crosstabulation

Crosstabulation brings together two variables and displays the relationship between them in a single table. Each column in the crosstab corresponds to a category of the independent variable, and each row corresponds to a category in the dependent variable. Hence the dependent variable goes on the left, and the independent variable goes on the top.
Each cell represents a unique combination of categories from each of the variables. For example, in the table below, the cell “G” represents all the respondents who selected Category I for the independent variable and Category III for the dependent variable.
The percentage in each cell is calculated by dividing the number of respondents in the cell by the total number of respondents for the column. Note: the cell-percentage values will be wrong if the missing values are not eliminated. Pay attention to the percentages in each cell rather than the number (n) of respondents in each cell.
To interpret a crosstab, compare the column percentages across each row to see whether they differ. For instance, in the table below, compare the percentage values for cells A, B, and C, then compare D, E, and F, and finally compare G, H, and I. If the column percentages of cells A-B-C, and/or D-E-F, and/or G-H-I differ markedly from one another, then you have found a relationship.

		INDEPENDENT VARIABLE
		Category I	Category II	Category III
DEPENDENT VARIABLE	Category I	A	B	C
	Category II	D	E	F
	Category III	G	H	I
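
The column-percentage rule described above can be checked by hand. The following is a minimal sketch in Python, offered only as an illustration outside the SPSS workflow of this lab. It uses the four cell counts printed in the Example #1 output further down (406, 306, 248, 327); the function name `column_percentages` is our own and not part of any package.

```python
def column_percentages(table):
    """Return each cell as a percentage of its column total.

    `table` is a list of rows (DV categories); each row lists the counts
    for the IV categories, in column order.
    """
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [round(100 * n / total, 1) for n, total in zip(row, col_totals)]
        for row in table
    ]


# Counts from the Example #1 output: rows are yes / no, columns are Male / Female.
counts = [[406, 306],
          [248, 327]]

for label, row in zip(["yes", "no"], column_percentages(counts)):
    print(label, row)
# yes [62.1, 48.3]
# no [37.9, 51.7]
```

Each column sums to 100%, which is why the comparison is made across a row: the two figures in the 'yes' row are directly comparable even though the column totals differ.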

Measures of Association: Nominal data–Phi and Cramer’s V

Measures of Association calculate the strength, and for ordinal variables the direction, of the relationship between two variables.
- PHI is used to measure the strength of the association between two variables, each of which has only two categories. (It applies to 2 X 2 nominal tables only).
- CRAMER’S V is used to measure the strength of the association between one nominal variable and either another nominal variable or an ordinal variable. Both of the variables can have more than 2 categories. (It applies to either nominal X nominal crosstabs or ordinal X nominal crosstabs, with no restriction on the number of categories.)
Interpreting the value of the Level of Association:

LEVEL OF ASSOCIATION	Verbal Description	COMMENTS
0.00	No Relationship	Knowing the independent variable does not help in predicting the dependent variable.
.00 to .15	Very Weak	Not generally acceptable
.15 to .20	Weak	Minimally acceptable
.20 to .25	Moderate	Acceptable
.25 to .30	Moderately Strong	Desirable
.30 to .35	Strong	Very Desirable
.35 to .40	Very Strong	Extremely Desirable
.40 to .50	Worrisomely Strong	Either an extremely good relationship or the two variables are measuring the same concept
.50 to .99	Redundant	The two variables are probably measuring the same concept.
1.00	Perfect Relationship	If we know the independent variable, we can perfectly predict the dependent variable.
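
The guidelines in the table can be written as a small lookup, which may help when many values have to be labeled. This is a sketch in Python rather than SPSS. The cut-offs are copied from the table; the handling of boundary values (for example exactly .15, which appears in two rows) is our own assumption, and `describe_strength` is a hypothetical helper name.

```python
# Cut-offs copied from the table above: (upper bound, verbal description).
# ASSUMPTION: a value exactly on a boundary (e.g. .15) goes to the higher band.
BANDS = [
    (0.15, "Very Weak"),
    (0.20, "Weak"),
    (0.25, "Moderate"),
    (0.30, "Moderately Strong"),
    (0.35, "Strong"),
    (0.40, "Very Strong"),
    (0.50, "Worrisomely Strong"),
    (1.00, "Redundant"),
]


def describe_strength(value):
    """Map a phi or Cramer's V value to the verbal description in the table."""
    size = abs(value)  # strength is judged on the magnitude, not the sign
    if size == 0:
        return "No Relationship"
    if size >= 1:
        return "Perfect Relationship"
    for upper, label in BANDS:
        if size < upper:
            return label


print(describe_strength(0.138))  # Very Weak
print(describe_strength(0.242))  # Moderate
```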

INSTRUCTIONS

Crosstabulating Nominal Data

Select an available Dataset for this exercise, perhaps from among the 2018-19 PPIC data sets available on the DataArt website.
Enter the Codebook for the chosen dataset.
Hypothesize a relationship between two indicators measured at the nominal level of measurement.
- For example, using the October 2016 PPIC data, one might suspect that intended vote on the marijuana initiative is related to gender or language, with men and anglophones being more supportive.
Open the relevant data set with SPSS.
Perform separate Frequency distributions for each of the variables. Based on the Frequency output, declare the appropriate missing values and recode each variable as needed.
Conduct a Bivariate Crosstabulation relating your Dependent and Independent variables using the syntax structure demonstrated in the example shown below.
Take care to enter the dependent variable first, followed by the independent variable. If the variables are placed appropriately, the DV will appear on the left of the crosstab and the IV will appear across the top (see diagram above).
Specify the appropriate cell contents and summary statistics on the second and third lines of your syntax.
When evaluating the measures of association, you should look at only Phi for 2 by 2 tables and Cramer’s V for other nominal tables.
Determine whether there is a relationship between the variables based on the column-percentages in the crosstab. Then, looking at the value of the measure of association, use the above guidelines to interpret the strength of the relationship.
Repeat the analysis until you find a set of variables with a relationship that has a moderate degree of association (> .2).

EXAMPLES

Example #1: Using phi with two dichotomous variables

EXAMPLE

Dataset: PPIC Statewide Survey October 2016
Y Variable: Intended Vote regarding recreational marijuana initiative
Indicator for Y
Q21. “Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute’ … If the election were held today, would you vote yes or no on Proposition 64?”
Possible Explanations (X₁ gender, X₂ lang)
Gender of respondent
Language groups
Indicators for X
X₁ Gender
X₂ Language of Interview

Arrow Diagrams:
- X → Y
- Gender → Voting Intention on Marijuana Initiative
- Language Group → Voting Intention on Marijuana Initiative

Syntax:

*Preparing the DV and IV*.
missing values q21 (8,9).
recode gender (1=0) (2=1) into female.

*Running the Crosstabulation*.
crosstabs
    /tables=q21 BY female, lang
    /cells=column count
    /statistics = phi.

Syntax Legend
- Missing values and recodes: These are determined from a trial run of the Frequencies output.
- Recoding the ambiguous dichotomy Gender into the clearer Female is recommended.
- Crosstab command: This tells SPSS which variables to use in the table. Enter the Dependent variable first, then the Independent.
- /cells =: This tells SPSS to put column percentages and frequencies in each cell. Make sure to indent this continuation of the Crosstabs command.
- /statistics =: This line of syntax is included as part of the crosstab command in order to calculate the nominal Measures of Association (phi and Cramer’s V).
Output:

Crosstabulation of Initiative Vote intention by Gender
			Female		Total
			Male	Female	Total
Q21. Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute.’ If the election were held today, would you vote yes or no on Proposition 64?	yes	Count	406	306	712
	yes	% within Female	62.1%	48.3%	55.3%
	no	Count	248	327	575
	no	% within Female	37.9%	51.7%	44.7%
Total		Count	654	633	1287
Total		% within Female	100.0%	100.0%	100.0%

Symmetric Measures
	Value	Approximate Significance
Nominal by Nominal	Phi	.138	.000
Nominal by Nominal	Cramer’s V	.138	.000
N of Valid Cases	1287

Source: PPIC October 2016
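
The Phi value reported by SPSS can be reproduced by hand from the four cell counts in the output above. The sketch below is in Python rather than SPSS and uses the standard formula for a 2 X 2 table, phi = (ad - bc) / sqrt(product of the two row totals and the two column totals). The function name `phi_2x2` is our own.

```python
from math import sqrt


def phi_2x2(a, b, c, d):
    """Phi for a 2 X 2 table laid out as

            IV=1  IV=2
    DV=1     a     b
    DV=2     c     d
    """
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return (a * d - b * c) / sqrt(row1 * row2 * col1 * col2)


# Counts from the output above: rows are yes / no, columns are Male / Female.
phi = phi_2x2(406, 306, 248, 327)
print(round(phi, 3))  # 0.138
```

Swapping the two columns (or the two rows) flips the sign of the result but not its size, which is why the sign of Phi depends on how the variables are coded.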

Edited Version of Table:

Intended Vote on Marijuana Proposition by Gender

		Female
		Male	Female
Q21. vote yes or no on Proposition 64?	yes	62.1%	48.3%
	no	37.9%	51.7%

Total		654	633

phi = .138
Edited versions of such tables are easier to absorb, so only the edited version of the second table is presented.

Intended Vote on Marijuana Initiative by Language of Interview

		Language
		English	Spanish
Intended Vote	Yes	59.1%	16.7%
Intended Vote	No	40.9%	83.3%
Total		1173	114

Phi = .242

Crosstab Legend:
- The number at the top of each cell is the number of cases (n), and the number at the bottom of each cell is the column percentage. (You may find that the row total figures differ slightly from the figures you would get from individual Frequency analyses. This is because some of the people who responded to one variable did not respond to the second and hence are eliminated by the missing values statement. So you can expect the number of missing cases to be slightly higher in the crosstab than in the individual frequency analyses.) Among all of the figures in the output, the most important for your assessment will be the column percentage for each cell.
Measures of Association Legend:
- For the present time, the only aspect of the ‘symmetric measures’ output that you have to note is the ‘Value’ column.
- While you can ignore the ‘Approximate Significance’ column for the time being, this will become important after we learn its meaning later in the course.
Interpretation of Crosstab:
- In the first (Q21 * gender) crosstab, comparing the column percentages for the cells in the ‘Yes’ row, we can see that there is a notable difference. These column percentages differ by 13.8 percentage points. A similar difference can also be observed in the ‘No’ row.
- This indicates that male respondents are more likely to say they intend to vote for the initiative than female respondents.
- And male respondents are less likely to say they intend to vote ‘No’ than are females.
- Since the crosstab is a 2 X 2 table, we know that Phi is the appropriate measure of association. The value of Phi is .138, which means that the relationship between Gender and Vote Intention is Very Weak.
- The value of Phi may be negative if the variables are coded in a particular way. The meaning of a negative measure of association will be discussed below. For the time being, recognize that the positive phi value here means that most of the cases are on the main diagonal of the table.
- The Phi of .138 only very weakly supports the hypothesis (X₁ → Y) that gender is related to vote intention on the marijuana initiative.
- The second (Q21 * Language) table shows a greater percentage difference across the rows. The difference between respondents to English and Spanish interviews is 42.4 percentage points. The table produces a rather stronger relationship, as indicated by Phi = .242. This reflects a greater concentration of cases on the main diagonal, relative to the off diagonal, than was evident in the first table. So the (X₂ → Y) hypothesis that language group is related to intended vote is supported, as indicated by the moderate relationship between X₂ and Y. Thus, language or ethnicity may be worth further investigation.

Example #2: Cramer’s V

Dataset
- PPIC Statewide Survey October 2016
Dependent Variable:
- Q21. “Proposition 64 is called the ‘Marijuana Legalization. Initiative Statute’ … If the election were held today, would you vote yes or no on Proposition 64?”
Independent Variable:
- Respondent Ethnicity
Arrow Diagram
- X₂ → Y
- Hispanic Ethnicity → Less likely to support Initiative.

Syntax

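missing values d8com (9).
recode d8com (3=1) (4=2) (else =3) into ethn.
value labels ethn 1 'Hisp' 2 'White' 3 'other'.
fre var ethn
 /statistics = mode median mean.
*Running the Crosstabulation*.
crosstabs
    /tables=q21 BY ethn
    /cells=column count
    /statistics = phi.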

Syntax Legend
- Recode indicator for IV into new variable name, with new value labels.
- Crosstab Command with the DV first, followed by the IV.
- /statistics = phi is a syntax subcommand which instructs SPSS to calculate the nominal measures of association; in this case, Cramer’s V is the one reported.

Output
Intended Vote on Marijuana Initiative by Ethnicity

		ethnicity
		Hisp	White	other
Intended Vote	yes	47.0%	56.9%	62.0%
Intended Vote	no	53.0%	43.1%	38.0%
Total		345	671	271

Cramer’s V = .109
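
As a hand-check of the reported value, Cramer’s V can be computed from cell counts as the square root of chi-square divided by N times the smaller of (rows - 1) and (columns - 1). The sketch below is in Python rather than SPSS, and `cramers_v` is our own helper name. Note one assumption: the edited table shows only percentages and column totals, so the cell counts used here are rebuilt by applying each column percentage to its column total and rounding; they are not taken from SPSS output.

```python
from math import sqrt


def cramers_v(table):
    """Cramer's V = sqrt(chi-square / (N * min(rows - 1, cols - 1)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi_square / (n * k))


# ASSUMPTION: counts rebuilt from the percentages and column totals above
# (e.g. 47.0% of 345 is about 162); they are not printed in the lab itself.
# Rows are yes / no, columns are Hisp / White / other.
counts = [[162, 382, 168],
          [183, 289, 103]]

print(round(cramers_v(counts), 3))  # 0.109
```

Because this table has only two rows, min(rows - 1, cols - 1) is 1, so the divisor is simply N and the value coincides with the Phi figure SPSS prints for the same table.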

Interpretation of Results

Start by looking at the column percentages and compare across the rows. In the ‘yes’ row, which represents the percentage of people intending to vote yes on the marijuana initiative, we see that 47% of self-identified Hispanics intend to vote in favor of the initiative, compared to 56.9% of self-identified whites and an even greater percentage of ‘others’ at 62%. This suggests that Hispanics are least supportive of legalizing recreational marijuana. We see the reverse pattern in the second (no) row of the table, where Hispanics are more likely to intend to vote ‘no’ than any of the other ethnic groups. These results are consistent with the hypothesis that Hispanics are less likely than other ethnic groups to signal support for the initiative.

- Keep in mind we are comparing column percentages here, not row percentages. Hence it is, for example, incorrect to observe that 47% of those who intend to vote ‘yes’ are Hispanic. Note that the percentages do not sum to 100 across the rows. However, the column percentages do.
- Taken together, the differences in the column percentages show substantial ethnic differences in support for the marijuana initiative, with Hispanics being least supportive, others being most supportive, and Whites in the middle. Moreover, one might say that each of the groups seems to differ from the others, so recoding any of the three groups together doesn’t seem to make sense.
Interpretation of Cramer’s V:
1. Since this crosstab involves a nominal independent variable with several categories, the appropriate measure of association to use in summarizing the relationship is Cramer’s V.
2. Generally we would not use Phi because it is only appropriate for 2 X 2 tables. In this case, however, the magnitude of the two measures is the same.
3. The Cramer’s V value is .109, rounded to .11. Using the standards above, this relationship is very weak, and not generally acceptable, despite the percentage differences reviewed above.
4. Based on the summary measure we have found only very slight differences among ethnic groups in their expressed voting intentions.

QUESTIONS FOR REFLECTION

Which measure of association should you use in a table like the one depicted here?

		INDEPENDENT VARIABLE
		Category I	Category II	Category III
DEPENDENT VARIABLE	Category I	A	B	C
	Category II	D	E	F
	Category III	G	H	I

Would the strength of the relationship be affected if you looked only at the results for some of the categories of the second IV?
Must the entire crosstabulation be presented?

DISCUSSION

If either or both of the variables are measured at the nominal level, Cramer’s V would be appropriate.

We can readily compare two values of the same measure. But be cautious about comparing different measures of association to each other. E.g., you can compare two values of Phi to one another, but be cautious about comparing a Phi value to a Cramer’s V value. Find out whether the value changes by including only two of the values on the ethnicity question.
Only the summary measures can be presented, but there is some loss of information in doing so.
Intended Vote for MJ initiative by Selected Nominal Predictors

	phi/v
Female	– .138
Span Interv	– .242
Ethnicity	.109
Region	.101
Coastal	.106