new Lab 11 | Data Art

Chi-Square

PURPOSE

Introduce statistical significance.
Learn how to perform the Chi-square test for significance.
Learn how sample size and standard deviation affect the level of significance.

MAIN POINTS

Significance

Our studies are typically based on samples for reasons of economy.
All of the public opinion studies used in our labs and assignments thus far represent only a sample of a population.
Generally, we use sample findings as a basis for inferences about some population.
The sample (n) consists only of those individuals who were directly surveyed for the study.
The population (N) consists of the wider collection of individuals or cases about which we want to generalize.
To determine whether a finding from a sample can be generalized to the broader population, we test for statistical significance.
In calculating significance, we ask whether the result obtained in the sample reflects a real pattern in the population or could simply be due to chance (sampling error).
Inferential statistics such as Chi-Square and ANOVA address sampling error only, not errors in question construction or errors in the coding and weighting of the data.
Although based upon different sampling distributions, Chi-square and ANOVA use the same standards for interpretation.
This lab discusses Chi-square; Lab 12 will take up ANOVA. Lab 13, which addresses t-tests, will not be covered in class, but you may work through it on your own.

Chi-Square

To determine whether a relationship in a cross tabulation is significant (not due to sampling error), use the Chi-square (Χ²) test.
Χ² works by comparing the observed frequencies in a table with the frequencies we would expect to find if there were no association between the variables (that is, no differences across the columns). The greater the difference between what we observe and what we would expect under no association, the less likely it is that the relationship is due to sampling error.
Cells should not have an expected frequency of fewer than 5 cases.
Χ² indicates only whether an observed relationship may be due to sampling error. As such, Χ² is not a measure of the strength of a relationship. Nor does it indicate the direction of a relationship.
If the probability level is less than .05, we conventionally infer that the observed relationship may be generalized to the population.
Statistical significance depends in part on sample size: as the number of cases in the sample (n) grows, statistical significance is easier to achieve; as the sample size shrinks, significance is more difficult to find.
A significant Chi-square tells us only that there is a significant difference in the table, not specifically where that difference is.
As we will see in the next lab, we can determine which specific categories on the independent variable have significantly different values on the dependent variable by using ANOVA.
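
As a rough illustration of the points above, the following Python sketch (not part of the lab's SPSS workflow) computes the expected frequencies and the Χ² statistic for a small table of made-up counts, then repeats the calculation with every cell multiplied by ten. The function name `chi_square` and the counts are hypothetical, chosen only to show that identical percentages yield a larger Χ² as n grows.

```python
def chi_square(observed):
    """Return (chi-square, expected counts) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under no association: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected

# Hypothetical counts: rows = DV (oppose, favor), columns = IV (group A, group B)
small = [[30, 20],
         [20, 30]]
large = [[cell * 10 for cell in row] for row in small]  # same percentages, 10x the cases

chi2_small, expected_small = chi_square(small)
chi2_large, _ = chi_square(large)

# Check the rule of thumb that no expected count falls below 5
min_expected = min(min(row) for row in expected_small)
print(f"n = 100:  chi-square = {chi2_small:.1f}, smallest expected count = {min_expected:.0f}")
print(f"n = 1000: chi-square = {chi2_large:.1f}")
```

In this made-up table the column percentages are the same in both runs (60% versus 40%), yet Χ² rises from 4.0 to 40.0 when the number of cases is multiplied by ten. This is why a weak relationship can still be statistically significant in a large sample.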

CONVENTIONAL STANDARDS FOR STATISTICAL SIGNIFICANCE

Probability level	Description	Acceptability	General comments	Chi-square interpretation
.00 to .01	Highly significant	Generally acceptable	The difference in the sample is very likely representative of the population, not due to chance.	The differences in the dependent variable across the categories of the independent variable are *very unlikely to be due to chance*.
.01 to .05	Significant	Conventionally acceptable	The difference in the sample is likely representative of the population.	The differences in the dependent variable across the categories of the independent variable are *unlikely to be due to chance*.
.05 to .10	Marginally significant	Acceptable in some circumstances	The difference may well be due to chance.	The differences in the dependent variable across the categories of the independent variable may be due to chance.
.10 to 1.00	Non-significant	Not acceptable	The difference in the sample is likely due to chance.	The differences in the dependent variable across the categories of the independent variable *are likely due to chance*.
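
For readers who like to see the standards as a rule, here is a minimal Python sketch that maps a probability level to the labels in the table above. The function name `describe_significance` is hypothetical, and the handling of values that fall exactly on a boundary (such as .05) is an assumption, since the table does not say which band they belong to.

```python
def describe_significance(p):
    """Map a probability level to the conventional labels in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    # Assumption: a value exactly on a boundary is placed in the lower band
    if p <= 0.01:
        return "Highly significant"
    if p <= 0.05:
        return "Significant"
    if p <= 0.10:
        return "Marginally significant"
    return "Non-significant"

for p in (0.000, 0.03, 0.08, 0.40):
    print(f"p = {p:.3f}: {describe_significance(p)}")
```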

EXAMPLE

Dataset:
- PPIC October 2016
Independent Variables:
- Party ID & Ideology
Dependent Variable:
- Attitude toward recreational marijuana
Hypothesis Arrow Diagram:
- More Democratic → More in favor of recreational marijuana use

CHI-SQUARE TEST FOR SIGNIFICANCE

Syntax

*Weighting the Data*.
weight by weight.

*Recoding MJ Index Items*.
recode q21 (1=1) (2=0) into MJPropD.
value labels MJPropD 1 'yes' 0 'no'.
recode q36 (1=1) (2=0) into MJLegalD.
value labels MJLegalD 1 'yes' 0 'no'.
recode q36a (1=1) (2=.5) (3=.0) into MJTry.
value labels MJTry 1 'recent' .5 'not recent' 0 'no'.

*Constructing an Index with alpha = .777*.
compute RawMJ3 = (MJPropD + MJLegalD + MJTry).

*Recoding the Index*.
recode RawMJ3 (0, .5=0) (1 thru 2= .5) (2.5, 3 =1) into MJ3.
value labels MJ3 0 'low' .5 'med' 1 'hi'.

*Creating IV Indicators of Party Identification & Ideology*.
recode q40c (1=0) (3=.5) (2=1) into Democrat.
value labels Democrat 1 'Democ' .5 'Indep' 0 'Repub'.
recode q37 (1,2=1) (3=.5) (4,5= 0) into liberal3.
value labels liberal3 1 'liberal' .5 'middle' 0 'conserv'.

*Crosstabulation of MJ3 by Democrat & Liberal*.
crosstabs tables = MJ3 by Democrat, liberal3
  /cells = column count
  /statistics = btau chisq.

Syntax Legend

Note that data have been weighted
Syntax has been edited to exclude reliability and frequency analyses completed in previous labs
The DV has been renamed MJ3 to indicate it is an index recoded into 3 values
Crosstabulation is a necessary prelude to Chi-Square
The chisq specification is added to the /statistics subcommand following the measure of association.

Output

Support for Recreational Marijuana by Partisanship

Support for Recreational MJ	Partisanship
	Repub	Indep	Democ
Low	57.4%	28.0%	30.6%
Medium	23.5%	28.0%	32.2%
High	19.1%	43.9%	37.1%
Total (n)	230	371	369

Tau-b = .147; Chi-square = 67.7; 4 df; p = .000.
Source: PPIC October 2016

Support for Recreational Marijuana by Ideology

Support for Recreational MJ	Ideology
	conserv	middle	liberal
Low	57.3%	31.3%	19.3%
Medium	20.6%	32.3%	32.4%
High	22.1%	36.5%	48.3%
Total (n)	335	288	358

Tau-b = .284; Chi-square = 116.3; 4 df; p = .000.
Source: PPIC October 2016
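
The Chi-square values reported by SPSS can be roughly checked by hand. The Python sketch below (an illustration outside the lab's SPSS workflow) rebuilds approximate cell counts from the column percentages and column totals in the two tables, then computes Chi-square and the degrees of freedom. The function name is hypothetical. Because the published percentages are rounded and the data are weighted, the rebuilt counts are only approximate, so the results should land near the reported 67.7 and 116.3 rather than match them exactly.

```python
def chi_square_from_column_percents(percents, column_ns):
    """Approximate chi-square from column percentages and column totals.

    percents: rows of the table, each a list of column percentages.
    column_ns: number of cases in each column.
    """
    # Rebuild approximate cell counts (rounded percentages make these inexact)
    observed = [[pct / 100 * n for pct, n in zip(row, column_ns)] for row in percents]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected if no association
            chi2 += (o - e) ** 2 / e
    # Degrees of freedom: (rows - 1) * (columns - 1)
    df = (len(percents) - 1) * (len(column_ns) - 1)
    return chi2, df

party = chi_square_from_column_percents(
    [[57.4, 28.0, 30.6], [23.5, 28.0, 32.2], [19.1, 43.9, 37.1]], [230, 371, 369])
ideology = chi_square_from_column_percents(
    [[57.3, 31.3, 19.3], [20.6, 32.3, 32.4], [22.1, 36.5, 48.3]], [335, 288, 358])

print(f"Partisanship: chi-square = {party[0]:.1f}, df = {party[1]}")
print(f"Ideology:     chi-square = {ideology[0]:.1f}, df = {ideology[1]}")
```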

Interpretation

- Glancing at the crosstabs, we can see that at least some of the cell percentages differ as we scan across the rows of each table. This suggests that there may be a relationship between the DV and each of the IVs.
- The respective values of Chi-square in each table (67.7 & 116.3) and related degrees of freedom (4) are essential for calculating statistical significance but not for its interpretation.
- In interpreting the results, the most important figure in the output is the significance or p value (reported by SPSS as Sig 2-sided), which is in each case .000. Because SPSS rounds to three decimal places, this means p is less than .0005. In other words, there is a very slight chance (less than .05%) that the relationship observed in the sample between MJ3 and either Party Identification or Ideology is due to sampling error.
- We can conclude that there are statistically significant relationships between both IVs and support for recreational marijuana.
- Not all relationships reach conventional levels of statistical significance. See the additional examples below.
- Even when we do find a significant relationship we do not know which columns differ significantly (beyond what one would expect due to chance).
- We can push our analysis further using ANOVA to identify which specific differences are significant. This will be discussed in Lab 12.
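
To see how a Chi-square value and its degrees of freedom translate into a p value, consider the following Python sketch. SPSS does this calculation for you; the sketch is only an illustration and relies on a closed-form shortcut that works when the degrees of freedom are even (as with the 4 df in both tables above). The function name `chi_square_p_value` is hypothetical.

```python
import math

def chi_square_p_value(chi2, df):
    """Upper-tail probability of a chi-square value; this closed form needs an even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive, even degrees of freedom")
    half = chi2 / 2
    # For df = 2k: p = exp(-x/2) * sum over i = 0..k-1 of (x/2)^i / i!
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series

for label, chi2 in (("Partisanship", 67.7), ("Ideology", 116.3)):
    p = chi_square_p_value(chi2, df=4)
    # Rounding to three decimals mimics how the value appears in the output
    print(f"{label}: chi-square = {chi2}, p = {p:.2e}, shown to three decimals as {p:.3f}")
```

Both p values are far smaller than .05, which is why they appear as .000 once rounded to three decimal places.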

INSTRUCTIONS

Begin by selecting one of the available datasets and hypothesize a relationship between two variables. Either variable can be measured at the nominal, ordinal, or interval level.
Identify missing values and essential recodes using Frequency runs.
Prepare a crosstab analysis as usual.
Include chisq on the /statistics subcommand along with the appropriate measure of association.
Determine whether you can infer that a relationship between the two variables exists in the population based on what you observe in the sample by referring to the Significance for the Pearson’s Chi-square. Use the same guidelines as in the previous labs.
Repeat the steps above until you find a pair of variables that yield a significant relationship for the Chi-Square test.

QUESTIONS FOR REFLECTION

Even when the relationship in a cross tabulation is clearly significant, is it possible for the variables not to be strongly related?
Since the level of significance improves as we increase the sample size, why do surveys usually limit sample size?
The significance of Chi-square applies to the table as a whole, but do we know which specific columns differ significantly from one another?
What further analyses can we conduct which will enable us to do so?

DISCUSSION

Statistical significance and strength (or predictability) of association are two different things. One can have a weak relationship that is statistically significant or a strong relationship that is not significant. Notice too that sample size affects Chi-square.
Beyond a certain sample size, adding more cases does not much improve the significance level. At that point, the marginal benefit of increasing the sample size has to be weighed against the cost of gathering more data. The tipping point is somewhere between 1500 and 2000 cases in most surveys.
A Chi-square analysis does not tell us which columns differ significantly unless, of course, the independent variable has only two columns.
To determine which specific columns in a multi-column crosstabulation differ from one another requires another approach.
An inefficient approach would be to construct a series of two column tables.
A more efficient way to proceed is to turn to Analysis of Variance, the topic of Lab 12.

ADDITIONAL EXERCISES

Try testing statistical significance using some of the IVs identified in Lab 7.
These include age, education, income, political interest and vote frequency.
Which of these relationships are significant and which are not?
The syntax for their construction is included here for your convenience.

Syntax from Lab 7:

recode d1a (1=0) (2= .2) (3= .4) (4=.6) (5=.8) (6=1) into age.
value labels age 0 '18+' .2 '25+' .4 '35+' .6 '45+' .8 '55+' 1 '65+'.

recode d6 (1=0) (2=.25) (3=.5) (4=.75) (5=1) into educ.
value labels educ 0 '<hs' .25 'hs' .5 'col' .75 'grad' 1 'post'.

recode d10 (1 =0) (2=.17) (3=.34) (4=.5) (5=.66) (6=.83) (7=1) into income.
value labels income 0 '<$20k' .17 '$20k+' .34 '$40k+' .5 '$60k+' .66 '$80k+' .83 '$100k+' 1 '$200k+' .

recode q38 (1=1) (2=.66) (3=.33) (4=0) into interest.
value labels interest 0 'none' .33 'only a little' .66 'fair amount' 1 'great deal'.

recode q39 (1=1) (2=.75) (3=.5) (4=.25) (5=0) into vote.
value labels vote 0 'never' .25 'seldom' .5 'part time' .75 'nearly' 1 'always'.

Now consider examining statistical significance using the ANES 2016 data with syntax at the end of Lab 10.
