12.5 \(\chi\)2 Contingency Test
When the contingency table is of dimensions greater than 2 x 2, the most commonly applied test is the \(\chi\)2 Contingency Test.
For this activity we’re using the “worm” data associated with Example 9.4 on page 244 of the test. Please read the example!
12.5.1 Hypothesis statement
As shown in the text example, we have a 2 x 3 contingency table, and we’re testing for an association between two categorical variables.
Here are the null and alternative hypotheses (compare these to what’s written in the text):
H0: There is no association between the level of trematode parasitism and the frequency (or probability) of being eaten.
HA: There is an association between the level of trematode parasitism and the frequency (or probability) of being eaten.
- We use \(\alpha\) = 0.05.
- It is a two-tailed alternative hypothesis
- We’ll use a contingency test to test the null hypothesis, because this is appropriate for analyzing for association between two categorical variables, and when the resulting contingency table has dimension greater than 2 x 2.
- We will use the \(\chi\)2 test statistic, with degrees of freedom equal to (r-1)(c-1), where “r” is the number of rows, and “c” is the number of colums, so (2-1)(3-1) = 2.
- We must check assumptions of the \(\chi\)2 contingency test
- It is always a good idea to present a figure to accompany your analysis; in the case of a contingency test, the figure heading will include information about the sample size / total number of observations
12.5.2 Display the contingency table
Let’s generate a contingency table:
## fate highly lightly uninfected Total
## eaten 37 10 1 48
## not eaten 9 35 49 93
## Total 46 45 50 141
Hmm, the ordering of the categories of the categorical (factor) variable “infection” is the opposite to what is displayed in the text.
TIP The ordering of the categories is not actually crucial to this type of analysis, but it’s certainly better practice to show them in appropriate order!
In order to change the order of categories in a character variable, here’s what you do (consult this resource for additional info).
First, convert the character variable to a “factor” variable as follows:
Now check the default ordering of the levels using the levels
function (it should be alphabetical):
## [1] "highly" "lightly" "uninfected"
Now change the ordering as follows:
worm$infection <- factor(worm$infection, levels = c("uninfected", "lightly", "highly"))
levels(worm$infection)
## [1] "uninfected" "lightly" "highly"
Now re-display the contingency table, first storing it in an object “worm.table”:
## fate uninfected lightly highly Total
## eaten 1 10 37 48
## not eaten 49 35 9 93
## Total 50 45 46 141
That’s better!
Visualize it nicely with kable
:
fate | uninfected | lightly | highly | Total |
---|---|---|---|---|
eaten | 1 | 10 | 37 | 48 |
not eaten | 49 | 35 | 9 | 93 |
Total | 50 | 45 | 46 | 141 |
12.5.3 Visualize a mosaic plot
Here’s a mosaic plot with an ideal figure caption included:
worm %>%
ggplot() +
geom_mosaic(aes(x = product(infection), fill = fate)) +
scale_fill_manual(values=c("darkred", "gold")) +
scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
xlab("Level of infection") +
ylab("Relative frequency") +
theme_bw()
In the code above we manually specified the two fill colours using the scale_fill_manual
function.
12.5.4 Check the assumptions
The \(\chi\)2 contingency test (also known as association test) has assumptions that must be checked prior to proceeding:
- none of the categories should have an expected frequency of less than one
- no more than 20% of the categories should have expected frequencies less than five
To test these assumptions, we need to actually conduct the test, because in doing so R calculates the expected frequencies for us.
Conduct the test using the chisq.test
function from janitor
package. NOTE this again overlaps with a function name from the base R package, so we’ll need to specify that we want the “janitor” version of the function.
We need our contingency table as input to the function, but this time without margin totals.
We’ll assign the results to an object “worm.chisq.results”:
Have a look at the output:
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 69.756, df = 2, p-value = 7.124e-16
Although only a few bits of information are provided here, the object actually contains a lot more information.
Don’t try getting an overview of the object using our usual skim_without_charts
approach.. that won’t work.
Instead, simply use this code:
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
As you can see, one of the names is expected
. This is what holds our expected frequencies (and note that these values do not need to be whole numbers, unlike the observed frequencies):
fate | uninfected | lightly | highly | |
---|---|---|---|---|
eaten | eaten | 17.02128 | 15.31915 | 15.65957 |
not eaten | not eaten | 32.97872 | 29.68085 | 30.34043 |
We see that all our assumptions are met: none of the cells (cross-classified categories) in the table have an expected frequency of less than one, and no more than 20% of the cells have expected frequencies less than five.
12.5.5 Get the results of the test
We can see the results of the \(\chi\)2 test by simply typing the name of the results object:
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 69.756, df = 2, p-value = 7.124e-16
This shows a very large value of \(\chi\)2 (69.76) and a very small P-value - much smaller than our stated \(\alpha\). So we reject the null hypothesis.
Concluding statement
The probability of being eaten is significantly associated with the level of trematode parasitism (\(\chi\)2 contingency test; df = 2; \(\chi\)2 = 69.76; P < 0.001). Based on our mosaic plot (Fig. X), the probability of being eaten increases substantially with increasing intensity of parasitism.