Which Statistical Test to Use

In this section we will cover the following inferential statistical tests:

Student's t-test
Analysis of Variance
Chi squared (\(\chi{^2}\)) goodness-of-fit
Chi squared (\(\chi{^2}\)) contingency test

Comparing Means Among Treatment Groups

Scenario 1: Continuous numeric Dependent Variable and Categorical Independent Variable

This scenario is typical of measured response experiments like our plant experiment.

In this scenario the typical approach is to compare the average value (the mean) of your dependent variable among the categories of the categorical independent variable (that is, the treatment groups).

Our approach aims to determine whether any of the treatment groups' means differ "significantly" from any of the other group means. At first glance, that seems straightforward: if the mean height of plants subjected to 30 degrees was 22.2 cm, and the mean height of plants subjected to 20 degrees was 18.2 cm, then clearly the warmer temperature treatment yielded a significantly greater average height, right?

We must remember: we're now conducting inferential statistics, so each of the treatment group means has uncertainty associated with it owing to sampling error, and that uncertainty should cause us some pause. We need to be convinced, with some level of confidence, that the difference we observe between the group means could not simply have arisen due to sampling error, but rather was most likely due to our experimental treatments.

We start by writing down our statistical null and alternative hypotheses:

H₀: The average height of plants grown under different temperatures does not differ after 14 days

H_A: The average height of plants grown under different temperatures does differ after 14 days

Next, as instructed in the previous section, we need to decide what "significance level" we wish to use. You'll learn in BIOL202 that there are many factors that influence this decision. For now, we'll follow the standard and use a 5% significance level. In short, this means that we're willing to accept about a 5% chance of a false positive, that is, of rejecting a null hypothesis that is actually true.
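To see what a 5% significance level means in practice, here is a minimal simulation sketch. It repeatedly "runs" an experiment in which both groups are drawn from the same population, so the null hypothesis is true by construction, and counts how often a Student's t-test (described later in this section) would wrongly reject it. The population mean and spread, the group size of 5, and the random seed are all assumptions chosen for illustration; the critical value 2.306 is the standard two-tailed table value for 8 degrees of freedom at the 5% level.

```python
import math
import random
import statistics

N_PER_GROUP = 5          # assumed group size (gives 5 + 5 - 2 = 8 degrees of freedom)
T_CRITICAL_DF8 = 2.306   # two-tailed critical t for alpha = 0.05, df = 8 (standard t table)


def t_statistic(group_a, group_b):
    """Pooled-variance (Student's) t statistic for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


def false_positive_rate(n_experiments=2000, seed=1):
    """Fraction of simulated experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the SAME hypothetical population (mean 20 cm, SD 2 cm).
        group_a = [rng.gauss(20.0, 2.0) for _ in range(N_PER_GROUP)]
        group_b = [rng.gauss(20.0, 2.0) for _ in range(N_PER_GROUP)]
        if abs(t_statistic(group_a, group_b)) >= T_CRITICAL_DF8:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate()
print(f"Proportion of false positives: {rate:.3f}")
```

The printed proportion should land close to 0.05: even when the treatment does nothing, sampling error alone produces a "significant" result in roughly 1 of every 20 experiments.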

Next we need to ask: how many categories (treatment groups) do we have in our categorical independent variable?

If there are only 2 categories (one of which should be a control group), then we perform something called a "Student's t-test".

If there are more than 2 categories then we conduct an "Analysis of Variance" (ANOVA). Yes – a strange name for a test that analyzes means!
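The name is less strange once you see how the test statistic is built. As a rough sketch, the "F" statistic is the ratio of the variation among the group means (the between-group mean square) to the variation among individuals within each group (the within-group mean square). The plant heights below are hypothetical values invented for illustration, and the function name is our own.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical plant heights (cm) after 14 days at three temperatures.
heights_20 = [17.5, 18.9, 18.0, 19.1, 17.5]
heights_25 = [20.1, 19.4, 20.8, 19.9, 20.3]
heights_30 = [21.0, 22.5, 23.1, 21.9, 22.5]

f_stat, df_between, df_within = one_way_anova_f([heights_20, heights_25, heights_30])
print(f"F = {f_stat:.2f} with {df_between} and {df_within} degrees of freedom")
```

A large F means the group means are spread far apart relative to the scatter within groups, which is why a test about means ends up analyzing variances.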

These tests, like all statistical tests, have some assumptions associated with them, and you'll learn more about those in BIOL202. For now, you will practice implementing these tests using a Shiny app.

As you'll see when you practice in the Shiny app, the test statistic provided by the Student's t-test is, you guessed it, a "t" statistic. If the absolute value of "t" calculated from your experiment data equals or exceeds the "critical value" of "t", then the P-value provided by the computer will be less than or equal to 0.05 (assuming this was the significance level you chose). In that case you reject the null hypothesis in favour of the alternative, and conclude that the average height of plants grown under different temperatures does differ after 14 days.

If, on the other hand, the absolute value of your calculated "t" is less than the critical value, and correspondingly the P-value is greater than 0.05, then you fail to reject the null hypothesis and conclude that, at present, the data are consistent with no difference among the average heights of plants grown under different temperatures. Remember, this doesn't necessarily mean the null hypothesis is true! We simply don't have sufficient evidence to suggest it's false.

The same procedure would be used for the ANOVA test, but in this case you evaluate the "F" statistic.
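The t-test decision rule can be sketched in a few lines of code. This is an illustration only, not what the Shiny app does internally: the plant heights are hypothetical values made up so that the group means match the 22.2 cm and 18.2 cm example above, the function name is our own, and the critical value 2.306 is the standard two-tailed table value for 8 degrees of freedom at the 5% significance level.

```python
import math
import statistics


def student_t(group_a, group_b):
    """Return (t, df) for a pooled-variance Student's t-test on two groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df


# Hypothetical plant heights (cm) after 14 days.
warm = [21.0, 22.5, 23.1, 21.9, 22.5]   # 30 degrees, mean 22.2
cool = [17.5, 18.9, 18.0, 19.1, 17.5]   # 20 degrees, mean 18.2

t_stat, df = student_t(warm, cool)

T_CRITICAL_DF8 = 2.306  # two-tailed, alpha = 0.05, df = 8 (standard t table)
reject_null = abs(t_stat) >= T_CRITICAL_DF8

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_null}")
```

With these made-up data the calculated "t" is far larger than the critical value, so we would reject the null hypothesis. With noisier data or fewer plants, the same 4 cm difference in means could easily fail to reach significance.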

Comparing Frequencies to a Baseline Expectation

Scenario 2: Categorical Dependent Variable Without an Independent Variable

This scenario is typical of choice experiments, like the one described above with mice choosing among food types. The single categorical response variable is "food type". You can imagine that if there was a strong preference for one of the three food types, then that food type would be chosen more frequently by the mice than the others. Alternatively, if there was no preference, then one would expect all three food types to be selected with similar frequency.

As skeptical scientists, this latter "status quo" scenario should be our working hypothesis, and evidence would need to be strong and clear to convince us otherwise. We need a way to quantitatively test this.

For this scenario, in which we have a single categorical dependent variable, we use something called a \(\chi{^2}\) goodness-of-fit test. When we say it aloud, we often call it the chi-squared test, and pronounce the Greek letter \(\chi\) as "kai". The test quantifies the "fit" of observed frequencies to those expected if nothing were going on, and it uses the \(\chi{^2}\) test statistic.

We now formulate suitably worded null and alternative hypotheses:

H₀: The three food types are chosen with equal frequency by the mice (or something similarly clear).

H_A: The three food types are NOT chosen with equal frequency by the mice (or something similarly clear).

Although the test is relatively straightforward to undertake, you can use the Shiny app to do it. In brief, as the overall difference between your observed frequencies (from the experiment) and those expected under the null hypothesis increases, the value of your calculated \(\chi{^2}\) increases. If it equals or exceeds the "critical value" of \(\chi{^2}\) that you established beforehand, then the P-value will be less than or equal to 0.05, and you reject the null hypothesis in favour of the alternative.
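As a minimal sketch of the calculation, the \(\chi{^2}\) statistic is the sum, over categories, of (observed - expected)² / expected. The counts below are hypothetical (60 mice choosing among three food types), and the function name is our own. With three categories there are 2 degrees of freedom, for which the standard table critical value at the 5% level is 5.991; for exactly 2 degrees of freedom the P-value also happens to have the simple closed form exp(-\(\chi{^2}\)/2), which the code uses instead of a statistics library.

```python
import math


def chi_squared_gof(observed, expected):
    """Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of mice choosing each of three food types.
observed = [30, 18, 12]
total = sum(observed)

# Null expectation: all three food types chosen with equal frequency.
expected = [total / len(observed)] * len(observed)

chi2 = chi_squared_gof(observed, expected)
df = len(observed) - 1

CHI2_CRITICAL_DF2 = 5.991          # alpha = 0.05, df = 2 (standard chi-squared table)
p_value = math.exp(-chi2 / 2)      # closed form valid ONLY when df = 2
reject_null = chi2 >= CHI2_CRITICAL_DF2

print(f"chi-squared = {chi2:.2f}, df = {df}, P = {p_value:.4f}, reject H0: {reject_null}")
```

Here the calculated value exceeds the critical value, so with these invented counts we would reject the null hypothesis of equal frequencies.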

If there are only 2 categories in the dependent variable, then the most powerful statistical test to use is a binomial test, but a \(\chi{^2}\) goodness-of-fit test will still work.
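For the two-category case, an exact two-sided binomial test can be sketched as follows. The approach shown (summing the probabilities of all outcomes that are no more likely than the one observed) is one common way to define the two-sided P-value; the counts are hypothetical, and the function name is our own.

```python
from math import comb


def binomial_two_sided_p(successes, n, p=0.5):
    """Exact two-sided binomial P-value under a null success probability p."""
    probabilities = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed_prob = probabilities[successes]
    # Sum every outcome at least as unlikely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(pr for pr in probabilities if pr <= observed_prob * (1 + 1e-9))


# Hypothetical result: 15 of 20 mice chose food type A over food type B.
p_value = binomial_two_sided_p(15, 20)
reject_null = p_value <= 0.05

print(f"P = {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up counts the P-value is just under 0.05, so the null hypothesis of no preference would be rejected at the 5% significance level.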

Scenario 3: Categorical Dependent Variable and One Categorical Independent Variable

This scenario is also typical of "choice experiments", and above we provided one example in which we hypothesized that female mice show a food preference whereas males do not. In this case, we plan to conduct something called a \(\chi{^2}\) contingency test, also called a \(\chi{^2}\) test of association. For example, if our research hypothesis were indeed correct, then the evidence would show that a preference for food type is contingent on the sex of the mouse.

The appropriate null and alternative hypotheses are:

H₀: The three food types are chosen with equal frequency by male and female mice.

H_A: The three food types are not chosen with equal frequency by male and female mice.

An alternative but less effective wording that is common to see is:

H₀: There is no association between food preference and sex.

H_A: There is an association between food preference and sex.

The latter statements are more ambiguous with respect to quantitative predictions. Nevertheless, they are acceptable.

Again, we decide in advance what significance level to use: 5% (0.05). Based on this significance level, and on the number of categories in our dependent and independent variables, we would figure out what the critical value of \(\chi{^2}\) is for our test. But in our case, we'll again use an online app for the test.

If our calculated value of \(\chi{^2}\) equals or exceeds the critical value of \(\chi{^2}\), then we reject the null hypothesis in favour of the alternative, and conclude that "Food preference is contingent on the sex of the mouse, because males and females chose the food types with different frequencies."
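The contingency test can be sketched in the same way as the goodness-of-fit test. The only new step is that the expected count for each cell is computed from the table itself, as (row total × column total) / grand total, and the degrees of freedom are (rows - 1) × (columns - 1). The counts below are hypothetical, and the function name is our own. A table with 2 sexes and 3 food types has 2 degrees of freedom, so the 5% critical value is again the standard table value of 5.991.

```python
import math


def chi_squared_contingency(table):
    """Return (chi-squared, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if food choice were independent of sex.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows are sex, columns are food types A, B, C.
counts = [
    [30, 10, 10],  # female mice
    [16, 18, 16],  # male mice
]

chi2, df = chi_squared_contingency(counts)

CHI2_CRITICAL_DF2 = 5.991       # alpha = 0.05, df = 2 (standard chi-squared table)
p_value = math.exp(-chi2 / 2)   # closed form valid ONLY when df = 2
reject_null = chi2 >= CHI2_CRITICAL_DF2

print(f"chi-squared = {chi2:.2f}, df = {df}, P = {p_value:.4f}, reject H0: {reject_null}")
```

With these invented counts the calculated \(\chi{^2}\) exceeds the critical value, so we would conclude that food choice is contingent on sex.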

For more details on how to report the results of statistical tests, refer to the UBCO Biology Guidelines for data presentation.