7.4 Describing a numerical variable grouped by a categorical variable
In this tutorial you’ll learn how to calculate descriptive statistics for a numerical variable grouped according to categories of a categorical variable.
For example, a common scenario in biology is to calculate and report the mean and standard deviation of a response variable for different “treatment groups” in an experiment. (More commonly we would report the mean and standard error, but that’s for a later tutorial!)
It is straightforward to modify the code we used in the preceding tutorial to do what we want.
Specifically, we use the group_by function from the dplyr package to tell R to do the calculations on the observations within each category of the grouping variable.
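To see what grouping changes, here is a minimal sketch on a small made-up tibble. The names toy, treatment and growth_mm are hypothetical and are not part of the penguins data; the only assumption is that the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: two treatment groups, one missing value
toy <- tibble(
  treatment = c("control", "control", "control", "heated", "heated", "heated"),
  growth_mm = c(10, 12, 14, 20, 22, NA)
)

# Without group_by(): a single row summarising all observations together
toy.overall <- toy %>%
  summarise(Mean = mean(growth_mm, na.rm = TRUE))
toy.overall

# With group_by(): one row per category of the grouping variable
toy.bygroup <- toy %>%
  group_by(treatment) %>%
  summarise(Mean = mean(growth_mm, na.rm = TRUE))
toy.bygroup
```

The only difference between the two pipelines is the group_by line, which is why the same summarise code can be reused unchanged.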
For example, let’s describe penguin body mass grouped by “species”.
We’ll create a new tibble object called “penguins.descriptors.byspecies”, inserting one line of code that uses the group_by function to tell R which categorical variable to use for the grouping (here, “species”):
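# Assumes dplyr is loaded and the penguins tibble is available
penguins.descriptors.byspecies <- penguins %>%
  group_by(species) %>%   # do the calculations within each species
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE),
    Median = median(body_mass_g, na.rm = TRUE),
    InterQR = IQR(body_mass_g, na.rm = TRUE),
    Count = n() - naniar::n_miss(body_mass_g),   # non-missing observations
    Count_NA = naniar::n_miss(body_mass_g))      # missing observations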
It’s that simple!
Let’s have a look at the output:
## # A tibble: 3 × 7
##   species    Mean    SD Median InterQR Count Count_NA
##   <fct>     <dbl> <dbl>  <dbl>   <dbl> <int>    <int>
## 1 Adelie    3701. 458.6   3700   650     151        1
## 2 Chinstrap 3733. 384.3   3700   462.5    68        0
## 3 Gentoo    5076. 504.1   5000   800     123        1
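The Count and Count_NA columns rely on the naniar package. If naniar is not installed, the same two counts can be obtained with is.na alone. The sketch below shows this on made-up data (toy, group and mass_g are hypothetical names), so treat it as an alternative rather than the tutorial's own approach.

```r
library(dplyr)

# Hypothetical measurements with one missing value in group "b"
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  mass_g = c(3500, 3700, 5000, NA, 5200)
)

toy.counts <- toy %>%
  group_by(group) %>%
  summarise(
    Count = sum(!is.na(mass_g)),    # non-missing observations per group
    Count_NA = sum(is.na(mass_g))   # missing observations per group
  )
toy.counts
```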
- Use the kable function to output this new tibble in a nice format
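One possible way to approach this is sketched below, assuming kable comes from the knitr package. So that the example runs on its own, it uses a stand-in data frame called descriptors (a hypothetical name), typed in from the printed output above with values rounded as displayed; in practice you would pass penguins.descriptors.byspecies to kable instead. The caption text is also just an illustration.

```r
library(knitr)

# Stand-in for penguins.descriptors.byspecies, copied from the printed
# output (rounded as displayed) so that this example is self-contained
descriptors <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  Mean = c(3701, 3733, 5076),
  SD = c(458.6, 384.3, 504.1),
  Count = c(151, 68, 123)
)

# kable() formats a data frame or tibble as a table;
# digits sets the rounding and caption adds a title
descriptors.table <- kable(
  descriptors,
  digits = 1,
  caption = "Penguin body mass (g) by species"
)
descriptors.table
```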