7.4 Describing a numerical variable grouped by a categorical variable

In this tutorial you’ll learn how to calculate descriptive statistics for a numerical variable grouped according to categories of a categorical variable.

For example, a common scenario in biology is to calculate and report the mean and standard deviation of a response variable for each “treatment group” in an experiment. (More commonly we would report the mean and standard error, but that’s for a later tutorial!)

It is straightforward to modify the code we used in the preceding tutorial to do what we want.

Specifically, we use the group_by function from the dplyr package to tell R to do the calculations on the observations within each category of the grouping variable.
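
To see what grouping does on its own, here is a minimal sketch using a small made-up tibble. The object names (toy, toy.overall, toy.bygroup) and the columns "group" and "value" are hypothetical and are not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy data: two groups, with one missing value in group "b"
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, NA)
)

# Without grouping: a single row describing all observations together
toy.overall <- toy %>%
  summarise(Mean = mean(value, na.rm = TRUE), Count = n())

# With grouping: one row per category of "group"
toy.bygroup <- toy %>%
  group_by(group) %>%
  summarise(Mean = mean(value, na.rm = TRUE), Count = n())

toy.bygroup
```

Note that n() counts every row in a group, including rows where the value is missing, so a count of non-missing observations needs the number of missing values subtracted.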

For example, let’s describe penguin body mass grouped by “species”.

We’ll create a new tibble object called “penguins.descriptors.byspecies”, inserting one line of code that uses the group_by function to tell R which categorical variable to use for the grouping (here, “species”):

penguins.descriptors.byspecies <- penguins %>%
  group_by(species) %>%  # do the calculations within each species
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE),
    Median = median(body_mass_g, na.rm = TRUE),
    InterQR = IQR(body_mass_g, na.rm = TRUE),
    Count = n() - naniar::n_miss(body_mass_g),  # non-missing observations
    Count_NA = naniar::n_miss(body_mass_g))     # missing observations

It’s that simple!

Let’s have a look at the output:

## # A tibble: 3 × 7
##   species    Mean    SD Median InterQR Count Count_NA
##   <fct>     <dbl> <dbl>  <dbl>   <dbl> <int>    <int>
## 1 Adelie    3701. 458.6   3700   650     151        1
## 2 Chinstrap 3733. 384.3   3700   462.5    68        0
## 3 Gentoo    5076. 504.1   5000   800     123        1
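
If you want to convince yourself that the grouped calculation does what we expect, one possible check is to filter the data to a single species and compare the result with the matching row of the grouped summary. This is only a sketch: it assumes the penguins data come from the palmerpenguins package, and the object names adelie.check and grouped.check are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)  # assumption: the source of the "penguins" data

# Describe only the Adelie penguins by filtering first
adelie.check <- penguins %>%
  filter(species == "Adelie") %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE))

# The same quantities taken from the grouped approach
grouped.check <- penguins %>%
  group_by(species) %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE)) %>%
  filter(species == "Adelie")

# TRUE if the two approaches agree
all.equal(adelie.check$Mean, grouped.check$Mean)
all.equal(adelie.check$SD, grouped.check$SD)
```
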

1. Use the kable function to output this new tibble in a nice format.
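
One possible way to approach this is sketched below. It assumes that kable comes from the knitr package and that the penguins data come from the palmerpenguins package. The summary is shortened to two descriptors to keep the example brief, and the digits value is an arbitrary choice.

```r
library(dplyr)
library(knitr)
library(palmerpenguins)  # assumption: the source of the "penguins" data

# A shortened version of the grouped summary
penguins.descriptors.byspecies <- penguins %>%
  group_by(species) %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE))

# Format the tibble as a table, rounding numeric columns to 1 decimal place
penguins.table <- kable(penguins.descriptors.byspecies, digits = 1)
penguins.table
```

The digits argument controls rounding in the displayed table only; the values stored in the tibble are unchanged.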