In this tutorial you’ll learn how to calculate descriptive statistics for a numerical variable grouped according to categories of a categorical variable.
For example, a common task in biology is to calculate and report the mean and standard deviation of a response variable for different “treatment groups” in an experiment. (More commonly we would report the mean and standard error, but that’s for a later tutorial!)
It is straightforward to modify the code we used in the preceding tutorial to do what we want.
Specifically, we use the group_by function from the dplyr package to tell R to do the calculations on the observations within each category of the grouping variable.
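To see what grouping changes, here is a minimal sketch on a small made-up tibble (the object `toy` and its columns `group` and `value` are hypothetical and are not part of the penguins data). Without group_by, summarise returns a single row for the whole dataset; with group_by, it returns one row per category.

```r
library(dplyr)

# Hypothetical toy data: two categories with two observations each
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20))

# Without grouping: one row summarising all four observations
overall <- toy %>%
  summarise(Mean = mean(value))

# With grouping: one row per category of "group"
by_group <- toy %>%
  group_by(group) %>%
  summarise(Mean = mean(value))

overall
by_group
```

The only difference between the two pipelines is the group_by line, which is the same one-line change used for the penguins data.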
For example, let’s describe penguin body mass grouped by “species”.
We’ll create a new tibble object called “penguins.descriptors.byspecies”, inserting one line of code that uses the group_by function to tell R which categorical variable to use for the grouping (here, “species”):
penguins.descriptors.byspecies <- penguins %>%
  group_by(species) %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE),
    Median = median(body_mass_g, na.rm = TRUE),
    InterQR = IQR(body_mass_g, na.rm = TRUE),
    Count = n() - naniar::n_miss(body_mass_g),
    Count_NA = naniar::n_miss(body_mass_g))
It’s that simple!
Let’s have a look at the output:
## # A tibble: 3 × 7
##   species    Mean    SD Median InterQR Count Count_NA
##   <fct>     <dbl> <dbl>  <dbl>   <dbl> <int>    <int>
## 1 Adelie    3701. 458.6   3700   650     151        1
## 2 Chinstrap 3733. 384.3   3700   462.5    68        0
## 3 Gentoo    5076. 504.1   5000   800     123        1
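The Count and Count_NA columns separate the observations that were actually measured from the missing ones. As a rough illustration of that arithmetic, the sketch below uses a hypothetical vector `x` and base R only: `sum(is.na(x))` counts missing values, which is what naniar::n_miss does, and `length(x)` stands in for n(), which only works inside summarise.

```r
# Hypothetical body mass values, two of them missing
x <- c(3750, NA, 3800, 3250, NA)

# Number of missing values (base R counterpart of naniar::n_miss(x))
count_na <- sum(is.na(x))

# Number of non-missing values: total observations minus missing ones
count <- length(x) - count_na

count_na
count
```

Reporting both numbers makes it clear how many observations each mean and standard deviation is based on.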
- Use the kable function to output this new tibble in a nice format
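One possible way to approach this is sketched below. It assumes that kable refers to `knitr::kable` and that the penguins data come from the palmerpenguins package; neither is stated explicitly here. To keep the example short it builds an abbreviated summary under a hypothetical name, `mass.by.species`, rather than the full tibble.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data
library(knitr)           # assumption: provides the kable function

# Abbreviated grouped summary (hypothetical object name)
mass.by.species <- penguins %>%
  group_by(species) %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE))

# Format the tibble as a table, rounding numeric columns to 1 decimal place
species.table <- kable(mass.by.species, digits = 1)
species.table
```

The same call should work on the full tibble, for example `kable(penguins.descriptors.byspecies, digits = 1)`.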