Numeric variables are described with measures of centre and spread.
Before calculating descriptive statistics for a numeric variable, it is advisable to visualize its frequency distribution first. Why? Because characteristics of the frequency distribution will govern which measures of centre and spread are more reliable or representative.
It is often the case, however, that all four measures (mean, standard deviation, median, and IQR) are presented together.
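To see why the shape of the distribution matters, here is a minimal sketch using two made-up vectors (hypothetical values, not taken from any dataset) that differ in only one extreme value. The helper name describe is an assumption made for this illustration.

```r
# Hypothetical toy data: identical except for one extreme value
symmetric <- c(2, 3, 4, 5, 6)
skewed    <- c(2, 3, 4, 5, 60)

# Collect the four descriptors for a numeric vector
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

round(describe(symmetric), 2)
round(describe(skewed), 2)
```

In this sketch the median and IQR are 4 and 2 for both vectors, whereas the mean jumps from 4 to 14.8 and the standard deviation grows considerably. That is the sense in which the median and IQR are more representative when a distribution is skewed or has outliers.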
The dplyr package, which is loaded with the tidyverse, has a handy summarise (or summarize) function for calculating descriptive statistics.
Check out its help file by copying the following code into your command console:
?summarise
Let’s use the penguins dataset for our demonstrations.
The first step is to visualize the frequency distribution of the variable we want to describe, here body_mass_g. Given that this is a numeric variable, we do this using a histogram, as we learned in a previous tutorial.
ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 500, colour = "black", fill = "lightgrey") +
  xlab("Body mass (g)") +
  ylab("Frequency") +
  theme_bw()
We are reminded that the distribution of body mass is moderately positively skewed, and thus asymmetric, with a single mode near 3500 g. There are no obvious outliers in the distribution.
This means that the median and IQR should be the preferred descriptors of centre and spread, respectively.
So let’s calculate the median and IQR of body mass for all penguins. Let’s provide the code, then explain after:
penguins %>%
  summarise(
    median_body_mass_g = median(body_mass_g),
    IQR_body_mass_g = IQR(body_mass_g)
  )
Uh oh! If you tried to run this code, it would have given you an error.
We forgot that when we previously got an overview of the penguins dataset we discovered there were missing values (“NA” values)!
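If there are “NA” values in the variable being analyzed, some R functions, such as mean and median, will simply return “NA”, while others, such as IQR, stop with an error. To remedy this, we use the “na.rm = TRUE” argument, which removes the missing values before the calculation.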
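As a minimal sketch of this behaviour, the following uses a small made-up vector (hypothetical values, not the penguins data) containing one missing value. The names mass and iqr_attempt are assumptions made for this illustration.

```r
# Hypothetical body masses (g) with one missing value
mass <- c(3500, 4200, NA, 3900)

mean(mass)     # returns NA
median(mass)   # returns NA

# IQR() stops with an error instead; tryCatch() captures it
# so that the rest of the script keeps running
iqr_attempt <- tryCatch(IQR(mass), error = function(e) "error")
iqr_attempt

# With na.rm = TRUE the NA is dropped before calculating
mean(mass, na.rm = TRUE)
median(mass, na.rm = TRUE)   # 3900
IQR(mass, na.rm = TRUE)      # 350
```

Without “na.rm = TRUE” a single missing value is enough to spoil the result, which is why the argument is added to every function inside summarise.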
Let’s try our code again, adding the “na.rm = TRUE” argument. And note that the key functions called within the summarise function are median and IQR (case sensitive!).
penguins %>%
  summarise(
    Median = median(body_mass_g, na.rm = TRUE),
    InterQR = IQR(body_mass_g, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   Median InterQR
##    <dbl>   <dbl>
## 1   4050    1200
In the preceding code chunk, we have:
- The name of the tibble (here penguins) being used in the subsequent functions
- A pipe “%>%”, which passes the tibble to the next function and tells R we’re not done coding
- The summarise function (summarize will work too), telling R we’re going to calculate a new variable
- The name we’ll give to the first variable we’re creating; here we call the variable “Median” (the “M” is capitalized to distinguish this variable name from the function median)
- And we define how to calculate the “Median”, here using the median function
- We feed the variable of interest from the penguins tibble, “body_mass_g”, to the median function, along with the argument “na.rm = TRUE”
- We end the line with a comma, telling R that we’re not done providing arguments to the summarise function
- We do the same for the inter-quartile range variable we’re creating called “InterQR”, calculating the value using the IQR function, and this time no comma at the end of the line, because this is the last argument being provided to the summarise function
- We close out the parentheses for the summarise function
Although the median and IQR are the preferred descriptors for the body_mass_g variable, it is nonetheless commonplace to report the mean and standard deviation also.
Let’s do this, and while we’re at it, include even more descriptors to illustrate how they’re calculated.
This time we’ll put the output from our summarise function into a table, and then present it in a nice format, like we learned how to do for a frequency table.
Let’s create the table of descriptive statistics first, a tibble called “penguins.descriptors”, and we’ll describe what’s going on after (NOTE this code chunk was edited slightly on Sept. 30, 2021):
penguins.descriptors <- penguins %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE),
    Median = median(body_mass_g, na.rm = TRUE),
    InterQR = IQR(body_mass_g, na.rm = TRUE),
    Count = n() - naniar::n_miss(body_mass_g),
    Count_NA = naniar::n_miss(body_mass_g))
The first 4 descriptive statistics are self-explanatory based on their variable names.
The last two, “Count” and “Count_NA”, give the total number of complete observations in the body_mass_g variable (thus the number of observations that went into calculating the descriptive statistics), and the total number of missing values (NAs) in the variable, respectively.
The last two lines of code above require further explanation:
The line Count = n() - naniar::n_miss(body_mass_g) tells R to first tally the total sample size using the n() function, then to subtract from that the total number of missing values, which is calculated using the n_miss function from the naniar package.
The double colons in naniar::n_miss(body_mass_g) indicate that the function n_miss comes from the naniar package. This syntax, which we have not used previously, provides a failsafe way to run a function even if the package is not presently loaded (the package must still be installed).
The same coding approach is used in the last line: Count_NA = naniar::n_miss(body_mass_g).
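The counting logic can be sketched in base R on a small made-up vector (hypothetical values). Here length stands in for n(), and sum(is.na(...)) is a base R way to count missing values, used in place of n_miss so the sketch runs without naniar. The variable names are assumptions made for this illustration.

```r
# Hypothetical vector with two missing values
mass <- c(3500, NA, 4200, 3900, NA)

n_total    <- length(mass)          # plays the role of n() inside summarise
n_missing  <- sum(is.na(mass))      # base R count of NA values
n_complete <- n_total - n_missing   # observations that enter the statistics

c(Count = n_complete, Count_NA = n_missing)

# The package::function syntax works without a library() call;
# the stats package ships with R, so it is always installed
stats::median(mass, na.rm = TRUE)
```

Here 5 values minus 2 missing leaves 3 complete observations, and those 3 are what the median is calculated from.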
TIP It is important to calculate the total number of complete observations in the variable of interest, because, as described in the Biology Procedures and Guidelines document, this number needs to be reported in figure and table headings.
Now let’s show the table of descriptive statistics, using the kable function we learned about in a previous tutorial.
penguins.descriptors %>%
  kable(caption = "Descriptive statistics of measurements of body mass (g) for 342 penguins", digits = 3)
In another tutorial we’ll learn how to present the table following all the guidelines in the Biology Guidelines and Procedures document, including, for example, significant digits. For now, the preceding table is good!
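Because the number of complete observations belongs in the table heading, one option is to build the caption from the calculated count rather than typing the number by hand. The following is a minimal sketch under stated assumptions: descriptors is a hypothetical one-row stand-in for the table of descriptive statistics, with Count set to the 342 reported in the caption above, and table_caption is an illustrative name.

```r
# Hypothetical one-row table; Count = 342 matches the caption used above
descriptors <- data.frame(Count = 342)

# Paste the calculated count into the caption text
table_caption <- paste0(
  "Descriptive statistics of measurements of body mass (g) for ",
  descriptors$Count, " penguins"
)
table_caption

# Only attempt the table if the knitr package is installed
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(descriptors, caption = table_caption, digits = 3))
}
```

With this approach the caption stays correct if the data, and therefore the count, ever change.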
- Descriptive statistics: Create a histogram and table of descriptive statistics for the “flipper_length_mm” variable in the