9.5 Standard error of the mean

The standard error of the mean (SEM) is calculated with the following formula, which uses the sample standard deviation s and the sample size n because the population standard deviation is usually unknown:

\[SE_{\bar{Y}} = \frac{s}{\sqrt{n}}\]

This is one measure of uncertainty that we report alongside our sample-based estimate of the population mean.

It estimates the standard deviation of the sampling distribution of the mean, so it is a measure of spread for that distribution.
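A small simulation can make this idea concrete. The sketch below uses base R only and a made-up, right-skewed population; the object names (`population`, `sample_means`, and so on) are hypothetical, and the values are not the gene length data. It draws many random samples of size 20, computes the mean of each, and compares the standard deviation of those means with the population standard deviation divided by the square root of n.

```r
# Hypothetical population of 10,000 right-skewed values (NOT the gene length data)
set.seed(1)
population <- rexp(10000, rate = 1 / 3000)

n <- 20

# Draw 5,000 random samples of size n (without replacement) and keep each sample mean
sample_means <- replicate(5000, mean(sample(population, size = n)))

# Standard deviation of the simulated sampling distribution of the mean
sd_of_means <- sd(sample_means)

# Standard error computed from the population standard deviation
theoretical_se <- sd(population) / sqrt(n)

c(simulated = sd_of_means, theoretical = theoretical_se)
```

With this many repeated samples the two numbers should be close, though not identical, because the simulated value is itself an estimate.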

Calculating the SEM in R

To illustrate how to calculate the SEM, we’ll make use of the gene length dataset again.

Let’s take a random sample of 20 genes, making sure to set the seed this time so everyone gets the same result. We’ll use a seed number of 29, and we’ll save our sample of values from the variable “size” in an object (tibble) called “newsamp.n20”:

set.seed(29)
newsamp.n20 <- genelengths %>%
  slice_sample(n = 20, replace = FALSE) %>%
  select(size)

Now let’s show the code for calculating the SEM for the mean calculated using values in the “size” variable from our “newsamp.n20” object. We’ll output stats to a new object “newsamp.n20.stats” then display the object using kable.

newsamp.n20.stats <- newsamp.n20 %>%
  summarise(
    Count = n() - naniar::n_miss(size),
    Mean_genelength = mean(size, na.rm = TRUE),
    SD_genelength = sd(size, na.rm = TRUE),
    SEM = SD_genelength/sqrt(Count)
  )

You’ve seen all but the last line of this code before! Importantly, as we’ve seen before, the “Count” variable, calculated using the n function in combination with the n_miss function from the naniar package, tallies the number of non-missing observations in the variable of interest (here, “size”). This is the sample size that we’ll need for the calculation of the SEM.

The last line creates a new variable “SEM” that is calculated using the formula shown above, and inputs from the previous lines’ calculations.

Specifically, the sample size “n” value comes from the “Count” variable, and the “s” value comes from the “SD_genelength” variable. And lastly, we use the sqrt function to take the square-root of the sample size.

Display the output:

kable(newsamp.n20.stats, digits = 4)

Count	Mean_genelength	SD_genelength	SEM
20	3563.7	3747.552	837.9782
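As a quick check, the SEM in the table can be reproduced by hand from the reported standard deviation and sample size, using the formula for the standard error of the mean:

\[SE_{\bar{Y}} = \frac{s}{\sqrt{n}} = \frac{3747.552}{\sqrt{20}} \approx \frac{3747.552}{4.4721} \approx 837.98\]

This agrees, up to rounding, with the value in the “SEM” column.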

If you wish to calculate and report only the mean and SEM for a variable (here, the “size” variable from our tibble “newsamp.n20”), you can use a shorter version of the code:

newsamp.n20.stats.short <- newsamp.n20 %>%
  summarise(
    Mean_genelength = mean(size, na.rm = TRUE),
    SEM = sd(size, na.rm = TRUE)/sqrt(n() - naniar::n_miss(size))
  )

Display the output:

kable(newsamp.n20.stats.short, digits = 4)

Mean_genelength	SEM
3563.7	837.9782

TIP It is a good idea to use the longer approach that reports the sample size (“Count”), mean, standard deviation, and SEM. Why? Because in this way you’re reporting each value (n and s) that goes into the calculation of the SEM.
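If you find yourself repeating the longer approach for several variables, one option is to wrap it in a small function. The sketch below is an illustration in base R, not part of any package: the function name `describe_with_sem` and the example values are hypothetical. It drops missing values first, then returns the count, mean, standard deviation, and SEM together.

```r
# Hypothetical helper, written in base R for illustration only
describe_with_sem <- function(x) {
  x <- x[!is.na(x)]   # keep only non-missing observations
  n <- length(x)      # sample size (non-missing values only)
  s <- sd(x)          # sample standard deviation
  data.frame(Count = n, Mean = mean(x), SD = s, SEM = s / sqrt(n))
}

# Made-up values, not taken from the gene length dataset
example_values <- c(1200, 3400, NA, 560, 2800, 4100)
result <- describe_with_sem(example_values)
result
```

Because the function returns n and s next to the SEM, a reader can verify the SEM directly from the reported values.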