9.3 Sampling error

Sampling error is the chance difference, caused by sampling, between an estimate and the population parameter being estimated.
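To make the definition concrete, here is a minimal sketch that uses a simulated population rather than the real gene data. The object names ("population", "pop_mean", "samp") and the log-normal settings are assumptions chosen purely for illustration. Because we simulate the whole population, we know the parameter exactly, which is rarely the case in practice.

```r
library(dplyr)

# Hypothetical population: 10,000 simulated "gene lengths" (NOT the real data)
set.seed(1)
population <- tibble(size = round(rlnorm(10000, meanlog = 7.5, sdlog = 1)))

# Population parameter: the true mean, known here only because we simulated it
pop_mean <- mean(population$size)

# Estimate: the mean of one random sample of 20 rows
samp <- population %>%
  slice_sample(n = 20, replace = FALSE)
samp_mean <- mean(samp$size)

# Sampling error: the estimate minus the parameter it estimates
sampling_error <- samp_mean - pop_mean
sampling_error
```

The value printed at the end is the sampling error for this one sample; a different sample would generally give a different value.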

Here we’ll get a feel for sampling error using the human gene data.

Let’s use the slice_sample function to randomly sample n = 20 rows from the genelengths tibble, and store them in a tibble object called “randsamp1”:

set.seed(12)
randsamp1 <- genelengths %>%
  slice_sample(n = 20, replace = FALSE) %>%
  select(size)

In the preceding chunk, we:

set the seed (here using integer 12), so that everyone gets the same output for this code chunk
tell R that we want the output from our code to be stored in an object named “randsamp1”
use the slice_sample function to randomly sample 20 rows from the genelengths tibble, and to do so without replacement
use the select function to tell R to return only the “size” variable from the newly generated (sampled) tibble
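If you'd like to see what setting the seed does, the following sketch may help. It uses a small made-up tibble called "toy" (an assumption for illustration, not the genelengths data) and draws the same kind of sample three times: twice right after setting the same seed, and once without resetting it.

```r
library(dplyr)

# Hypothetical toy tibble standing in for genelengths
toy <- tibble(id = 1:100, size = seq(500, 10400, by = 100))

# Two draws, each made immediately after setting the same seed
set.seed(12)
draw_a <- toy %>% slice_sample(n = 20, replace = FALSE) %>% select(size)

set.seed(12)
draw_b <- toy %>% slice_sample(n = 20, replace = FALSE) %>% select(size)

# Same seed, same random draw: this comparison returns TRUE
identical(draw_a, draw_b)

# A third draw without resetting the seed continues the random number
# stream, so it will almost certainly contain different rows
draw_c <- toy %>% slice_sample(n = 20, replace = FALSE) %>% select(size)
identical(draw_a, draw_c)
```

Because sampling is done without replacement, no row can appear twice within any one of these samples.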

Now let’s use the mean and sd functions, inside the summarise function, to calculate the mean and standard deviation of gene length in our sample.

We’ll assign the results to an object named “randsamp1.mean.sd”.

randsamp1.mean.sd <- randsamp1 %>%
  summarise(
    Mean_genelength = mean(size, na.rm = TRUE),
    SD_genelength = sd(size, na.rm = TRUE)
  )

Now present the output using the kable function so we can control the maximum number of decimal places shown.

Presenting tibble output with kable ensures that your knitted document shows enough decimal places, something that doesn’t always happen when a tibble is printed directly.
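As a small sketch of how the digits argument behaves, the example below applies kable to a made-up data frame. The values and column names are hypothetical, chosen only because they have many decimal places. According to the knitr documentation, digits is passed to round, so it sets a maximum number of decimal places rather than a fixed number.

```r
library(knitr)

# Hypothetical summary values with many decimal places (not real results)
stats <- data.frame(Mean_x = 12.3456789, SD_x = 0.987654321)

# Each numeric column is rounded to at most 4 decimal places before display
kable(stats, digits = 4)
```

Values that already have fewer than four decimal places are shown without padding, which is why a table produced with digits = 4 can still show only two or three decimals.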

kable(randsamp1.mean.sd, digits = 4)

Mean_genelength	SD_genelength
3408.25	1443.698

Do your numbers match those above?

Let’s now draw another random sample of the same size (20), storing it in a tibble object with a different name (“randsamp2”). We won’t set the seed again here: all we need is a sample that differs from the first one, so it doesn’t matter whether everyone gets the same second sample.

randsamp2 <- genelengths %>%
  slice_sample(n = 20, replace = FALSE) %>%
  select(size)

And calculate the mean and sd:

randsamp2.mean.sd <- randsamp2 %>%
  summarise(
    Mean_genelength = mean(size, na.rm = TRUE),
    SD_genelength = sd(size, na.rm = TRUE)
  )

And show using the kable function:

kable(randsamp2.mean.sd, digits = 4)

Mean_genelength	SD_genelength
4381.35	2081.471

Are these values the same as those from the first random sample? NO! The difference between the two sets of estimates reflects sampling error.
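The same pattern shows up whenever we sample repeatedly. As a rough sketch, the code below wraps the sample-then-summarise steps in a helper function and applies it three times to a simulated tibble. Both the helper ("sample_mean_sd") and the simulated data ("toy_genes") are hypothetical names introduced here for illustration; they are not part of the gene data or of any package.

```r
library(dplyr)

# Hypothetical stand-in for the genelengths tibble (simulated, NOT the real data)
set.seed(2)
toy_genes <- tibble(size = round(rlnorm(5000, meanlog = 7.5, sdlog = 1)))

# Hypothetical helper: draw sample_size rows at random, then summarise them
sample_mean_sd <- function(data, sample_size = 20) {
  data %>%
    slice_sample(n = sample_size, replace = FALSE) %>%
    summarise(
      Mean_genelength = mean(size, na.rm = TRUE),
      SD_genelength = sd(size, na.rm = TRUE)
    )
}

# Three independent samples, stacked into one tibble for comparison
three_samples <- bind_rows(
  sample_mean_sd(toy_genes),
  sample_mean_sd(toy_genes),
  sample_mean_sd(toy_genes),
  .id = "sample"
)
three_samples
```

Each row holds the estimates from one sample of 20, so differences between rows are due to sampling error alone.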

Repeat the process above to get a third sample of 20 genes, and calculate the mean and standard deviation of gene length for that sample.