9.3 Sampling error
Sampling error is the chance difference, caused by sampling, between an estimate and the population parameter being estimated.
Here we’ll get a feel for sampling error using the human gene data.
Let’s use the `slice_sample` function to randomly sample n = 20 rows from the `genelengths` tibble, and store them in a tibble object called “randsamp1”:
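A minimal sketch of such a chunk, assuming the dplyr package is installed and that `genelengths` was imported earlier in the tutorial with a numeric `size` column. The guarded stand-in is simulated with arbitrary parameters purely so the sketch runs on its own; it is not the real gene data, so results computed from it will not match the values reported in this section.

```r
library(dplyr)

# Assumption: `genelengths` was imported earlier in the tutorial.
# If it is missing, build a simulated stand-in (NOT the real gene data;
# the lognormal parameters are arbitrary) so this sketch runs on its own.
if (!exists("genelengths")) {
  genelengths <- tibble::tibble(
    size = round(rlnorm(20000, meanlog = 7.9, sdlog = 0.7))
  )
}

# Set the seed so that everyone draws the same random sample
set.seed(12)

# Sample 20 rows without replacement, keeping only the "size" variable
randsamp1 <- genelengths %>%
  slice_sample(n = 20, replace = FALSE) %>%
  select(size)
```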
In the preceding chunk, we:
- set the seed (here using integer 12), so that everyone gets the same output for this code chunk
- tell R that we want the output from our code to be stored in an object named “randsamp1”
- use the `slice_sample` function to randomly sample 20 rows from the `genelengths` tibble, and to do so without replacement
- use the `select` function to tell R to return only the “size” variable from the newly generated (sampled) tibble
Now let’s use the `mean` and `sd` base functions to calculate the mean and standard deviation of our sample.
We’ll assign the calculations to an object “randsamp1.mean.sd”, then we’ll present the output using the `kable` function.
randsamp1.mean.sd <- randsamp1 %>%
  summarise(
    Mean_genelength = mean(size, na.rm = TRUE),
    SD_genelength = sd(size, na.rm = TRUE)
  )
Now present the output using the `kable` function so we can control the maximum number of digits shown.
Using `kable` to present tibble outputs ensures that, when knitted, your output shows a sufficient number of decimal places, something that doesn’t always happen without it.
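A minimal sketch of that call, assuming the knitr package is installed. The `digits = 4` value is an illustrative choice, not necessarily the setting used by the tutorial, and the guarded stand-in simply reuses the two values reported in the table that follows so the sketch runs on its own.

```r
library(knitr)

# Assumption: `randsamp1.mean.sd` was created by the summarise() call above.
# If it is missing, fall back on the values reported in this section.
if (!exists("randsamp1.mean.sd")) {
  randsamp1.mean.sd <- data.frame(
    Mean_genelength = 3408.25,
    SD_genelength = 1443.698
  )
}

# `digits` sets the maximum number of decimal places shown for numeric
# columns (4 is an assumed value)
kable(randsamp1.mean.sd, digits = 4)
```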
Mean_genelength | SD_genelength |
---|---|
3408.25 | 1443.698 |
Do your numbers match those above?
Let’s now draw another random sample of the same size (20), making sure to give the resulting tibble object a different name (“randsamp2”). Here, we won’t set the seed again, because all that is required is for everyone to get a different sample from the first one above; we don’t need to have everyone get the same sample, but it’s ok if we do.
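A minimal sketch of this second draw, again assuming dplyr is installed and `genelengths` is already available. The guarded stand-in is simulated (not the real gene data) so the sketch runs on its own. Because no seed is set, the sampled rows will generally differ from one run to the next.

```r
library(dplyr)

# Assumption: `genelengths` was imported earlier in the tutorial.
# If it is missing, build a simulated stand-in (NOT the real gene data).
if (!exists("genelengths")) {
  genelengths <- tibble::tibble(
    size = round(rlnorm(20000, meanlog = 7.9, sdlog = 0.7))
  )
}

# No set.seed() call here, so this sample is not reproducible by design
randsamp2 <- genelengths %>%
  slice_sample(n = 20, replace = FALSE) %>%
  select(size)
```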
And calculate the mean and sd:
randsamp2.mean.sd <- randsamp2 %>%
  summarise(
    Mean_genelength = mean(size, na.rm = TRUE),
    SD_genelength = sd(size, na.rm = TRUE)
  )
And show using the `kable` function:
Mean_genelength | SD_genelength |
---|---|
4381.35 | 2081.471 |
Are they the same as we saw using the first random sample? NO! This reflects sampling error.
- Repeat the process above to get a third sample of size 20 genes, and calculate the mean and standard deviation of gene length for that sample.