9.2 Functions for sampling

To illustrate the concepts of “sampling error” and the “sampling distribution of the mean”, we’re going to make use of datasets that include measurements for all individuals in a “population” of interest, and we’re going to take random samples of observations from those datasets. This mimics the process of sampling at random from a population.

There are two functions that are especially helpful for taking random samples from objects.

The sample function in base R is an easy way to take a random sample from a vector of values. Look at its help page:

?sample

The second useful function is the slice_sample function from the dplyr package (loaded with tidyverse). This function takes random samples of rows from a dataframe or tibble.

?slice_sample
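As a minimal sketch of how slice_sample behaves, the chunk below draws rows from a small made-up tibble. The tibble, its column names (id, height_cm), and its values are hypothetical and serve only as illustration; they are not one of the datasets used in this tutorial.

```r
library(dplyr)

# Hypothetical toy tibble standing in for a "population" of 8 individuals
population <- tibble(
  id = 1:8,
  height_cm = c(150, 162, 171, 158, 180, 166, 175, 169)
)

# Draw 3 rows at random, without replacement (the default)
sample_rows <- slice_sample(population, n = 3)

# Draw 12 rows with replacement; the same row can now appear more than once
resample_rows <- slice_sample(population, n = 12, replace = TRUE)

nrow(sample_rows)
nrow(resample_rows)
```

Because no seed is set here, the rows you get will differ from run to run.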

Note that both functions have a replace argument that dictates whether the sampling we conduct occurs with or without replacement. For instance, the following code will not work (try copying and pasting into your console):

sample(1:6, size = 10, replace = FALSE)

Here we’re telling R to sample 10 values from a vector that includes only 6 values (one through six), and with the replace = FALSE argument we tell R not to put the sampled numbers back in the pile for future sampling. Once all 6 values have been drawn there is nothing left to sample, so R returns an error.

If instead we specify replace = TRUE, as shown below, then we are sampling with replacement: each sampled value is placed back in the pool before the next draw, which allows the code to work (try it):

sample(1:6, size = 10, replace = TRUE)
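If you would like to compare the two situations in a single chunk without the error halting your code, one possible approach is to wrap the failing call in tryCatch. This is only a sketch; the object names result and rolls are made up for illustration.

```r
# Try to draw 10 values from 6 without replacement and capture the error message
result <- tryCatch(
  sample(1:6, size = 10, replace = FALSE),
  error = function(e) conditionMessage(e)
)
result  # a character string holding R's error message

# With replacement the same request succeeds and returns 10 values
rolls <- sample(1:6, size = 10, replace = TRUE)
length(rolls)
```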

Whether we sample with or without replacement depends on the question and context (see page 138 in Whitlock & Schluter). For most of the questions and contexts we deal with in this course, we do wish to sample with replacement. Why? Because we are typically assuming (or mimicking) random sampling from a very large population, in which case removing one individual barely changes the probability distribution of values among the individuals that remain. This is not the case if we’re sampling from a small population.
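A little arithmetic can make this concrete. The sketch below assumes a hypothetical population in which exactly half of the individuals have some trait, and asks what proportion of the remaining individuals have the trait after one such individual is removed. The function name prop_after_one_draw is made up for this illustration.

```r
# Proportion of individuals with the trait after removing one individual
# that has it, for a population of size n that starts out half-and-half
prop_after_one_draw <- function(n) {
  (n / 2 - 1) / (n - 1)
}

prop_after_one_draw(10)    # small population: 4/9, noticeably below 0.5
prop_after_one_draw(1e6)   # large population: still very close to 0.5
```

In the small population a single draw shifts the remaining proportion from 0.5 to about 0.44, whereas in the large population the shift is negligible, which is why sampling with replacement is a reasonable way to mimic sampling from a very large population.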

9.2.1 Setting the “seed” for random sampling

Functions such as sample and slice_sample have a challenge to meet: they must be able to take samples that are as random as possible. On a computer, this is easier said than done; special algorithms are required to meet the challenge. R has some pretty good ones available, so we don’t have to worry about it (though to be honest, they can only generate “pseudorandom” numbers, but that’s good enough for our purposes!).

However, we DO need to worry about computational reproducibility.

Imagine we have authored a script for a research project, and the script makes use of functions such as sample. Now imagine that someone else wanted to re-run our analyses on their own computer. Any code chunks that included functions such as sample would produce different results for them, because the function takes a different random sample each time it’s run!

Thankfully there’s a way to ensure that scripts implementing random sampling can be computationally reproducible: we can use the set.seed function.

This assumes you are using R version 4 or later (e.g. 4.1). The default algorithm that functions such as sample use to draw values changed in R version 3.6.0, so older versions of R can return different samples for the same seed.

Let’s show the code chunk first and explain it after:

set.seed(12)

The set.seed function requires an integer value as an argument. You can pick any number you want as the seed number. Here we’ll use 12, and if you do too, you’ll get the same results as in the tutorial.

Let’s test this using the runif function, which generates random numbers between user-designated minimum and maximum values (the default is numbers between zero and one).

Let’s include the set.seed function right before using the runif function, and here we’ll again use a seed number of 12.

set.seed(12)
runif(3)

## [1] 0.06936092 0.81777520 0.94262173

Here we told the runif function to generate three random numbers, and it used the default minimum and maximum values of zero and one.
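To designate your own minimum and maximum, you can supply the min and max arguments of runif. The values 10 and 20 below, and the object name x, are arbitrary choices for illustration.

```r
set.seed(12)

# Three random numbers between 10 and 20 instead of the default 0 and 1
x <- runif(3, min = 10, max = 20)
x
```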

Now let’s try a different seed:

set.seed(25)
runif(3)

## [1] 0.4161184 0.6947637 0.1488006

So long as you used the same seed number, you should have gotten the same three random numbers as shown above.
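The same idea applies to the sample function. As a quick check (a sketch, with made-up object names), you can reset the seed before each of two identical calls and confirm that the draws match:

```r
set.seed(12)
first_draw <- sample(1:6, size = 10, replace = TRUE)

# Resetting the seed to the same number restarts the same sequence
set.seed(12)
second_draw <- sample(1:6, size = 10, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```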

When you are authoring your own script or markdown document (e.g. for a research project), it is advisable to set the seed only once, at the beginning of your script or markdown document. When you run or knit your completed script/markdown, the seed will be set at the beginning, the code chunks will run in sequence, and each chunk that uses a random number generator will do so in a predictable way (based on the seed number). The script will be computationally reproducible.

In contrast, when we’re trying out code (e.g. when working on tutorial material), we aren’t running a set sequence or number of code chunks, so we can’t rely on everyone getting the same output for a particular code chunk. For this reason you’ll see in the tutorial material that many of the code chunks that use a random number generator include a set.seed statement, to ensure that everyone gets the same output from that specific code chunk (this isn’t always required, but often it is).
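To see why a single seed at the top is enough, the sketch below uses a hypothetical function, run_script, as a stand-in for running a whole script from start to finish. The function and the object names inside it are made up for illustration.

```r
# Hypothetical stand-in for a whole script: the seed is set once,
# then the "chunks" run in a fixed order
run_script <- function() {
  set.seed(12)                                      # set once, at the top
  chunk_1 <- runif(3)                               # first chunk using random numbers
  chunk_2 <- sample(1:6, size = 5, replace = TRUE)  # a later chunk
  list(chunk_1 = chunk_1, chunk_2 = chunk_2)
}

run_1 <- run_script()
run_2 <- run_script()

identical(run_1, run_2)  # TRUE: the full sequence of results repeats
```

The second chunk does not need its own set.seed call: because the chunks always run in the same order after the same seed, its output is reproducible too.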

Set the seed number to 200, then generate 10 random numbers using the runif function.