## 9.2 Functions for sampling

To illustrate the concepts of “sampling error” and the “sampling distribution of the mean”, we’re going to make use of datasets that include measurements for all individuals in a “population” of interest, and we’re going to take random samples of observations from those datasets. This mimics the process of sampling at random from a population.

There are two functions that are especially helpful for taking random samples from objects.

- The `sample` function in the base R package is easy to use for taking a random sample from a **vector** of values. Look at the help function for it:

`?sample`

The second useful function is the `slice_sample` function from the `dplyr` package (loaded with `tidyverse`). This function takes random samples of rows from a **dataframe** or **tibble**.

`?slice_sample`
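As a minimal sketch of the two functions side by side (the `heights` vector and the `birds` tibble are made-up illustration data, not datasets used in this tutorial):

```r
library(dplyr)

# Hypothetical data, for illustration only
heights <- c(150, 162, 171, 158, 180, 167)
birds <- tibble(
  id = 1:6,
  mass_g = c(21.5, 19.8, 23.1, 20.4, 22.7, 18.9)
)

# sample() draws individual values from a vector
height_sample <- sample(heights, size = 3)

# slice_sample() draws whole rows from a data frame or tibble
bird_sample <- slice_sample(birds, n = 3)

height_sample
bird_sample
```

Because no seed is set here, the particular values and rows you get will differ from run to run.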

Note that both functions have a `replace` argument that dictates whether the sampling we conduct occurs with or without replacement. For instance, the following code will not work (try copying and pasting into your console):

`sample(1:6, size = 10, replace = FALSE)`

Here we’re telling R to sample 10 values from a vector that includes only 6 values (one through six), and with the `replace = FALSE` argument we told R not to put the sampled numbers back in the pile for future sampling. R therefore runs out of values to draw and returns an error.

If instead we specified `replace = TRUE`, as shown below, then we would be conducting *sampling with replacement*: sampled values are placed back in the pool each time, which enables the code to work (try it):

`sample(1:6, size = 10, replace = TRUE)`
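If you would like to see both behaviours in a single chunk without the error halting your script, one option is to wrap the failing call in `tryCatch`. This is only a sketch; the object names `result` and `rolls` are arbitrary choices.

```r
# Without replacement: asking for 10 values from a pool of 6 raises an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  sample(1:6, size = 10, replace = FALSE),
  error = function(e) conditionMessage(e)
)
result

# With replacement: the same request succeeds and returns 10 values
rolls <- sample(1:6, size = 10, replace = TRUE)
length(rolls)
```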

Whether we wish to sample *with* or *without replacement* depends on the question and context (see page 138 in Whitlock & Schluter). For most of the questions and contexts we deal with in this course, we do wish to sample **with** replacement. Why? Because we are typically assuming (or mimicking a situation) that we’re randomly sampling from a very large population, in which case the probability distribution of possible values among the individuals that remain is effectively unchanged by each draw. This is not the case if we’re sampling from a small population.
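To make the "large versus small population" point concrete, here is a small sketch using two made-up populations (the names `small_pop`, `large_pop` and the helper `prop_red_after` are hypothetical). Each population starts out half "red"; we remove a single "red" individual and look at the proportion of "red" among those that remain.

```r
# Two hypothetical populations, each 50% "red" to begin with
small_pop <- c(rep("red", 2), rep("blue", 2))
large_pop <- c(rep("red", 5000), rep("blue", 5000))

# Remove one "red" individual (as if sampled without replacement),
# then return the proportion of "red" among the remaining individuals
prop_red_after <- function(pop) {
  remaining <- pop[-match("red", pop)]
  mean(remaining == "red")
}

prop_red_after(small_pop)  # 1/3: a big shift away from 0.5
prop_red_after(large_pop)  # 4999/9999: still very close to 0.5
```

In the large population, removing one individual barely changes the remaining proportions, which is why sampling with replacement is a reasonable stand-in for it.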

### 9.2.1 Setting the “seed” for random sampling

Functions such as `sample` and `slice_sample` have a challenge to meet: they must be able to take samples that are as random as possible. On a computer, this is easier said than done; special algorithms are required to meet the challenge. R has some pretty good ones available, so we don’t have to worry about it (though to be honest, they can only generate “pseudorandom” numbers, but that’s good enough for our purposes!).

However, we DO need to worry about *computational reproducibility*.

Imagine we have authored a script for a research project, and the script makes use of functions such as `sample`. Now imagine that someone else wanted to re-run our analyses on their own computer. Any code chunks that included functions such as `sample` would produce different results for them, because the function takes a different random sample each time it’s run!

Thankfully there’s a way to ensure that scripts implementing random sampling can be computationally reproducible: we can use the `set.seed` function.

This assumes you are using R version 4 or later (e.g. 4.1). Versions of R prior to 3.6.0 used a different default algorithm for `sample`, so the same seed can produce different samples in those older versions.

Let’s provide the code chunk then explain after:

`set.seed(12)`

The `set.seed` function requires an integer value as an argument. You can pick any number you want as the seed number. Here we’ll use 12, and if you do too, you’ll get the same results as in the tutorial.

Let’s test this using the `runif` function, which generates random numbers between user-designated minimum and maximum values (the default is numbers between zero and one).

Let’s include the `set.seed` function right before using the `runif` function, and here we’ll again use a seed number of 12.

```
set.seed(12)
runif(3)
```

`## [1] 0.06936092 0.81777520 0.94262173`

Here we told the `runif` function to generate three random numbers, and it used the default minimum and maximum values of zero and one.
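If you want numbers from a different range, `runif` accepts `min` and `max` arguments. A brief sketch, where the range of 10 to 20 and the object name `x` are arbitrary choices for illustration:

```r
set.seed(12)

# Five random numbers between 10 and 20, instead of the default 0 and 1
x <- runif(5, min = 10, max = 20)
x
```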

Now let’s try a different seed:

```
set.seed(25)
runif(3)
```

`## [1] 0.4161184 0.6947637 0.1488006`

So long as you used the same seed number, you should have gotten the same three random numbers as shown above.
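One way to check this for yourself, rather than comparing numbers by eye, is to store the output of two runs that use the same seed and compare them with `identical`. The object names `first_draw` and `second_draw` are arbitrary.

```r
# Same seed, same call: the two draws should match exactly
set.seed(12)
first_draw <- runif(3)

set.seed(12)
second_draw <- runif(3)

identical(first_draw, second_draw)  # TRUE
```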

When you are authoring your own script or markdown document (e.g. for a research project), it is advisable to only set the seed once, at the beginning of your script or markdown document. When you run or knit your completed script/markdown, the seed will be set at the beginning, the code chunks will run in sequence, and each chunk that uses a random number generator will do so in a predictable way (based on the seed number). The script will be computationally reproducible. In contrast, when we’re trying out code (e.g. when working on tutorial material), we aren’t running a set sequence or number of code chunks, so we can’t rely on everyone getting the same output for a particular code chunk. For this reason you’ll see in the tutorial material that many of the code chunks that require use of a random number generator will include a `set.seed` statement, to ensure that everyone gets the same output from that specific code chunk (this isn’t always required, but often it is).
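The "set the seed once at the top" advice can be sketched as follows. Here a hypothetical function, `run_analysis`, stands in for a whole script: the seed is set once, and two later steps both draw on the random number generator. Running the whole sequence twice gives identical results.

```r
# Hypothetical stand-in for a complete script: the seed is set once, at the top
run_analysis <- function() {
  set.seed(12)
  # First step that uses the random number generator
  die_rolls <- sample(1:6, size = 5, replace = TRUE)
  # Second step continues along the same pseudorandom sequence
  noise <- runif(3)
  list(die_rolls = die_rolls, noise = noise)
}

first_run <- run_analysis()
second_run <- run_analysis()

identical(first_run, second_run)  # TRUE
```

Note that the results are reproducible only when the steps run in the same order from the start, which is why interactive work on single chunks calls for setting the seed within the chunk.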

- Generate 10 random numbers using the `runif` function, and first set the seed number to 200.