## 11.2 Estimating proportions

Recall that the key descriptor for a categorical variable is a **proportion**. And when we wish to draw inferences about a categorical attribute within a population of interest, we take a random sample from the population to *estimate* the true proportion \({p}\) of the population with that attribute. Our estimate is denoted \(\hat{p}\).

In a previous tutorial we used the “birds” dataset to practice creating a “frequency table” that showed the *relative frequencies* of birds falling in each category of the categorical variable “type”. These relative frequencies are equivalent to their respective proportions.

So, for the birds dataset, let’s produce a frequency table, but instead of using the term “relative frequency” we’ll use “p_hat” to denote that this is our estimate of the proportion of birds in that category.

We’ll put the output in an object “bird.freq.table” then we’ll display it after:

```
<- birds %>%
bird.freq.table count(type, sort = TRUE) %>%
mutate(p_hat = n / sum(n))
```

Now show the table:

`kable(bird.freq.table, digits = 4)`

type | n | p_hat |
---|---|---|

Waterfowl | 43 | 0.5000 |

Predatory | 29 | 0.3372 |

Shorebird | 8 | 0.0930 |

Songbird | 6 | 0.0698 |

We can see that the proportion of birds at the marsh that belong to the “Predatory” category is about 0.34. Assuming we had a good random sample, this is our best estimate of the true proportion of marsh birds that are predatory.

Being an estimate, however, we need to attach some measure of uncertainty to it.

In a previous tutorial we learned how to calculate the standard error and rule-of-thumb 95% confidence interval for the **mean**.

Here we’ll learn how to calculate these measures of uncertainty for a **proportion**.

### 11.2.1 Standard error for a proportion

Here’s the equation for the standard error for the proportion:

\[SE_\hat{p} = \sqrt\frac{\hat{p}(1-\hat{p})}{n}\]

Let’s use the “birds” dataset to demonstrate the calculation.

Let’s re-create the table again:

```
<- birds %>%
bird.freq.table count(type, sort = TRUE) %>%
mutate(p_hat = n / sum(n))
```

Show the table:

`kable(bird.freq.table, digits = 4)`

type | n | p_hat |
---|---|---|

Waterfowl | 43 | 0.5000 |

Predatory | 29 | 0.3372 |

Shorebird | 8 | 0.0930 |

Songbird | 6 | 0.0698 |

That table does show us our estimate of the proportion of birds that are predatory. We’ll isolate that value later.

But we also need to get the total sample size, i.e. the total number of birds sampled.

To do this, we need to make sure not to count any missing values, and we can use the `n_complete`

function from the `naniar`

package:

```
<- birds %>%
birds.sampsize select(type) %>%
n_complete()
```

In the preceding code:

- We first ensure the output will be put in a new object called “birds.sampsize”
- We provide the tibble
`birds`

, which is what is fed to the subsequent code - We then use
`select`

to select the variable of interest “type” - We then use the
`n_complete`

function to get the total number of non-missing observations in the variable.

Now let’s see what the outcome was:

` birds.sampsize`

`## [1] 86`

In previous tutorials we used a bit more cumbersome way to calculate the total number of complete observations in a variable. The `n_complete`

function will simplify things.

Now that we have *n* (above), stored in the object “birds.sampsize”, we need to extract the “p_hat” value associated with the “Predatory” category, and we’ll store this in an object called “pred.phat”:

```
<- bird.freq.table %>%
pred.phat filter(type == "Predatory") %>%
select(p_hat)
```

In the preceding chunk we:

- assign our output to an object “pred.phat”
- use the
`filter`

function to get the rows where the “type” variable is equal to the “Predatory” category - use the
`select`

function to return (select) the “p_hat” variable only

Let’s look at what this produced:

` pred.phat`

```
## # A tibble: 1 × 1
## p_hat
## <dbl>
## 1 0.3372
```

So this produced a tibble called “pred.phat” with a variable “p_hat”, and it has one value - our proportion estimate \(\hat{p}\).

**TIP**: We could have calculated and isolated the “P-hat” value for predatory birds all in one go, using the following code:

```
<- birds %>%
pred.phat count(type, sort = TRUE) %>%
mutate(p_hat = n / sum(n)) %>%
filter(type == "Predatory") %>%
select(p_hat)
```

That demonstrates the power of the pipe approach!

OK, now we’re ready to calculate \(SE_\hat{p} = \sqrt\frac{\hat{p}(1-\hat{p})}{n}\)

Here’s how we do this using basic R syntax, recalling that our “n” (86 birds total) is stored in the object “birds.sampsize”.

We’ll store the value in a new object called “SE_phat”:

`<- sqrt(pred.phat * (1 - pred.phat) / birds.sampsize) SE_phat `

And now see what the value is:

` SE_phat`

```
## p_hat
## 1 0.0509787
```

**Reporting the proportion and standard error**

To properly report a proportion estimate along with its standard error, we need to know how to produce a “plus/minus” symbol in markdown.

Here’s what you type in the regular text area of your markdown document (not in a code chunk).

`$\pm$`

The `$\pm$`

is the syntax for a plus-minus symbol. More symbols can be found at this website.

And so you insert that text in between your estimate \(\hat{p}\) and the \(SE_\hat{p}\), as such: 0.34 \(\pm\) 0.051.

### 11.2.2 Confidence interval for a proportion

Although a number of options are available for calculating a confidence interval for a proportion, we’ll use the *Agresti-Coull* method, as it has desirable properties:

\[p' \pm 1.96\cdot \sqrt\frac{p'(1-p')}{N+4}\] where \[p' = \frac{X+2}{n+4}\]

The **margin of error** from the above equation is this part:

\[1.96\cdot \sqrt\frac{p'(1-p')}{N+4}\]

This margin of error is the value that we *subtract* to our proportion estimate \(\hat{p}\) to get the **lower 95% confidence limit**, and we *add* that value to our proportion estimate \(\hat{p}\) to get the **upper 95% confidence limit**.

We’ll again use the “birds” data, and the estimated proportion of birds that are predatory.

Let’s again calculate the sample size of birds:

```
<- birds %>%
birds.sampsize select(type) %>%
n_complete()
```

And here’s the frequency table we constructed:

```
<- birds %>%
bird.freq.table count(type, sort = TRUE) %>%
mutate(p_hat = n / sum(n))
```

Show the table:

` bird.freq.table`

```
## # A tibble: 4 × 3
## type n p_hat
## <chr> <int> <dbl>
## 1 Waterfowl 43 0.5
## 2 Predatory 29 0.3372
## 3 Shorebird 8 0.09302
## 4 Songbird 6 0.06977
```

Looking at the formula for the *Agresti-Coull* confidence interal, we see that we need *X* - the number of “successes”, which in our case is the number of birds in the category “Predatory”. This is provided in our table above. So let’s extract that information.

We’ll store our value of “X” in a new object called “bird.X”:

```
<- bird.freq.table %>%
bird.X filter(type == "Predatory") %>%
select(n)
```

Now let’s see the output:

` bird.X`

```
## # A tibble: 1 × 1
## n
## <int>
## 1 29
```

Unlike the standard error calculation, for which we did actual math using R, for the confidence interval we’ll make use of the `binom.confint`

function from the `binom`

package. Have a look at the help page:

`?binom.confint`

The function takes a handful of arguments, including the value of *X*, the number of trials *n*, the confidence level we wish to use (typically 0.95 corresponding to a 95% confidence interval), and the method one wishes to use.

Let’s try it out using our bird data.

The one catch in the code below is that we can’t simply provide the name of our “bird.X” object to provide the value of *X* for the `binom.confint`

function; instead we need to specify that the value we want to use is stored in the variable called “n” within the “bird.X” object. Thus, we use `bird.X$n`

. The dollar sign allows us to specify a specific variable within the tibble.

Then, we use “ac” to specify the “Agresti-Coull” methods.

We’ll assing our output to a new object called “confint.results”:

`<- binom.confint(x = bird.X$n, n = birds.sampsize, conf.level = 0.95, methods = "ac") confint.results `

Have a look at the output using the `kable`

function to make it nicer looking:

`kable(confint.results, digits = 4)`

method | x | n | mean | lower | upper |
---|---|---|---|---|---|

agresti-coull | 29 | 86 | 0.3372 | 0.2459 | 0.4424 |

The function provides us with the values of *X*, *n*, the proportion estimate (strangely called “mean”), and the lower and upper confidence limits.

**Reporting the confidence interval for a proportion**

The appropriate way to report the confidence interval is as follows:

The 95% Agresti-Coull confidence interval is: 0.246 \(< {p} <\) 0.442.

- What is the best estimate of the proportion of the marsh birds that belong to the “Shorebird” category? What is the standard error of the proportion, and the 95% Agresti-Coull confidence interval?