11.2 Estimating proportions
Recall that the key descriptor for a categorical variable is a proportion. And when we wish to draw inferences about a categorical attribute within a population of interest, we take a random sample from the population to estimate the true proportion \({p}\) of the population with that attribute. Our estimate is denoted \(\hat{p}\).
In a previous tutorial we used the “birds” dataset to practice creating a “frequency table” that showed the relative frequencies of birds falling in each category of the categorical variable “type”. These relative frequencies are equivalent to their respective proportions.
So, for the birds dataset, let’s produce a frequency table, but instead of using the term “relative frequency” we’ll use “p_hat” to denote that this is our estimate of the proportion of birds in that category.
We’ll put the output in an object “bird.freq.table” then we’ll display it after:
Now show the table:
type | n | p_hat |
---|---|---|
Waterfowl | 43 | 0.5000 |
Predatory | 29 | 0.3372 |
Shorebird | 8 | 0.0930 |
Songbird | 6 | 0.0698 |
We can see that the proportion of birds at the marsh that belong to the “Predatory” category is about 0.34. Assuming we had a good random sample, this is our best estimate of the true proportion of marsh birds that are predatory.
Being an estimate, however, we need to attach some measure of uncertainty to it.
In a previous tutorial we learned how to calculate the standard error and rule-of-thumb 95% confidence interval for the mean.
Here we’ll learn how to calculate these measures of uncertainty for a proportion.
11.2.1 Standard error for a proportion
Here’s the equation for the standard error for the proportion:
\[SE_\hat{p} = \sqrt\frac{\hat{p}(1-\hat{p})}{n}\]
Let’s use the “birds” dataset to demonstrate the calculation.
Let’s re-create the table again:
Show the table:
type | n | p_hat |
---|---|---|
Waterfowl | 43 | 0.5000 |
Predatory | 29 | 0.3372 |
Shorebird | 8 | 0.0930 |
Songbird | 6 | 0.0698 |
That table does show us our estimate of the proportion of birds that are predatory. We’ll isolate that value later.
But we also need to get the total sample size, i.e. the total number of birds sampled.
To do this, we need to make sure not to count any missing values, and we can use the n_complete
function from the naniar
package:
In the preceding code:
- We first ensure the output will be put in a new object called “birds.sampsize”
- We provide the tibble
birds
, which is what is fed to the subsequent code - We then use
select
to select the variable of interest “type” - We then use the
n_complete
function to get the total number of non-missing observations in the variable.
Now let’s see what the outcome was:
## [1] 86
In previous tutorials we used a bit more cumbersome way to calculate the total number of complete observations in a variable. The n_complete
function will simplify things.
Now that we have n (above), stored in the object “birds.sampsize”, we need to extract the “p_hat” value associated with the “Predatory” category, and we’ll store this in an object called “pred.phat”:
In the preceding chunk we:
- assign our output to an object “pred.phat”
- use the
filter
function to get the rows where the “type” variable is equal to the “Predatory” category - use the
select
function to return (select) the “p_hat” variable only
Let’s look at what this produced:
## # A tibble: 1 × 1
## p_hat
## <dbl>
## 1 0.3372
So this produced a tibble called “pred.phat” with a variable “p_hat”, and it has one value - our proportion estimate \(\hat{p}\).
TIP: We could have calculated and isolated the “P-hat” value for predatory birds all in one go, using the following code:
pred.phat <- birds %>%
count(type, sort = TRUE) %>%
mutate(p_hat = n / sum(n)) %>%
filter(type == "Predatory") %>%
select(p_hat)
That demonstrates the power of the pipe approach!
OK, now we’re ready to calculate \(SE_\hat{p} = \sqrt\frac{\hat{p}(1-\hat{p})}{n}\)
Here’s how we do this using basic R syntax, recalling that our “n” (86 birds total) is stored in the object “birds.sampsize”.
We’ll store the value in a new object called “SE_phat”:
And now see what the value is:
## p_hat
## 1 0.0509787
Reporting the proportion and standard error
To properly report a proportion estimate along with its standard error, we need to know how to produce a “plus/minus” symbol in markdown.
Here’s what you type in the regular text area of your markdown document (not in a code chunk).
$\pm$
The $\pm$
is the syntax for a plus-minus symbol. More symbols can be found at this website.
And so you insert that text in between your estimate \(\hat{p}\) and the \(SE_\hat{p}\), as such: 0.34 \(\pm\) 0.051.
11.2.2 Confidence interval for a proportion
Although a number of options are available for calculating a confidence interval for a proportion, we’ll use the Agresti-Coull method, as it has desirable properties:
\[p' \pm 1.96\cdot \sqrt\frac{p'(1-p')}{N+4}\] where \[p' = \frac{X+2}{n+4}\]
The margin of error from the above equation is this part:
\[1.96\cdot \sqrt\frac{p'(1-p')}{N+4}\]
This margin of error is the value that we subtract to our proportion estimate \(\hat{p}\) to get the lower 95% confidence limit, and we add that value to our proportion estimate \(\hat{p}\) to get the upper 95% confidence limit.
We’ll again use the “birds” data, and the estimated proportion of birds that are predatory.
Let’s again calculate the sample size of birds:
And here’s the frequency table we constructed:
Show the table:
## # A tibble: 4 × 3
## type n p_hat
## <chr> <int> <dbl>
## 1 Waterfowl 43 0.5
## 2 Predatory 29 0.3372
## 3 Shorebird 8 0.09302
## 4 Songbird 6 0.06977
Looking at the formula for the Agresti-Coull confidence interal, we see that we need X - the number of “successes”, which in our case is the number of birds in the category “Predatory”. This is provided in our table above. So let’s extract that information.
We’ll store our value of “X” in a new object called “bird.X”:
Now let’s see the output:
## # A tibble: 1 × 1
## n
## <int>
## 1 29
Unlike the standard error calculation, for which we did actual math using R, for the confidence interval we’ll make use of the binom.confint
function from the binom
package. Have a look at the help page:
?binom.confint
The function takes a handful of arguments, including the value of X, the number of trials n, the confidence level we wish to use (typically 0.95 corresponding to a 95% confidence interval), and the method one wishes to use.
Let’s try it out using our bird data.
The one catch in the code below is that we can’t simply provide the name of our “bird.X” object to provide the value of X for the binom.confint
function; instead we need to specify that the value we want to use is stored in the variable called “n” within the “bird.X” object. Thus, we use bird.X$n
. The dollar sign allows us to specify a specific variable within the tibble.
Then, we use “ac” to specify the “Agresti-Coull” methods.
We’ll assing our output to a new object called “confint.results”:
confint.results <- binom.confint(x = bird.X$n, n = birds.sampsize, conf.level = 0.95, methods = "ac")
Have a look at the output using the kable
function to make it nicer looking:
method | x | n | mean | lower | upper |
---|---|---|---|---|---|
agresti-coull | 29 | 86 | 0.3372 | 0.2459 | 0.4424 |
The function provides us with the values of X, n, the proportion estimate (strangely called “mean”), and the lower and upper confidence limits.
Reporting the confidence interval for a proportion
The appropriate way to report the confidence interval is as follows:
The 95% Agresti-Coull confidence interval is: 0.246 \(< {p} <\) 0.442.
- What is the best estimate of the proportion of the marsh birds that belong to the “Shorebird” category? What is the standard error of the proportion, and the 95% Agresti-Coull confidence interval?