7.2 Describing a categorical variable
The proportion is the most important descriptive statistic for a categorical variable. It measures the fraction of observations that fall in a given category of that variable.
For example, the birds.csv file has a single variable called type, which records the category (one of four) of each of 86 birds observed at a marsh habitat.
## # A tibble: 86 × 1
## type
## <chr>
## 1 Waterfowl
## 2 Predatory
## 3 Predatory
## 4 Waterfowl
## 5 Shorebird
## 6 Waterfowl
## 7 Waterfowl
## 8 Songbird
## 9 Predatory
## 10 Waterfowl
## # ℹ 76 more rows
The proportion of birds belonging to a given category is the same as the relative frequency of that category.
In a previous tutorial, using the tigerdeaths dataset, we learned how to create a frequency table that included relative frequencies.
Let’s use the same approach for the birds dataset. First we create the frequency table, then we display it with an appropriate heading:
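library(dplyr)    # provides %>%, count(), mutate()
library(janitor)  # provides adorn_totals()

birds.table <- birds %>%
  count(type, sort = TRUE) %>%                 # tally each category, most frequent first
  mutate(relative_frequency = n / sum(n)) %>%  # proportion = n_i / N
  adorn_totals()                               # append a "Total" row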
NOTE: If the categorical variable contains missing values (“NA”), the preceding code counts them and creates an “NA” category in the frequency table. Those missing values then also contribute to the total used to calculate the relative frequencies.
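To see what this means in practice, here is a minimal sketch using a small made-up data frame. The object `toy` and its values are hypothetical and are not taken from birds.csv. It assumes the dplyr package is installed, and contrasts keeping the NA category with dropping missing values before counting:

```r
library(dplyr)

# Hypothetical toy data: six observations, one of them missing
toy <- data.frame(
  type = c("Waterfowl", "Predatory", NA, "Waterfowl", "Songbird", "Waterfowl")
)

# count() keeps NA as its own category, so it is included in sum(n)
with_na <- toy %>%
  count(type, sort = TRUE) %>%
  mutate(relative_frequency = n / sum(n))

# Removing missing values first restricts the table to known categories
without_na <- toy %>%
  filter(!is.na(type)) %>%
  count(type, sort = TRUE) %>%
  mutate(relative_frequency = n / sum(n))

with_na
without_na
```

In `with_na` the missing value is part of the denominator, whereas in `without_na` the relative frequencies describe only the observations whose category is known. Which version is appropriate depends on the question being asked.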
Now display the table:
library(knitr)  # provides kable()

birds.table %>%
  kable(caption = "Frequency table showing the frequencies of each of four types of bird observed at a marsh habitat (N = 86)", digits = 3)
type | n | relative_frequency |
---|---|---|
Waterfowl | 43 | 0.500 |
Predatory | 29 | 0.337 |
Shorebird | 8 | 0.093 |
Songbird | 6 | 0.070 |
Total | 86 | 1.000 |
We can see, for example, that the proportion (relative frequency) of birds belonging to the “Predatory” category was 29/86 ≈ 0.337.
We calculate proportions (relative frequencies) using the simple formula:
\[\hat{p} = \frac{n_i}{N}\]
where \(n_i\) is the frequency of observations in the category of interest \(i\), and \(N\) is the total number of observations (sample size) across all categories.
Reminder: Proportions, and thus relative frequencies, must be between 0 and 1 (inclusive).
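As an illustration, the formula can be applied directly to the counts shown in the frequency table above. This is a sketch in base R. The object names `counts`, `N`, and `p_hat` are chosen here for illustration only:

```r
# Frequencies (n_i) copied from the frequency table above
counts <- c(Waterfowl = 43, Predatory = 29, Shorebird = 8, Songbird = 6)

# N is the total number of observations across all categories
N <- sum(counts)

# Proportion for every category at once: n_i / N
p_hat <- counts / N

# Rounded to 3 digits, these match the relative_frequency column
round(p_hat, 3)

# Sanity checks: each proportion lies in [0, 1], and together they sum to 1
all(p_hat >= 0 & p_hat <= 1)
isTRUE(all.equal(sum(p_hat), 1))
```

Because R divides the whole vector by `N` in one step, the same line computes the proportion for every category, and both sanity checks should return `TRUE` for any valid frequency table.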