7.2 Describing a categorical variable

The proportion is the most important descriptive statistic for a categorical variable. It measures the fraction of observations that fall in a given category of that variable.

For example, the birds.csv file has a single variable called type that records, for each of 86 birds observed at a marsh habitat, which of four categories the bird belongs to. The data are therefore stored one row per bird, not as ready-made tallies (frequencies).

birds
## # A tibble: 86 × 1
##    type     
##    <chr>    
##  1 Waterfowl
##  2 Predatory
##  3 Predatory
##  4 Waterfowl
##  5 Shorebird
##  6 Waterfowl
##  7 Waterfowl
##  8 Songbird 
##  9 Predatory
## 10 Waterfowl
## # ℹ 76 more rows

The proportion of birds belonging to a given category is the same as the relative frequency of birds belonging to a given category.
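As a quick check that the two terms describe the same quantity, here is a minimal base R sketch. It rebuilds a stand-in for the type column from the category counts reported later in this section (the row order of the real file is not reproduced, which does not affect proportions). The object names type_vec, freq, rel_freq and manual are illustrative and are not part of the tutorial.

```r
# Stand-in for birds$type, rebuilt from the category counts in this section
# (assumption: row order is irrelevant when computing proportions)
type_vec <- rep(c("Waterfowl", "Predatory", "Shorebird", "Songbird"),
                times = c(43, 29, 8, 6))

freq <- table(type_vec)        # frequency of each category
rel_freq <- prop.table(freq)   # each frequency divided by the total count

# Computing the proportion by hand gives the same numbers
manual <- freq / length(type_vec)
all.equal(as.numeric(rel_freq), as.numeric(manual))

round(rel_freq, 3)
```

Note that table() orders categories alphabetically, so the rows will not appear in the same order as a table sorted by frequency.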

In a previous tutorial, using the tigerdeaths dataset, we learned how to create a frequency table that included relative frequencies.

Let’s use the same approach for the birds dataset. First we create the frequency table, then we display the table with an appropriate heading:

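# count() tallies each bird type (most frequent first),
# mutate() adds each type's share of the total,
# and adorn_totals() (janitor package) appends a "Total" row
birds.table <- birds %>%
  count(type, sort = TRUE) %>%
  mutate(relative_frequency = n / sum(n)) %>%
  adorn_totals()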

NOTE If there are missing values (“NA”) in the categorical variable, the preceding code will successfully enumerate those and create an “NA” category in the frequency table.
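To illustrate this note, here is a minimal sketch on a small made-up tibble (toy is hypothetical and is not part of birds.csv). It shows count() giving NA its own row, and one possible way to drop missing values first if proportions among the non-missing observations are wanted. Which denominator is appropriate depends on the question being asked.

```r
library(dplyr)

# Hypothetical toy data with one missing value (not from birds.csv)
toy <- tibble(type = c("Waterfowl", "Predatory", NA, "Waterfowl", "Shorebird"))

# count() keeps NA as its own category, so it is part of the denominator
with_na <- toy %>%
  count(type, sort = TRUE) %>%
  mutate(relative_frequency = n / sum(n))

# Dropping NA first gives proportions among the non-missing observations only
without_na <- toy %>%
  filter(!is.na(type)) %>%
  count(type, sort = TRUE) %>%
  mutate(relative_frequency = n / sum(n))

with_na
without_na
```

In the first table the relative frequencies are out of 5 observations, and in the second they are out of 4.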

Now display the table:

birds.table %>%
  kable(caption = "Frequency table showing the frequencies of each of four types of bird observed at a marsh habitat (N = 86)", digits = 3)

Table 7.1: Frequency table showing the frequencies of each of four types of bird observed at a marsh habitat (N = 86)

type        n    relative_frequency
Waterfowl   43   0.500
Predatory   29   0.337
Shorebird    8   0.093
Songbird     6   0.070
Total       86   1.000

We can see, for example, that the proportion (relative frequency) of birds belonging to the “Predatory” category was 0.337 (29 of the 86 birds).

We calculate proportions (relative frequencies) using the simple formula:

\[\hat{p} = \frac{n_i}{N}\]

where \(n_i\) is the frequency of observations in the category of interest \(i\), and \(N\) is the total number of observations (sample size) across all categories.
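As a worked instance of the formula, the sketch below plugs in the counts from Table 7.1. The names n_i, N and p_hat simply mirror the symbols in the formula and are not objects created elsewhere in the tutorial.

```r
# Frequencies n_i for each category, copied from Table 7.1
n_i <- c(Waterfowl = 43, Predatory = 29, Shorebird = 8, Songbird = 6)

# N is the total number of observations across all categories
N <- sum(n_i)

# Proportion for each category: p_hat = n_i / N
p_hat <- n_i / N

round(p_hat, 3)
p_hat[["Predatory"]]   # 29 / 86
```

Rounded to three decimal places, these values should match the relative_frequency column of Table 7.1.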

Reminder Proportions, and thus relative frequencies, must be between 0 and 1.