16.1 Load packages and import data
Load the tidyverse
, knitr
, naniar
, car
, skimr
, and broom
packages:
For this tutorial we’ll use the “circadian.csv” dataset. These data are associated with Example 15.1 in the text.
The circadian
data describe melatonin production in 22 people randomly assigned to one of three light treatments.
circadian <- read_csv("https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/circadian.csv")
## Rows: 22 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): treatment
## dbl (1): shift
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
TIP: When analyzing a numeric response variable in relation to a categorical variable, it is good practice to ensure that the categorical variable is treated as a “factor” type variable by R, and that the categories or “levels” of the factor are ordered in the correct way for the purpose of graphs and / or tables of descriptive statistics. For example, there is often a “control” group (category), and if so, this group should come first in a graph or table.
Let’s follow this practice for the “circadian” dataset.
If we look at example 15.1 in the text, and the corresponding figure (Fig. 15.1), we see that there should be three treatment groups “Control”, “Knees”, and “Eyes”, in that order.
Let’s have a look at the data.
Name | Piped data |
Number of rows | 22 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
treatment | 0 | 1 | 4 | 7 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
shift | 0 | 1 | -0.71 | 0.89 | -2.83 | -1.33 | -0.66 | -0.05 | 0.73 |
And also have a look at the tibble itself:
## # A tibble: 22 × 2
## treatment shift
## <chr> <dbl>
## 1 control 0.53
## 2 control 0.36
## 3 control 0.2
## 4 control -0.37
## 5 control -0.6
## 6 control -0.64
## 7 control -0.68
## 8 control -1.27
## 9 knee 0.73
## 10 knee 0.31
## # ℹ 12 more rows
The data are stored in tidy (long) format - good!
We can see the unique values of the categorical variable as follows:
## # A tibble: 3 × 1
## treatment
## <chr>
## 1 control
## 2 knee
## 3 eyes
Two important things to notice here:
- The categorical variable “treatment” is recognized as a “character” variable (“chr”) instead of a “factor” variable,
- The category names are not capitalized
Let’s recode this using the method we previously learned in the 2-sample t-test tutorial, using the recode_factor
function from the dplyr
package (loaded as part of the tidyverse
).
Be sure to provide the variables in the order that you want them to appear in any tables, figures, and analyses.
circadian <- circadian %>%
mutate(treatment = recode_factor(treatment, control = "Control", knee = "Knee", eyes = "Eyes", ))
Have a look:
## [1] Control Control Control Control Control Control Control Control Knee
## [10] Knee Knee Knee Knee Knee Knee Eyes Eyes Eyes
## [19] Eyes Eyes Eyes Eyes
## Levels: Control Knee Eyes
OK, so not only have we successfully changed the type of variable to factor, and capitalized the category names, but it turns out that the ordering of the factor “levels” is in the correct order, i.e. consistent with how it’s displayed in the text figure in example 15.1.
Now we’re ready to proceed with ANOVA!