16.1 Load packages and import data

Load the tidyverse, knitr, naniar, car, skimr, and broom packages:

library(tidyverse)
library(knitr)
library(naniar)
library(skimr)
library(broom)
library(car)

For this tutorial we’ll use the “circadian.csv” dataset. These data are associated with Example 15.1 in the text.

The circadian data describe melatonin production in 22 people randomly assigned to one of three light treatments.

circadian <- read_csv("https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/circadian.csv")
## Rows: 22 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): treatment
## dbl (1): shift
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

TIP: When analyzing a numeric response variable in relation to a categorical variable, it is good practice to ensure that the categorical variable is treated as a “factor” type variable by R, and that the categories or “levels” of the factor are ordered in the correct way for the purpose of graphs and / or tables of descriptive statistics. For example, there is often a “control” group (category), and if so, this group should come first in a graph or table.

Let’s follow this practice for the “circadian” dataset.

If we look at example 15.1 in the text, and the corresponding figure (Fig. 15.1), we see that there should be three treatment groups “Control”, “Knees”, and “Eyes”, in that order.

Let’s have a look at the data.

circadian %>%
  skim_without_charts()
(#tab:anova_lookdata)Data summary
Name Piped data
Number of rows 22
Number of columns 2
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
treatment 0 1 4 7 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
shift 0 1 -0.71 0.89 -2.83 -1.33 -0.66 -0.05 0.73

And also have a look at the tibble itself:

circadian
## # A tibble: 22 × 2
##    treatment shift
##    <chr>     <dbl>
##  1 control    0.53
##  2 control    0.36
##  3 control    0.2 
##  4 control   -0.37
##  5 control   -0.6 
##  6 control   -0.64
##  7 control   -0.68
##  8 control   -1.27
##  9 knee       0.73
## 10 knee       0.31
## # ℹ 12 more rows

The data are stored in tidy (long) format - good!

We can see the unique values of the categorical variable as follows:

circadian %>%
  select(treatment) %>%
  unique()
## # A tibble: 3 × 1
##   treatment
##   <chr>    
## 1 control  
## 2 knee     
## 3 eyes

Two important things to notice here:

  • The categorical variable “treatment” is recognized as a “character” variable (“chr”) instead of a “factor” variable,
  • The category names are not capitalized

Let’s recode this using the method we previously learned in the 2-sample t-test tutorial, using the recode_factor function from the dplyr package (loaded as part of the tidyverse).

Be sure to provide the variables in the order that you want them to appear in any tables, figures, and analyses.

circadian <- circadian %>%
  mutate(treatment = recode_factor(treatment, control = "Control", knee = "Knee", eyes = "Eyes", ))

Have a look:

circadian$treatment
##  [1] Control Control Control Control Control Control Control Control Knee   
## [10] Knee    Knee    Knee    Knee    Knee    Knee    Eyes    Eyes    Eyes   
## [19] Eyes    Eyes    Eyes    Eyes   
## Levels: Control Knee Eyes

OK, so not only have we successfully changed the type of variable to factor, and capitalized the category names, but it turns out that the ordering of the factor “levels” is in the correct order, i.e. consistent with how it’s displayed in the text figure in example 15.1.

Now we’re ready to proceed with ANOVA!