9.1 Load packages and import data

Let’s load some familiar packages first:

library(tidyverse)
library(naniar)
library(knitr)
library(skimr)

We also need a new package called infer. Install it using the procedure you learned previously, then load it:

library(infer)
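If library(infer) reports that the package cannot be found, it has not been installed yet. The following is a minimal sketch of one way to install it only when it is missing. It assumes installation from CRAN and an internet connection; the repos URL (the CRAN cloud mirror) is an assumption, not something specified in this tutorial, and the procedure you learned earlier works just as well.

```r
# Install infer from CRAN only if it is not already available.
# The mirror URL is an assumption; any CRAN mirror can be substituted.
if (!requireNamespace("infer", quietly = TRUE)) {
  install.packages("infer", repos = "https://cloud.r-project.org")
}

# Load the package once it is installed.
library(infer)
```

The requireNamespace() check avoids reinstalling the package every time the script is run.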

Import data

For this tutorial we’ll use the human gene length dataset from Chapter 4 of the Whitlock & Schluter text, where it is described in Example 4.1.

Let’s import it:

genelengths <- read_csv("https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv")
## Rows: 22385 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gene, name, description
## dbl (1): size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
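The message printed above suggests two ways to handle the column specification. As a minimal sketch, assuming a readr version that supports the show_col_types argument (the one named in the message), the import can be repeated quietly and the specification retrieved afterwards with spec():

```r
library(readr)

# Same URL as in the tutorial; show_col_types = FALSE suppresses the
# column-specification message during import.
genelengths <- read_csv(
  "https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv",
  show_col_types = FALSE
)

# Retrieve the full column specification on demand.
spec(genelengths)
```

Silencing the message is optional. It is mostly useful in scripts and reports where the specification would clutter the output.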

Get an overview of the data

We’ll use the skim_without_charts() function from the skimr package to get an overview:

genelengths %>%
  skim_without_charts()
Data summary

Name                      Piped data
Number of rows            22385
Number of columns         4
Column type frequency:
  character               3
  numeric                 1
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
gene                   0           1.00   17   18      0     22385           0
name                   0           1.00    2   15      0     19906           0
description          432           0.98    1   51      0      4183           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd  p0   p25   p50   p75    p100
size                   0              1  3511.46  2833.29  69  1684  2744  4511  109224

The “size” variable is the key one: it contains the length, in number of nucleotides, of each of the 22,385 genes in the dataset.
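Because “size” is the variable of interest, it can be useful to compute its summary statistics directly as a cross-check on the overview table. The sketch below is one way to do this with dplyr. The object name size_summary and the column names inside summarise() are illustrative choices, not part of the original tutorial. The data are re-imported so that the block runs on its own.

```r
library(readr)
library(dplyr)

# Re-import the dataset (same URL as above) without the column message.
genelengths <- read_csv(
  "https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv",
  show_col_types = FALSE
)

# Summarise the gene lengths: count, mean, standard deviation, and median.
size_summary <- genelengths %>%
  summarise(
    n_genes     = n(),
    mean_size   = mean(size),
    sd_size     = sd(size),
    median_size = median(size)
  )

size_summary
```

The values should agree with the “size” row of the skim_without_charts() output, where the median appears as p50.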