9.1 Load packages and import data

Let’s load some familiar packages first:

library(tidyverse)
library(naniar)
library(knitr)
library(skimr)

We will also need a new package called infer. Install it using the procedure you learned previously, then load it:

library(infer)
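
If you are not sure whether infer is already installed, one option is to check first and install only when needed. This is a minimal sketch and not necessarily the procedure covered earlier; it assumes you are online and that a CRAN mirror is configured for your R session.

```r
# Install infer only if it is not already available, then load it.
# Assumption: a CRAN mirror is configured and an internet connection exists.
if (!requireNamespace("infer", quietly = TRUE)) {
  install.packages("infer")
}
library(infer)
```

Checking with requireNamespace() avoids re-downloading the package every time the script is run.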

Import data

For this tutorial we’ll use the human gene length dataset from Chapter 4 of the Whitlock & Schluter text, where it is described in Example 4.1.

Let’s import it:

genelengths <- read_csv("https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv")

## Rows: 22385 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gene, name, description
## dbl (1): size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
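
The message above is informational, not an error. As it suggests, you can either silence it or declare the column types yourself. The sketch below shows both options; the variable name gene_url is introduced here only for illustration, and the column names and types are copied from the column specification printed above.

```r
library(readr)

# Hypothetical helper variable holding the same URL used above
gene_url <- "https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv"

# Option 1: keep readr's automatic type guessing but silence the message
genelengths <- read_csv(gene_url, show_col_types = FALSE)

# Option 2: declare the column types explicitly
# (three character columns and one numeric column, as reported above)
genelengths <- read_csv(
  gene_url,
  col_types = cols(
    gene = col_character(),
    name = col_character(),
    description = col_character(),
    size = col_double()
  )
)
```

Declaring the types explicitly has the added benefit that the import fails loudly if the file ever changes in a way that breaks those expectations.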

Get an overview of the data

We’ll use skim_without_charts() from the skimr package to get an overview:

genelengths %>%
  skim_without_charts()

Data summary
Name	Piped data
Number of rows	22385
Number of columns	4
_______________________
Column type frequency:
character	3
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
gene	0	1.00	17	18	22385
name	0	1.00	2	15	19906
description	432	0.98	1	51	4183

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
size	0	1	3511.46	2833.29	69	1684	2744	4511	109224

The “size” variable is the key one: it holds the length (in nucleotides) of each of the 22385 genes in the dataset, with no missing values.
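
As an optional cross-check on the skim output, you could summarise the size variable directly. This is a rough sketch: the object name size_summary and the summary column names are made up for illustration, and the data are re-imported so the block runs on its own. The results should agree with the n_missing, p0, p50 and p100 columns reported above.

```r
library(tidyverse)

# Re-import the data so this block is self-contained (same URL as above)
genelengths <- read_csv(
  "https://raw.githubusercontent.com/ubco-biology/BIOL202/main/data/humangenelength.csv",
  show_col_types = FALSE
)

# Count the genes, check for missing lengths, and report
# the shortest, median, and longest gene length
size_summary <- genelengths %>%
  summarise(
    n_genes = n(),
    n_missing = sum(is.na(size)),
    shortest = min(size),
    median_size = median(size),
    longest = max(size)
  )

size_summary
```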