5.5 Get an overview of a dataset

When you import data, it is always a good idea to get an overview of it immediately.

Key questions you want to be able to answer are:

How many variables (columns) are there in the dataset?
How many observations (rows) are in the dataset?
Are there variables whose data are categorical? If so, which ones?
Are there variables whose data are numerical? If so, which ones?
Are there missing values anywhere?

As we learned in the “Preparing and formatting assignments” tutorial, the skimr package has a handy function called skim_without_charts that provides a good overview of a data object. This is a rather long function name, and in fact the main function is called skim. However, by default, skim includes small charts in its output, and we don’t want that presently, hence the use of skim_without_charts.

Let’s load that package now:

library(skimr)

And get an overview of the trout dataset, again using the pipe approach:

trout %>%
  skim_without_charts()

Table 5.1: Data summary
Name	Piped data
Number of rows	9
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Site	0	1	10	12	0	3	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Day	0	1	2.00	0.87	1	1	2	3	3
Trout_Caught	0	1	2.78	1.79	0	1	3	4	5

A lot of information is provided in this summary output, so let’s go through it:

The Data Summary shows the name of the object, the number of rows, and the number of columns.
Column type frequency shows how many columns (variables) are of type “character”, which we treat as “categorical”, and how many are “numeric”.
Group variables shows whether any variables are specified as “grouping” variables, something we don’t cover yet.
Then it provides summaries of each of the variables, starting with the character (categorical) variables, followed by the numeric variables.
Each summary includes a variety of descriptors, described next.
The “n_missing” descriptor tells you how many values are missing in the given variable, and “complete_rate” is the proportion of values that are not missing. In the “trout” dataset we don’t have any missing values, so “n_missing” is 0 and “complete_rate” is 1 for every variable.
The “n_unique” descriptor for categorical variables indicates how many unique values (categories) are in that variable; for the “Site” variable in the “trout” dataset there are 3 unique values.
The descriptors for the numeric variables include the mean, the standard deviation (sd), and the quantiles: “p0” is the minimum, “p25” the first quartile, “p50” the median, “p75” the third quartile, and “p100” the maximum.

Now you have what you need to answer each of the questions listed above!
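If you want to check a few of these answers without skimr, base R functions can do it. The following is a minimal sketch, not part of the original tutorial. It assumes a small stand-in data frame, given the hypothetical name trout_six, that holds only the six rows of “trout” printed by head later in this section; the full dataset has 9 rows.

```r
# Stand-in for "trout": ONLY the six rows shown in the head() output
# of this section. The object name trout_six is hypothetical.
trout_six <- data.frame(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = c(1, 2, 3, 1, 2, 3),
  Trout_Caught = c(1, 3, 3, 3, 4, 5),
  stringsAsFactors = FALSE
)

n_cols <- ncol(trout_six)               # how many variables (columns)?
n_rows <- nrow(trout_six)               # how many observations (rows)?
col_types <- sapply(trout_six, class)   # which variables are character / numeric?
n_missing <- colSums(is.na(trout_six))  # how many missing values per variable?

n_cols
n_rows
col_types
n_missing
```

With the real “trout” object loaded, the same four function calls can be applied to it directly.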

One additional function that is useful during the overview stage is head. By default, this function shows the first 6 rows of the dataset:

# we can use head(trout), or the pipe approach: 
trout %>%
  head()

## # A tibble: 6 × 3
##   Site           Day Trout_Caught
##   <chr>        <dbl>        <dbl>
## 1 Mabel-lake       1            1
## 2 Mabel-lake       2            3
## 3 Mabel-lake       3            3
## 4 Postill-lake     1            3
## 5 Postill-lake     2            4
## 6 Postill-lake     3            5

When you use head on a “tibble”, like we have here, it outputs another “tibble”, in this case 6 rows by 3 columns. But recall that the full “trout” dataset includes 9 rows and 3 columns.
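The number of rows that head returns can be changed with its n argument, and the companion function tail returns rows from the end instead. The sketch below illustrates both. It is not from the original tutorial, and it assumes a plain base R data frame (hypothetical name trout_six) containing only the six rows printed above, rather than the full 9-row tibble.

```r
# Stand-in for "trout": ONLY the six rows printed above.
# The object name trout_six is hypothetical, and this is a base R
# data frame rather than a tibble.
trout_six <- data.frame(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = c(1, 2, 3, 1, 2, 3),
  Trout_Caught = c(1, 3, 3, 3, 4, 5),
  stringsAsFactors = FALSE
)

first_three <- head(trout_six, n = 3)  # first 3 rows instead of the default 6
last_two <- tail(trout_six, n = 2)     # last 2 rows

first_three
last_two
nrow(trout_six)                        # row count of the full object
```

Comparing nrow of the full object with the output of head is a quick way to remind yourself that head shows only part of the data.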