5.5 Get an overview of a dataset
When you import data, it is always a good idea to get an overview of it right away.
Key questions you want to be able to answer are:
- How many variables (columns) are there in the dataset?
- How many observations (rows) are in the dataset?
- Are there variables whose data are categorical? If so, which ones?
- Are there variables whose data are numerical? If so, which ones?
- Are there observations missing anywhere?
As we learned in the “Preparing and formatting assignments” tutorial, the skimr package has a handy function called skim_without_charts that provides a good overview of a data object. This is a rather long function name, and in fact the main function is called skim. However, by default, skim includes small charts in its output, and we don’t want that presently, hence the use of skim_without_charts.
Let’s load that package now:
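A minimal sketch of this step, assuming the skimr package has already been installed (for example with install.packages("skimr")):

```r
# Attach skimr so that skim() and skim_without_charts() are available
library(skimr)
```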
And get an overview of the trout dataset, again using the pipe approach:
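A minimal sketch of the call, under two assumptions: that the pipe is the %>% operator (loaded here via dplyr), and that the trout object was imported earlier in the tutorial. If trout is not in your workspace, the sketch builds a stand-in from the six rows of the dataset printed further down in this section, in which case the summary will report 6 rows rather than 9.

```r
library(dplyr)   # provides the %>% pipe and tibble()
library(skimr)   # provides skim_without_charts()

# Assumption: `trout` already exists from the import step.
# Fallback stand-in: only the six rows shown in this section, not the full data.
if (!exists("trout")) {
  trout <- tibble(
    Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
    Day = rep(c(1, 2, 3), times = 2),
    Trout_Caught = c(1, 3, 3, 3, 4, 5)
  )
}

# Pipe the dataset into the overview function
trout %>%
  skim_without_charts()
```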
Data summary
Name | Piped data |
Number of rows | 9 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Site | 0 | 1 | 10 | 12 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
Day | 0 | 1 | 2.00 | 0.87 | 1 | 1 | 2 | 3 | 3 |
Trout_Caught | 0 | 1 | 2.78 | 1.79 | 0 | 1 | 3 | 4 | 5 |
A lot of information is provided in this summary output, so let’s go through it:
- The Data Summary shows the name of the object, the number of rows, and the number of columns.
- Column type frequency shows how many columns (variables) are of type “character”, which is equivalent to “categorical”, and how many are “numeric”.
- Group variables shows whether any variables are specified as “grouping” variables, something we don’t cover yet.
- Then it provides summaries of each of the variables, starting with the character or categorical variables, followed by the numeric variables.
- Each summary includes a variety of descriptors, described next.
- The “n_missing” descriptor tells you how many observations are missing in the given variable, and “complete_rate” is the proportion of observations that are not missing. In the “trout” dataset we don’t have any missing values, so n_missing is 0 and complete_rate is 1 for every variable.
- The “n_unique” descriptor for categorical variables indicates how many unique values (categories) are in that variable; for the “Site” variable in the “trout” dataset there are 3 unique values.
- The descriptors for the numeric variables include the mean, standard deviation (sd), and the quantiles: p0 is the minimum, p25 the first quartile, p50 the median, p75 the third quartile, and p100 the maximum.
Now you have what you need to answer each of the questions listed above!
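As a cross-check, the same questions can also be answered one at a time with base R functions. The sketch below is illustrative only: trout_demo is a hypothetical stand-in built from the six rows of the “trout” dataset printed further down in this section, so it reports 6 rows where the full dataset has 9.

```r
library(tibble)

# Hypothetical stand-in: only the six rows of "trout" shown in this section
trout_demo <- tibble(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = rep(c(1, 2, 3), times = 2),
  Trout_Caught = c(1, 3, 3, 3, 4, 5)
)

ncol(trout_demo)             # how many variables (columns)?
nrow(trout_demo)             # how many observations (rows)?
sapply(trout_demo, class)    # "character" = categorical, "numeric" = numerical
colSums(is.na(trout_demo))   # number of missing values in each variable
```

The skim_without_charts summary reports all of this in one call, which is why it is the more convenient first step; the individual functions are handy when you need just one of these numbers.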
One additional function that is useful during the overview stage is head. By default, this function gives you a view of the first 6 rows of the dataset:
## # A tibble: 6 × 3
##   Site           Day Trout_Caught
##   <chr>        <dbl>        <dbl>
## 1 Mabel-lake       1            1
## 2 Mabel-lake       2            3
## 3 Mabel-lake       3            3
## 4 Postill-lake     1            3
## 5 Postill-lake     2            4
## 6 Postill-lake     3            5
When you use head on a “tibble”, like we have here, it outputs another “tibble”, in this case 6 rows by 3 columns. But recall that the full “trout” dataset includes 9 rows and 3 columns.
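The number of rows that head returns can be changed with its n argument. The following is a minimal sketch, again using a hypothetical stand-in (trout_demo) that contains only the six rows printed above rather than the full 9-row dataset:

```r
library(tibble)

# Hypothetical stand-in: only the six rows of "trout" shown in this section
trout_demo <- tibble(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = rep(c(1, 2, 3), times = 2),
  Trout_Caught = c(1, 3, 3, 3, 4, 5)
)

# Ask for the first 3 rows instead of the default 6
first_three <- head(trout_demo, n = 3)
first_three

# The result is itself a tibble, with fewer rows and the same columns
dim(first_three)
```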