When you import data it is always a good idea to immediately get an overview of the data.
Key questions you want to be able to answer are:
- How many variables (columns) are there in the dataset?
- How many observations (rows) are in the dataset?
- Are there variables whose data are categorical? If so, which ones?
- Are there variables whose data are numerical? If so, which ones?
- Are there observations missing anywhere?
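Each of these questions maps onto a short base R call. The following is a minimal sketch rather than the tutorial's own workflow: `trout_demo` is a hypothetical stand-in built from only the six rows of the "trout" dataset that `head` prints later in this section, so its row count differs from the full dataset.

```r
# Hypothetical stand-in: only the six rows of "trout" shown by head() later
# in this section (the full dataset has 9 rows).
trout_demo <- data.frame(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = rep(c(1, 2, 3), times = 2),
  Trout_Caught = c(1, 3, 3, 3, 4, 5),
  stringsAsFactors = FALSE
)

n_vars <- ncol(trout_demo)                # how many variables (columns)?
n_obs <- nrow(trout_demo)                 # how many observations (rows)?
var_types <- sapply(trout_demo, class)    # which variables are character / numeric?
n_missing <- colSums(is.na(trout_demo))   # how many values are missing per variable?

n_vars
n_obs
var_types
n_missing
```

Checking these by hand once makes it easier to read the richer summary described next.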
As we learned in the “Preparing and formatting assignments” tutorial, the skimr package has a handy function called `skim_without_charts` that provides a good overview of a data object. This is a rather long function name, and in fact the main function is called `skim`. However, by default, `skim` includes small charts in its output, and we don’t want that presently, hence the use of `skim_without_charts`.
Let’s load that package now:
library(skimr)
And get an overview of the `trout` dataset, again using the pipe approach:
trout %>% skim_without_charts()
| Number of rows | 9 |
| Number of columns | 3 |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
Variable type: character
Variable type: numeric
A lot of information is provided in this summary output, so let’s go through it:
- The Data Summary shows the name of the object, the number of rows, and the number of columns
- Column type frequency shows how many columns (variables) are of type “character”, which we treat here as “categorical”, and how many are “numeric”
- Group variables shows if there are any variables that are specified as “grouping” variables, something we don’t cover yet.
- Then it provides summaries of each of the variables, starting with the character or categorical variables, followed by the numeric variables
- Each summary includes a variety of descriptors, described next
- The “n_missing” descriptor tells you how many observations are missing in the given variable. In the “trout” dataset we don’t have any missing values
- The “n_unique” descriptor for categorical variables indicates how many unique values (categories) are in that variable; for the “Site” variable in the “trout” dataset there are 3 unique values
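- The descriptors for the numeric variables include the mean, standard deviation (sd), and the quantiles (p0, p25, p50, p75, and p100, i.e. the minimum, first quartile, median, third quartile, and maximum)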
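To see what these descriptors measure, they can be approximated by hand in base R. This is a sketch under stated assumptions, not skimr's internal code: `trout_demo` is a hypothetical stand-in holding only the six rows of "trout" that `head` prints later in this section, so it contains two sites rather than three, and the base R calls are assumed to correspond only roughly to what skimr reports.

```r
# Hypothetical stand-in: only the six rows of "trout" shown by head() later
# in this section (the full dataset has 9 rows and 3 sites).
trout_demo <- data.frame(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = rep(c(1, 2, 3), times = 2),
  Trout_Caught = c(1, 3, 3, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Counterpart of "n_missing": count the NA values in a variable
missing_caught <- sum(is.na(trout_demo$Trout_Caught))

# Counterpart of "n_unique": count the distinct categories in a variable
unique_sites <- length(unique(trout_demo$Site))

# Counterparts of the numeric descriptors
mean_caught <- mean(trout_demo$Trout_Caught)
sd_caught <- sd(trout_demo$Trout_Caught)
quantiles_caught <- quantile(trout_demo$Trout_Caught)  # 0%, 25%, 50%, 75%, 100%

missing_caught
unique_sites
mean_caught
sd_caught
quantiles_caught
```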
Now you have what you need to answer each of the questions listed above!
One additional function that is useful during the overview stage is `head`. By default, this function shows the first 6 rows of the dataset:
# we can use head(trout), or the pipe approach:
trout %>% head()
## # A tibble: 6 × 3
##   Site           Day Trout_Caught
##   <chr>        <dbl>        <dbl>
## 1 Mabel-lake       1            1
## 2 Mabel-lake       2            3
## 3 Mabel-lake       3            3
## 4 Postill-lake     1            3
## 5 Postill-lake     2            4
## 6 Postill-lake     3            5
When you use `head` on a “tibble”, like we have here, it outputs another “tibble”, in this case 6 rows by 3 columns. But recall that the full “trout” dataset includes 9 rows and 3 columns.
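The number of rows returned by `head` can be changed with its `n` argument. A minimal sketch, assuming a hypothetical stand-in data frame `trout_demo` that holds only the six rows printed above (a plain data frame rather than a tibble, so the printed format differs slightly):

```r
# Hypothetical stand-in: the six rows of "trout" printed above
trout_demo <- data.frame(
  Site = rep(c("Mabel-lake", "Postill-lake"), each = 3),
  Day = rep(c(1, 2, 3), times = 2),
  Trout_Caught = c(1, 3, 3, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Ask for the first 3 rows instead of the default 6
first_three <- head(trout_demo, n = 3)
first_three

# dim() reports rows and columns, which confirms how much head() kept
dim(first_three)
dim(trout_demo)
```

Comparing `dim` on the result and on the original object is a quick way to remind yourself that `head` shows only part of the data.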