6.2 Get an overview of the data

The penguins object is a tibble, with each row representing a case and each column representing a variable. A tibble can store variables of different types (numeric, categorical, logical, and so on) in the same object, each as a separate column. This isn’t the case with some other object types: a matrix, for example, can hold only one data type.
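To make the contrast with matrices concrete, here is a minimal sketch using a small made-up tibble (the values and the column names mass_g, species and tagged are illustrative only, and the tibble package is assumed to be installed):

```r
library(tibble)

# A small made-up tibble with one column of each kind (illustrative values only)
toy <- tibble(
  mass_g  = c(3750, 3800, 3250),                      # numeric
  species = factor(c("Adelie", "Gentoo", "Adelie")),  # categorical (factor)
  tagged  = c(TRUE, FALSE, TRUE)                      # logical
)

# Each column keeps its own type
sapply(toy, class)

# A matrix holds a single type, so combining numbers and text
# coerces every value to character
m <- cbind(mass_g = c(3750, 3800, 3250),
           species = c("Adelie", "Gentoo", "Adelie"))
typeof(m)
```

The first call reports a different class for each column, whereas the matrix ends up with a single type for all of its values.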

We’ll get an overview of the data using the skim_without_charts function, as we learned in the Preparing and Importing Tidy Data tutorial:

penguins %>%
  skim_without_charts()
Data summary
Name Piped data
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
species                0           1.00  FALSE           3  Ade: 152, Gen: 124, Chi: 68
island                 0           1.00  FALSE           3  Bis: 168, Dre: 124, Tor: 52
sex                   11           0.97  FALSE           2  mal: 168, fem: 165

Variable type: numeric

skim_variable      n_missing  complete_rate     mean      sd      p0      p25      p50     p75    p100
bill_length_mm             2           0.99    43.92    5.46    32.1    39.23    44.45    48.5    59.6
bill_depth_mm              2           0.99    17.15    1.97    13.1    15.60    17.30    18.7    21.5
flipper_length_mm          2           0.99   200.92   14.06   172.0   190.00   197.00   213.0   231.0
body_mass_g                2           0.99  4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0
year                       0           1.00  2008.03    0.82  2007.0  2007.00  2008.00  2009.0  2009.0

Optionally, we can also view the first ten rows of a tibble by typing the name of the object on its own and pressing return:

penguins 
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
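The printout above lists sex and year only in the footer because they do not fit on screen. As a sketch of two ways to see every column, assuming the dplyr package is installed and that penguins comes from the palmerpenguins package (an assumption; this section does not say where the object was loaded from):

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# One line per variable, showing its type and first few values
glimpse(penguins)

# Standard tibble printout, limited to 5 rows but with all columns shown
print(penguins, n = 5, width = Inf)
```

glimpse() turns the display on its side so that wide datasets stay readable, while the n and width arguments of print() control how many rows and columns of a tibble are shown.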

In a previous tutorial you learned which information to look for when getting an overview of a dataset using the skim_without_charts function.

TIP It’s important to check whether any of the variables in your dataset have missing values. In the penguins dataset, the skim_without_charts output shows that there are 344 cases (rows), but each of the 4 morphometric variables, including body mass, has 2 missing values, and sex has 11. You need to take note of this so that you report the correct sample sizes in any table or figure captions!
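The missing-value check described in the tip can also be done directly with base R functions. This is a minimal sketch, again assuming that penguins comes from the palmerpenguins package; the object names n_missing and n_body_mass are illustrative:

```r
library(palmerpenguins)  # assumed source of the penguins tibble

# Number of missing values (NA) in each column
n_missing <- colSums(is.na(penguins))
n_missing

# Sample size actually available for body mass:
# the number of rows with a non-missing value
n_body_mass <- sum(!is.na(penguins$body_mass_g))
n_body_mass
```

If the data match the skim_without_charts summary above, n_body_mass is 342 (344 rows minus 2 missing values), and that is the sample size to report in a caption for a figure of body mass.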

Once you have an overview of your dataset’s structure and contents, the next order of business is always to visualize your data using graphs and sometimes tables.

  1. Import and data overview: Following the instructions provided in previous tutorials, import the tigerdeaths.csv and birds.csv datasets, and get an overview of each of those datasets.