Data summaries with the “gtsummary” package
This tutorial introduces an alternative to the skimr package for getting overviews of datasets.
The skimr package and its skim_without_charts function can cause issues when knitting to PDF. The gtsummary package appears to have fewer such issues.
If you haven’t already, install the gtsummary package by typing this in your console (do this only once):
install.packages("gtsummary")
Let’s load packages …
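The setup chunk itself is not shown here. A plausible version, assuming the penguins data used later come from the palmerpenguins package and that dplyr supplies select() and the %>% pipe (both are assumptions, not confirmed by the text), might look like this:

```r
# Assumed setup; the tutorial's own chunk is not reproduced in the text.
library(dplyr)           # assumed: provides select() and the %>% pipe
library(gtsummary)       # provides tbl_summary()
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Quick look at the size of the data before summarising it
dim(penguins)
```

If your course materials load a different set of packages (for example tidyverse instead of dplyr), use those instead.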
The key function in the gtsummary package is the tbl_summary function:
?tbl_summary
Take note of the default settings for the “statistic” argument: by default, the function returns the median and interquartile range (reported as the first and third quartiles, Q1 and Q3) for numeric variables, and the count and relative frequency (expressed as a percentage) for categorical variables.
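To make those defaults concrete, the sketch below writes them out explicitly. The template strings are the ones listed on the function's help page at the time of writing, and the penguins data frame is assumed to come from the palmerpenguins package; check ?tbl_summary for your installed version.

```r
library(gtsummary)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Spell out the documented defaults of the `statistic` argument
explicit_defaults <- tbl_summary(
  penguins,
  statistic = list(
    all_continuous()  ~ "{median} ({p25}, {p75})",  # median (Q1, Q3)
    all_categorical() ~ "{n} ({p}%)"                # count (percentage)
  )
)
explicit_defaults
```

Each entry in the list is a formula: the left side selects variables, and the right side is a template in which the names in curly braces are replaced by the computed statistics.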
For details on this function, along with a tutorial, see this webpage.
We’ll get an overview of the data using the tbl_summary function.
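The chunk that produced this overview is not reproduced in the text. A minimal sketch, assuming the data are the penguins data frame from the palmerpenguins package and that the function is called with no extra arguments (both assumptions), would be:

```r
library(dplyr)           # assumed: provides the %>% pipe
library(gtsummary)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# Summarise every column using the default statistics
overview <- penguins %>%
  tbl_summary()
overview
```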
Characteristic | N = 344¹ |
---|---|
species | |
Adelie | 152 (44%) |
Chinstrap | 68 (20%) |
Gentoo | 124 (36%) |
island | |
Biscoe | 168 (49%) |
Dream | 124 (36%) |
Torgersen | 52 (15%) |
bill_length_mm | 44.5 (39.2, 48.5) |
Unknown | 2 |
bill_depth_mm | 17.30 (15.60, 18.70) |
Unknown | 2 |
flipper_length_mm | 197 (190, 213) |
Unknown | 2 |
body_mass_g | 4,050 (3,550, 4,750) |
Unknown | 2 |
sex | |
female | 165 (50%) |
male | 168 (50%) |
Unknown | 11 |
year | |
2007 | 110 (32%) |
2008 | 114 (33%) |
2009 | 120 (35%) |
¹ n (%); Median (Q1, Q3)
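We can also select just a few numeric variables and ask for the mean and standard deviation instead. Note the syntax for the “statistic” argument: it takes a “list” of formulas, each with a variable selector on the left and a template of statistics in curly braces on the right, as follows: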
# Keep two numeric variables and report mean (SD) instead of the default
penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd})"))
Characteristic | N = 344¹ |
---|---|
bill_length_mm | 43.9 (5.5) |
Unknown | 2 |
bill_depth_mm | 17.15 (1.97) |
Unknown | 2 |
¹ Mean (SD)
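Because the “statistic” argument is a list, different variables can be given different statistics in the same call. The sketch below is an illustration rather than part of the original tutorial; it again assumes the penguins data frame comes from the palmerpenguins package, and the object name mixed_stats is made up for this example.

```r
library(dplyr)           # assumed: provides select() and the %>% pipe
library(gtsummary)
library(palmerpenguins)  # assumed source of the `penguins` data frame

# One formula per variable: mean (SD) for bill length,
# median (Q1, Q3) for bill depth
mixed_stats <- penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  tbl_summary(
    statistic = list(
      bill_length_mm ~ "{mean} ({sd})",
      bill_depth_mm  ~ "{median} ({p25}, {p75})"
    )
  )
mixed_stats
```

Naming variables on the left side of each formula, instead of using all_continuous(), is what lets the two rows report different summaries.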
So, if you find yourself running into issues with the skimr package, feel free to use the gtsummary package instead!