8.3 Visualizing association between two numeric variables
We use a scatterplot to show association between two numerical variables.
We’ll use the ggplot function that we’ve seen before, along with geom_point to construct a scatterplot.
We’ll provide an example using the penguins dataset, examining how bill depth and length are associated among the penguins belonging to the Adelie species.
As shown in the tutorial on preparing and formatting assignments, we can use the filter function from the dplyr package to easily subset datasets according to some criterion, such as belonging to a specific category.
penguins %>%
  filter(species == "Adelie") %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(shape = 1) +
  xlab("Bill length (mm)") +
  ylab("Bill depth (mm)") +
  theme_bw()
Figure 8.5: Scatterplot of the association between bill length and depth among 151 Adelie penguins
In the code chunk above, we have:
- the input tibble penguins followed by the pipe (“%>%”)
- the filter function with the criterion used for subsetting, specifically any cases in which the “species” categorical variable equals “Adelie”
- then we provide the ggplot function and its aes argument, specifying the x- and y- variables to be used
- then we use geom_point to tell R to create a scatterplot using points, and specifically “shape = 1” denotes hollow circles
- then we have x and y labels, followed by the theme_bw function telling R to use the black and white theme
Notice that the figure caption indicates the number of observations (sample size) used in the plot. In a previous tutorial it was emphasized that one needs to be careful in tallying the actual number of observations being used in a graph or when calculating descriptive statistics. For example, there is one missing value (“NA”) in the bill measurements for the Adelie penguins, hence the sample size of 151 instead of 152.
Recall that you can use the skim or skim_without_charts functions to get an overview of a dataset or of a single variable in a dataset, and to figure out how many missing values there are for each variable. You can also use the summarise function, as described previously.
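As a minimal sketch of the summarise approach, the code below tallies how many Adelie penguins have both bill measurements available, which is the sample size to report in the figure caption. It assumes that the penguins tibble comes from the palmerpenguins package and that dplyr is installed; the object name adelie_counts and the column names n_total, n_missing, and n_plotted are made up for this illustration.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# Count all Adelie rows, the rows missing either bill measurement,
# and the rows with both measurements present (the points actually drawn)
adelie_counts <- penguins %>%
  filter(species == "Adelie") %>%
  summarise(
    n_total   = n(),
    n_missing = sum(is.na(bill_length_mm) | is.na(bill_depth_mm)),
    n_plotted = sum(!is.na(bill_length_mm) & !is.na(bill_depth_mm))
  )

adelie_counts
```

The n_plotted value is the one to compare against the sample size stated in the caption.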
8.3.1 Interpreting and describing a scatterplot
Things to report when describing a scatterplot:
- is there an association? A “shotgun blast” pattern indicates no. If there is an association, is it positive or negative?
- if there is an association, is it weak, moderate, or strong?
- is the association linear? If not, is there a different pattern like concave down?
- are there any outlier observations that lie far from the general trend?
In the scatterplot above, bill length and depth are positively associated, and the association is moderately strong. There are no observations that are strongly inconsistent with the general trend, though one individual with bill length of around 35 mm and depth of around 21 mm may be somewhat unusual.
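If you want to look more closely at a potentially unusual observation, one option is to filter the data to the region of the plot where it sits. The sketch below is only illustrative: it assumes the penguins tibble comes from the palmerpenguins package, and the cut-off values (bill length below 36 mm, bill depth above 20.5 mm) are rough guesses read off the scatterplot rather than established thresholds. The object name unusual is also made up here.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# Keep Adelie penguins with a short but deep bill;
# the cut-offs are approximate values chosen by eye from the plot
unusual <- penguins %>%
  filter(
    species == "Adelie",
    bill_length_mm < 36,
    bill_depth_mm > 20.5
  )

unusual
```

Note that filter drops rows where a condition evaluates to NA, so penguins with missing bill measurements will not appear in the result.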
- Using the penguins dataset, create a scatterplot of flipper length in relation to body mass, and provide an appropriate figure caption.