5.1 Wide Data

Non-tidy data, sometimes called "wide" data because it tends to use more columns and fewer rows, tends to lump several observations together in a single row. It is often easier to collect data in this way.

Say we collected data about the number of trout caught at local lakes across several days. If we used a separate table or sheet of paper for each day, we might end up with the following data tables.

Site          Trout_Caught_Day_1
Mabel-lake    1
Postill-lake  3
Ellison-lake  0

Site          Trout_Caught_Day_2
Mabel-lake    3
Postill-lake  4
Ellison-lake  5

Site          Trout_Caught_Day_3
Mabel-lake    3
Postill-lake  5
Ellison-lake  1

It is also very likely that we set up an Excel sheet with the site in the first column and the fish caught on each day in the subsequent columns, one column per day. Even if we hadn’t collected our data this way, we might be tempted to combine the tables above into this layout for analysis or assignment submission.

Doing this, we’d end up with a table something like the following:

Site          Trout_Caught_Day_1  Trout_Caught_Day_2  Trout_Caught_Day_3
Mabel-lake    1                   3                   3
Postill-lake  3                   4                   5
Ellison-lake  0                   5                   1
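
To make the combining step concrete, here is a minimal sketch in plain Python (no libraries) that builds the wide table from the three per-day tables. The variable names (`day_1`, `daily_tables`, `wide_rows`) are illustrative choices, not something from the original data collection.

```python
# Each per-day table as a dict mapping site -> trout caught
# (values copied from the three tables above).
day_1 = {"Mabel-lake": 1, "Postill-lake": 3, "Ellison-lake": 0}
day_2 = {"Mabel-lake": 3, "Postill-lake": 4, "Ellison-lake": 5}
day_3 = {"Mabel-lake": 3, "Postill-lake": 5, "Ellison-lake": 1}

daily_tables = [day_1, day_2, day_3]

# Build one wide row per site, adding one column for each day.
wide_rows = []
for site in day_1:
    row = {"Site": site}
    for day_number, table in enumerate(daily_tables, start=1):
        row[f"Trout_Caught_Day_{day_number}"] = table[site]
    wide_rows.append(row)

for row in wide_rows:
    print(row)
```

Notice that every extra day of fishing adds another column, which is what makes this layout "wide".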

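But for analysis, that is, for "tidy" data, we want one column per variable. In the wide table the day is hidden inside the column names rather than stored in a column of its own. In this case, we have three variables: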

  • site
  • day
  • quantity caught

So let’s get this cleaned up…
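
As a rough preview of where this is heading, the reshaping can be sketched in plain Python. This is only an illustration of the idea, not the method used in the rest of this chapter; the names `PREFIX`, `tidy_rows` and the output column names (`site`, `day`, `trout_caught`) are assumptions made for this example. In practice a library function such as pandas `melt` or tidyr `pivot_longer` would typically do this work.

```python
# The wide table from above, one dict per row.
wide_rows = [
    {"Site": "Mabel-lake", "Trout_Caught_Day_1": 1, "Trout_Caught_Day_2": 3, "Trout_Caught_Day_3": 3},
    {"Site": "Postill-lake", "Trout_Caught_Day_1": 3, "Trout_Caught_Day_2": 4, "Trout_Caught_Day_3": 5},
    {"Site": "Ellison-lake", "Trout_Caught_Day_1": 0, "Trout_Caught_Day_2": 5, "Trout_Caught_Day_3": 1},
]

# Shared prefix of the per-day columns; the day number follows it.
PREFIX = "Trout_Caught_Day_"

# Emit one tidy row per (site, day) pair: three variables, three columns.
tidy_rows = []
for row in wide_rows:
    for column, value in row.items():
        if column.startswith(PREFIX):
            tidy_rows.append({
                "site": row["Site"],
                "day": int(column[len(PREFIX):]),  # pull the day out of the column name
                "trout_caught": value,
            })

for row in tidy_rows:
    print(row)
```

The 3 sites and 3 days become 9 rows, each holding exactly one observation.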