5.1 Wide Data
Non-tidy data, sometimes called "wide" data because it uses more columns but fewer rows, tends to lump several observations together in one row. It is often easier to collect data this way.
Say we collected data about the number of trout caught at local lakes across several days. If we recorded each day's findings in a separate table or on a separate sheet of paper, we might end up with the following data tables.
Site | Trout_Caught_Day_1 |
---|---|
Mabel-lake | 1 |
Postill-lake | 3 |
Ellison-lake | 0 |
Site | Trout_Caught_Day_2 |
---|---|
Mabel-lake | 3 |
Postill-lake | 4 |
Ellison-lake | 5 |
Site | Trout_Caught_Day_3 |
---|---|
Mabel-lake | 3 |
Postill-lake | 5 |
Ellison-lake | 1 |
It is also very likely that we would set up an Excel sheet with the site in the first column and the number of fish caught in the subsequent columns, one column per day. Even if we hadn’t collected our data this way, we might be tempted to combine the tables above like this for analysis or assignment submission.
Doing this, we’d end up with a table something like the following:
Site | Trout_Caught_Day_1 | Trout_Caught_Day_2 | Trout_Caught_Day_3 |
---|---|---|---|
Mabel-lake | 1 | 3 | 3 |
Postill-lake | 3 | 4 | 5 |
Ellison-lake | 0 | 5 | 1 |
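As a rough illustration of that grouping step, here is a minimal sketch in Python using pandas. The choice of pandas and the variable names (`day_1`, `day_2`, `day_3`, `wide`) are assumptions made for this example, not something the text prescribes. It joins the three per-day tables on `Site` to produce the wide table above.

```python
import pandas as pd

sites = ["Mabel-lake", "Postill-lake", "Ellison-lake"]

# One small table per day, mirroring the three separate tables above.
day_1 = pd.DataFrame({"Site": sites, "Trout_Caught_Day_1": [1, 3, 0]})
day_2 = pd.DataFrame({"Site": sites, "Trout_Caught_Day_2": [3, 4, 5]})
day_3 = pd.DataFrame({"Site": sites, "Trout_Caught_Day_3": [3, 5, 1]})

# Join the tables on Site: each day becomes its own column,
# which is exactly what makes the result "wide".
wide = day_1.merge(day_2, on="Site").merge(day_3, on="Site")

print(wide)
```

Each extra day of fishing would add another column rather than more rows, which is why this layout grows sideways.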
But for analysis - for "tidy" data - we want one column per variable. In this case, we have three variables:
- site
- day
- quantity caught
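To see why the wide table falls short of one column per variable, the following sketch (again assuming Python with pandas, with illustrative variable names) inspects the table. Only site has a column of its own. The day is hidden inside the column headers, and the quantity caught is spread across three columns.

```python
import pandas as pd

# The wide table from above, typed in directly.
wide = pd.DataFrame({
    "Site": ["Mabel-lake", "Postill-lake", "Ellison-lake"],
    "Trout_Caught_Day_1": [1, 3, 0],
    "Trout_Caught_Day_2": [3, 4, 5],
    "Trout_Caught_Day_3": [3, 5, 1],
})

# Every column except Site holds "quantity caught" values.
value_columns = [c for c in wide.columns if c != "Site"]

# The "day" variable exists only as the trailing number in each header.
days = [int(c.rsplit("_", 1)[1]) for c in value_columns]

# Total number of individual observations (one per site per day).
n_observations = wide[value_columns].size

print(days)            # [1, 2, 3]
print(n_observations)  # 9
```

So the three rows of the wide table actually hold nine observations, each of which would get its own row in a tidy layout.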
So let’s get this cleaned up…