The grammar of data
This winter, I am teaching a class on data management and analysis to biology seniors, for the second time. Now that the data munging part of it is done, I would like to share an observation which is as brief as it is unoriginal, about an underlying principle of data manipulation.
Data manipulation is a difficult thing to teach, because it requires students to tackle several challenges at once. First, thinking about data in a unified and systematic way, as opposed to the ad hoc approach that is acquired through unsupervised practice. Second, thinking about the tools that perform the tasks, and adapting to the very different UI/UX and nomenclature of each. Finally, and this is the part I would like to discuss, thinking about data cleaning/manipulation/reshaping as a process, and therefore thinking of the tools as so many ways to work through this process.
We had three back-to-back classes on (i) data cleaning with OpenRefine, (ii) programmatic data manipulation with R, and (iii) relational data with SQLite; a very standard walk through the fantastic Data Carpentry material, in the spirit of reusing community material instead of creating
At the end of the final class, we briefly discussed the fact that all of these tools let us do a series of four things, and if you understand these four things, you understand 99.5% of data analysis – in short, there is a grammar for data (and yes, this is the selling point of the tidyverse, but this grammar existed long before the tidyverse, and the tendency of the tidycrowd to appropriate everything and sell it as new irks me).
Even better, these four things can be presented as questions.
What do I want? Or rather, what do I want to leave out? This is the domain of functions like filter, etc. Whether we apply this to columns or rows is not really important; what matters is that we start any problem by defining the scope of the data we need.
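As a minimal sketch of this scoping step, here is what it could look like in SQL, run through Python's built-in sqlite3 module. The `surveys` table, its columns, and its values are made up for illustration; they are not taken from the class material.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (species TEXT, sex TEXT, weight_g REAL)")
con.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [("DM", "F", 40.0), ("DM", "M", 48.0), ("PF", "F", 8.0), ("PF", "M", None)],
)

# Scope the data: keep two columns (SELECT) and leave out the rows
# where no weight was recorded (WHERE).
scoped = con.execute(
    "SELECT species, weight_g FROM surveys "
    "WHERE weight_g IS NOT NULL ORDER BY rowid"
).fetchall()
print(scoped)
```

The same question is answered twice here: once for columns (the SELECT list) and once for rows (the WHERE clause).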
What do I want to do? This is where we transform (or mutate) our selected rows or columns, or create some more, or merge them, or any other operation we want. This step is usually the meat of the data manipulation process.
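A small sketch of a transformation, again in SQL through Python's sqlite3 module, assuming a made-up `surveys` table (the names and values are hypothetical). The new column is computed from an existing one, and the number of rows does not change.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (species TEXT, weight_g REAL)")
con.executemany(
    "INSERT INTO surveys VALUES (?, ?)",
    [("DM", 40.0), ("DM", 48.0), ("PF", 8.0)],
)

# Transform: derive a new column (weight in kilograms) from an existing one.
# One row in, one row out.
transformed = con.execute(
    "SELECT species, weight_g, weight_g / 1000.0 AS weight_kg "
    "FROM surveys ORDER BY rowid"
).fetchall()
for row in transformed:
    print(row)
```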
What do I want to know? Is it an average? Is it some other aggregate statistic? This is when summarize-like functions come into play. This step is essentially going from many rows to few.
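To illustrate the "many to few" idea, here is a sketch using SQL aggregate functions through Python's sqlite3 module. The `surveys` table and its values are hypothetical. Four rows go in and a single row comes out.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (species TEXT, weight_g REAL)")
con.executemany(
    "INSERT INTO surveys VALUES (?, ?)",
    [("DM", 40.0), ("DM", 48.0), ("PF", 8.0), ("PF", None)],
)

# Summarize: collapse all rows into one row of aggregate statistics.
# AVG skips NULL values, while COUNT(*) counts every row.
n_rows, mean_weight = con.execute(
    "SELECT COUNT(*), AVG(weight_g) FROM surveys"
).fetchone()
print(n_rows, mean_weight)
```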
What are my grouping variables? The final step creates our groups in the output. This is GROUP BY in SQL, and other things in R, and the facets
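Putting the four questions together, a single SQL query can answer all of them at once. The sketch below runs it through Python's sqlite3 module on a made-up `surveys` table (names and values are hypothetical), with each clause annotated by the question it answers.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (species TEXT, sex TEXT, weight_g REAL)")
con.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [("DM", "F", 40.0), ("DM", "M", 48.0), ("PF", "F", 8.0), ("PF", "M", None)],
)

query = """
SELECT species,                           -- grouping variable, kept in the output
       AVG(weight_g / 1000.0) AS mean_kg  -- transform (g to kg), then summarize
FROM surveys
WHERE weight_g IS NOT NULL                -- what do I want to leave out?
GROUP BY species                          -- what are my grouping variables?
ORDER BY species                          -- cosmetic re-ordering
"""
per_species = con.execute(query).fetchall()
for species, mean_kg in per_species:
    print(species, mean_kg)
```

The output has one row per group, so the grouping variable decides how "few" the few rows are.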
That’s it. Any additional transformation (re-order columns/rows) is cosmetic. Next year, I will feature this approach in a more prominent way in the class. I hope it will make it easier for students to understand that the tools presented have, in fact, more things in common than they have differences.