This winter, I am teaching a class on data management and analysis to biology seniors for the second time. Now that the data munging part of it is done, I would like to share an observation, as brief as it is unoriginal, about an underlying principle of data manipulation.
Data manipulation is a difficult thing to teach, because it requires students to tackle several challenges at once. First, thinking about data in a unified and systematic way, as opposed to the ad hoc approach that is acquired through practice without supervision. Second, thinking about the tools used to perform the tasks, and adapting to the very different UI/UX and nomenclature of each. Finally, and this is the part I would like to discuss, thinking about data cleaning/manipulation/reshaping as a process, and therefore thinking of the tools as so many different ways to work through this process.
We had three back-to-back classes on (i) data cleaning with OpenRefine, (ii) programmatic data manipulation with R, and (iii) relational data with SQLite; a very standard walk through the fantastic Data Carpentry material, in the spirit of reusing community material instead of creating our own.
At the end of the final class, we briefly discussed the fact that all of these tools let us do a series of four things, and if you understand these four things, you understand 99.5% of data analysis. In short, there is a grammar for data (and yes, this is the selling point of the tidyverse, but this grammar existed long before the tidyverse, and the tendency of the tidycrowd to appropriate everything and sell it as new irks me).
Even better, these four things can be presented as questions.
What do I want? Or rather, what do I want to leave out? This is the domain of functions like filter, etc. Whether we apply this to columns or rows is not really important; what matters is that we start any problem by defining the scope of the data we need.
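As a minimal sketch of this scoping step (in plain Python rather than the R or SQL used in class, and with invented survey records), filtering rows and selecting columns might look like:

```python
# Invented records standing in for a small survey table.
surveys = [
    {"species": "DM", "sex": "F", "weight": 40},
    {"species": "DM", "sex": "M", "weight": 48},
    {"species": "PB", "sex": "F", "weight": 31},
]

# Leave out what we do not want: keep only the females (rows),
# and only the two columns we actually need.
scoped = [{"species": r["species"], "weight": r["weight"]}
          for r in surveys if r["sex"] == "F"]
```

Whether the scoping happens over rows, columns, or both, the result is the same: a smaller table containing exactly the data the rest of the analysis will touch.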
What do I want to do? This is where we transform (or mutate) our selected rows or columns, or create some more, or merge them, or any other operation we want. This step is usually the meat of the data manipulation process.
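Continuing the same plain-Python sketch (invented data again), the transformation step derives new values from existing ones without changing the number of rows:

```python
# Invented rows; weights recorded in grams.
rows = [{"species": "DM", "weight_g": 40},
        {"species": "PB", "weight_g": 31}]

# Transform: create a new column from an existing one,
# leaving the row count unchanged.
transformed = [dict(r, weight_kg=r["weight_g"] / 1000) for r in rows]
```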
What do I want to know? Is it an average? Is it some other aggregate statistic? This is when summarize-like functions come into play. This step essentially goes from many rows to few.
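In the same plain-Python sketch (invented weights), aggregation collapses many rows into a single summary value:

```python
# Invented weights; summarize many rows into one number.
weights = [40, 48, 31]
mean_weight = sum(weights) / len(weights)
```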
What are my grouping variables? The final step creates our groups in the output. This is GROUP BY in SQL, group_by-type operations in R, and the facets in OpenRefine.
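Combined with the previous step, grouping determines how many rows the summary has: one per group, which is exactly what GROUP BY delivers. A plain-Python sketch with invented data:

```python
from collections import defaultdict

# Invented (species, weight) pairs.
rows = [("DM", 40), ("DM", 48), ("PB", 31)]

# Group rows by species, then summarize within each group:
# the output has one row per grouping value.
groups = defaultdict(list)
for species, weight in rows:
    groups[species].append(weight)
mean_by_species = {sp: sum(ws) / len(ws) for sp, ws in groups.items()}
```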
That’s it. Any additional transformation (re-ordering columns or rows) is cosmetic. Next year, I will feature this approach in a more prominent way in the class. I hope it will make it easier for students to understand that the tools presented have, in fact, more things in common than they have differences.