Data analysis: content and style separation
One extremely powerful paradigm when writing documents is the separation of content and presentation. Languages like LaTeX, HTML, and Markdown excel at this – for example, the following text
```
# Title

This is a sentence with **important information**
```
can be rendered in a number of different formats, based on the style we decide to apply. Ultimately, producing a document is a matter of both content and styling, but it helps to keep these separate. When I am thinking about the best way to introduce an idea, I don’t really care whether it will be presented in a two-column layout in a serif font, or in a single-column web page. Keeping my content separated (as much as possible) from its styling means that it is also easier to reformat it for a different purpose. It is, in short, Good Practice.
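To make this concrete: the snippet above contains only structure, and every styling decision happens elsewhere. Here is a minimal sketch, in Python with the third-party `markdown` package (my choice of tooling for illustration, not something this post prescribes):

```python
import markdown  # third-party package: pip install markdown

source = """# Title

This is a sentence with **important information**
"""

# The conversion produces structure only: a heading and some emphasis.
# Fonts, columns, and colors are decided later, by whatever stylesheet
# we attach to the resulting page.
html = markdown.markdown(source)
print(html)
# <h1>Title</h1>
# <p>This is a sentence with <strong>important information</strong></p>
```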
And we can do the same for data analysis.
One thing that confuses learners when we develop analytic pipelines is the mix between (i) merging data, (ii) reshaping data, and (iii) transforming data. Basic data analysis has its own grammar (or close enough to it, anyways), but mixing calculations into this grammar somewhat muddles the process.
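To illustrate the distinction, here is a sketch in pandas with a made-up toy dataset (the column names are mine; the post itself is tool-agnostic):

```python
import numpy as np
import pandas as pd

surveys = pd.DataFrame({
    "site": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "count": [12, 15, 7],
})
sites = pd.DataFrame({"site": ["A", "B"], "habitat": ["forest", "meadow"]})

# (i) merging: bringing two sources together on a shared key
merged = surveys.merge(sites, on="site")

# (ii) reshaping: changing the layout without changing the values
wide = merged.pivot(index="site", columns="year", values="count")

# (iii) transforming: computing new values from existing ones
transformed = merged.assign(log_count=lambda d: np.log(d["count"]))
```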
One of the practices I suggest learners adopt is to keep the “content” (data manipulation using group, select, join, and their equivalents) separate from the “style” (the operations that are applied once the data are in the shape we need). In short, treat data as if they were immutable.
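A minimal sketch of what this separation could look like (again in pandas, with a hypothetical toy dataset; the same idea applies to dplyr, DataFrames.jl, and friends):

```python
import pandas as pd

observations = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "species": ["fox", "hare", "fox", "hare"],
    "count": [3, 10, 1, 8],
})
sites = pd.DataFrame({"site": ["A", "B"], "habitat": ["forest", "meadow"]})

# "Content": selecting, joining, and grouping to get the shape we need
shaped = (
    observations.loc[:, ["site", "species", "count"]]  # select
    .merge(sites, on="site")                           # join
    .groupby(["habitat", "species"], as_index=False)   # group
    .agg(total=("count", "sum"))
)

# "Style": calculations applied once the shape is settled; the result
# goes into a new name instead of mutating `shaped` in place
summarized = shaped.assign(share=lambda d: d["total"] / d["total"].sum())
```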
Data are, of course, immensely mutable, and this is what makes them so cool, but also dangerous. I have no illusions about my ability to perfectly understand the state my dataframe is in if I chain together more than five or six operations, especially if some of these involve data transformations or the creation of new variables.
And so to protect my own data analysis from my own scatterbrained self, I try to keep things as separate as possible.
I know long series of piped curried operations look cool, but listen up kiddos, they really don’t. It’s much safer (and a whole lot more boring) to apply small changes that are consistent (in terms of what they do), assign these intermediate artifacts, and work slowly but cautiously. Hi, I hate fun.
The bit about assigning intermediate artifacts (as in, put the result of these operations in a variable instead of piping them directly) is in fact key: not only can you inspect the state of your dataset, you can also save it and checkpoint your code – and if you need to take branching paths, you can reuse an intermediate state.
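For instance, continuing the pandas sketch (Parquet as the checkpoint format is my assumption, and `to_parquet` needs pyarrow or fastparquet installed; any serialization will do):

```python
import pandas as pd

observations = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "species": ["fox", "hare", "fox", "hare"],
    "count": [3.0, None, 1.0, 8.0],
})

# Each step gets its own name, so every intermediate state can be inspected
cleaned = observations.dropna(subset=["count"])

# ...and checkpointed to disk, so a later session can pick up from here
cleaned.to_parquet("checkpoint-cleaned.parquet")

# Branching paths reuse the same intermediate state instead of
# re-running the whole pipeline from the top
by_site = cleaned.groupby("site", as_index=False)["count"].sum()
by_species = cleaned.groupby("species", as_index=False)["count"].sum()
```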
Back in my day, we were warned against writing an entire analysis as a series of lines in a script file. It was a good warning. It now needs to be replaced by a warning against expressing your entire analysis as a single series of pipes. Baby steps.