Data is not just data anymore

Last week, I was part of a discussion group with a bunch of very smart people, and we discussed data archival and re-use. One of the points that emerged during the discussion is that data are rarely just data anymore. As data analysis projects become increasingly ambitious, the proportion of steps that are done automatically increases as well.

The raw data, once collected, are fed through code which produces aggregated data, which are then re-organized into figures, plus or minus some amount of statistical analysis and/or simulation. Hopefully, no step of this process is done manually, because manual steps make the process (i) difficult to reproduce and (ii) essentially impossible to audit at a later point. And in a sense, the code used to massage the data into something usable becomes as important as the data themselves.
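To make this concrete, here is a minimal sketch of what a fully scripted pipeline can look like, in Python. The file names and column names ("site", "abundance") are hypothetical; the point is only that every step, from raw data to figure, is code that can be re-run and audited.

```python
# A minimal sketch of a fully scripted pipeline: every step from raw
# data to figure is code, so the whole chain can be re-run and audited.
# File names and column names ("site", "abundance") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: read the raw data exactly as collected; never edit this file.
raw = pd.read_csv("raw_data.csv")

# Step 2: aggregate programmatically (here, mean abundance per site),
# and archive the intermediate product.
aggregated = raw.groupby("site", as_index=False)["abundance"].mean()
aggregated.to_csv("aggregated_data.csv", index=False)

# Step 3: produce the figure from the aggregated data, not by hand.
fig, ax = plt.subplots()
ax.bar(aggregated["site"], aggregated["abundance"])
ax.set_xlabel("Site")
ax.set_ylabel("Mean abundance")
fig.savefig("figure_1.png", dpi=300)
```

Deleting `aggregated_data.csv` and `figure_1.png` and re-running the script should reproduce them exactly; if it does not, something in the chain was done by hand.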

From a raw dataset to a figure, there are infinitely many paths we can take; most of them are wrong, and most of them are wrong in ways that are not trivial. In most cases, I am more willing to trust the data than I am to trust the analysis pipeline that produced the figure – most ecologists have been trained in data collection, but very few have been properly trained in data manipulation.

But this is not the point I want to make. In the context of synthesis, most if not all data come from external sources. This involves a degree of data-mining, where you take your data-pickaxe and go hit some data-rocks until you find data worth using. It is a frustrating process, and one that requires a lot of thinking ahead.

Data synthesis should be approached the way meta-analyses are: define a question, then express a specific set of rules that determine which datasets, datapoints, or papers are retained for the analysis, and use these rules as the basis of the study. There is currently a very hunter-gatherer-like culture in data re-use (I am no exception), and it should be replaced with a more formal approach.

The good thing is that rules about data inclusion can be expressed programmatically. This process, i.e. formalizing the rules for inclusion, the sources of data, and the specific parameters of each search, defines the data, and is part of them. This is all the more important when the data we work with are changing: the same query run on eBird two days ago and today can return different results.
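As an illustration, here is one way such rules could be written down in Python. The criteria, field names, and output file name are hypothetical examples, not a prescription; the point is that the rules and the parameters of the query are explicit, applied uniformly, and archived alongside the results.

```python
# A sketch of inclusion rules written as code rather than applied by hand.
# The criteria and field names below are hypothetical examples; the point
# is that the rules are explicit, applied uniformly, and archived.
import json
import datetime

CRITERIA = {
    "min_year": 2000,       # discard records collected before 2000
    "min_sample_size": 30,  # discard datasets with fewer than 30 observations
}

def retained(record):
    """Return True if a record satisfies every inclusion rule."""
    return (record["year"] >= CRITERIA["min_year"]
            and record["sample_size"] >= CRITERIA["min_sample_size"])

def run_query(records):
    """Filter records, and archive the rules and the date of the query
    alongside the results, since a changing source (e.g. eBird) can
    return different answers to the same request on different days."""
    kept = [r for r in records if retained(r)]
    metadata = {
        "query_date": datetime.date.today().isoformat(),
        "criteria": CRITERIA,
        "n_candidates": len(records),
        "n_retained": len(kept),
    }
    with open("query_metadata.json", "w") as fh:
        json.dump(metadata, fh, indent=2)
    return kept
```

Archiving the date and parameters of the query matters precisely because the source changes: two runs of the same code become distinguishable by their metadata, rather than mysteriously different.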

As Heraclitus said, “No man ever steps in the same river twice, for it is not the same river and he is not the same man”. The river/data are always changing; the least we can do is ensure that the man/code stepping into them remains the same. This starts with recognizing that data are not just data anymore, and archiving our work appropriately.