Data quality is a myth

A common question when aggregating data from multiple sources is, “but how do we know if the data are good?” The short answer, most of the time, is “we don’t.”

And we don’t care.

Data quality is a complex notion, whose definition varies according to where in the Data Life Cycle one currently resides. As a data producer, low data quality may refer to instrument failure, untrained data collectors, or some contextual variable that results in not-quite-perfect data. But as soon as the data is written down, my stance is that its quality stops being something we can define. Instead, the important trait of data is its fitness for purpose.

Let’s assume you measure hematocrit in riverine fish, and because you want to sample blood from live fish, the volume you can get is low, so the data are uncertain (in particular, your estimate of the proportion of white blood cells is barely better than a wild guess). Is this dataset good? It depends.

If your aim is to draw conclusions about fish hematocrit, then no, the dataset is most likely unfit for purpose. But there is still a lot to salvage. First, the identity of the sampled species and their locations are valuable occurrence data; presence-only data, admittedly, but a lot of good science is built on that. Second, you most likely have information about the environment. Maybe you also have information about the sex, body length, and body mass of all sampled individuals. These can be used for different projects: either alternative studies, or data synthesis.

There is rarely such a thing as bad data; there are, however, mismatches between the data and their framing within a research project.

A substantial issue is the difficulty of evaluating fitness for purpose. Ideally, this would be done based on metadata, and metadata standards in ecology & evolution are… well, I originally wrote “insufficient”, but this implies that they exist at all, of which I have yet to see widespread evidence. Our issue is not bad data; our issue is inadequate metadata discipline, which in turn prevents a correct evaluation of fitness for purpose, and ultimately limits the synthesis effort.
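To make “metadata discipline” slightly less abstract, here is a minimal sketch, in Python, of the kind of documentation record that would make such an evaluation possible. It is entirely hypothetical: none of the field names come from an actual standard (EML or Darwin Core are the real things to reach for), and all the values are placeholders; the point is only the level of detail a re-user would need to judge fitness for their own purpose.

```python
import json

# Hypothetical bare-minimum documentation for the fish hematocrit example above.
# These field names are not taken from any real metadata standard; they only
# illustrate what "enough to evaluate fitness for purpose" might contain.
metadata = {
    "title": "Hematocrit measurements in riverine fish",
    "contact": "someone@example.org",  # placeholder
    "temporal_coverage": {"start": "2021-05-01", "end": "2021-09-30"},  # placeholder dates
    "spatial_coverage": {"description": "study watershed (placeholder)", "crs": "EPSG:4326"},
    "methods": (
        "Blood sampled non-lethally from live fish; volumes are small, so the "
        "white blood cell proportions are highly uncertain."
    ),
    "variables": [
        {"name": "species", "type": "string", "description": "taxonomic name"},
        {"name": "latitude", "type": "float", "unit": "decimal degrees"},
        {"name": "longitude", "type": "float", "unit": "decimal degrees"},
        {"name": "body_mass", "type": "float", "unit": "g"},
        {
            "name": "hematocrit",
            "type": "float",
            "unit": "% packed cell volume",
            "caveat": "low blood volume; treat as semi-quantitative",
        },
    ],
}

# Writing this next to the data file means the caveats travel with the dataset,
# which is what makes a later evaluation of fitness for purpose possible at all.
with open("metadata.json", "w") as handle:
    json.dump(metadata, handle, indent=2)
```

Nothing about this is technically hard; the difficulty is entirely in making it a habit.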

Data deposition, in this regard, has done far more harm than good. Even leaving aside the issue of reporting summary data instead of the raw dataset, the lack of shared formats, and the lack of discipline to adhere to the standards that do exist, mean that the total sum of ecological data archived publicly is a sticky mess, and trying to put together more than two datasets is an exercise in dadaism.

Surprisingly, I don’t think the solution to this is more training. Excellent resources are freely available, yet obviously underused. The solution is to take greater responsibility in data deposition, and to do the bare minimum of dataset documentation. Or, in other words, it’s an unsolvable problem.