The futility of sharing ecological data
Last week, I was part of a very interesting discussion about how data sharing in ecology has, so far, failed. Up to 64% of archived datasets are made public in a way that prevents re-use, but this is not even the biggest problem. We are currently sharing ecological data in a way that is mostly useless.
Why do we share data?
The usual answers are: because data collection is publicly funded; because sharing makes science auditable and transparent; because open data allow anyone to work on them, regardless of whether they belong to a closed circle of colleagues; and finally because these data can be re-purposed for other studies.
Most of these goals require archiving the raw data, the code, and the computational artifacts that were used to produce the paper. The end goal is to replicate or reproduce the original study, either to validate it or to calibrate novel methods and models, and this is best done when the original workflow is available in its entirety.
The last goal (re-purposing for other studies), however, requires depositing the data in databases that are open, persistent, and programmatically searchable. We want to aggregate data across studies (as we illustrated in a paper on food web reconstruction at the global scale), and this is simply not feasible if it means tracking down thousands of studies, understanding how each dataset is structured, and writing custom code to extract every one of them.
These two approaches have different focuses.
The current view is “study-centric”, in that it packages the products of a single study neatly, for anyone to replicate. It is aimed at replicability, but has a limited potential to generate new insights. Oh, and as the statistic of 64% of datasets being unusable shows, we are bad at this anyway (sometimes for lack of training, and sometimes by gaming the system to get the paper published without really following the data publication guidelines).
Moving to a view that would be “data-centric”, in which each data type is assigned to a specified, standard database, would be orders of magnitude better. First, it does not prevent adopting a “study-centric” view, since the various components of the dataset can still be located (and you also know what to expect, since the databases would have a standard format). Second, it allows fast, large-scale synthesis, because a few lines of code are enough to query these databases and get the results. How fast? Look at the rGBIF package tutorial, for example.
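As a minimal sketch of what “a few lines of code” means in practice, here is the same kind of query written with pygbif, GBIF’s Python client (the rGBIF tutorial does the equivalent in R); the species name is just an example:

```python
# A minimal sketch using pygbif, the Python client for the GBIF API;
# the rGBIF package does the same thing in R.
from pygbif import species, occurrences

# Resolve the name against the GBIF taxonomic backbone to get a taxon key.
taxon = species.name_backbone(name="Ursus americanus")

# Retrieve georeferenced occurrence records for that taxon.
results = occurrences.search(taxonKey=taxon["usageKey"], hasCoordinate=True, limit=50)

for record in results["results"]:
    print(record.get("scientificName"), record.get("decimalLatitude"), record.get("decimalLongitude"))
```

That is the entire pipeline from a species name to usable occurrence records; doing the same from a pile of per-study supplementary files would take days.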
So, how do we make it happen?
Slowly.
First, journals need to be more prescriptive about the ways data are archived. Putting the raw data on Dryad or figshare is good, but there should be additional requirements. For example, depositing all occurrence data on GBIF would be easy, and would have an immediate benefit (more data for species distribution models).
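To make this concrete, here is what a minimal GBIF-ready occurrence record looks like. The field names are standard Darwin Core terms (the vocabulary GBIF ingests); the values are invented for illustration, and most field studies already record every one of them:

```python
# A hypothetical occurrence record expressed with standard Darwin Core terms;
# the field names are real Darwin Core, the values are made up for illustration.
occurrence = {
    "occurrenceID":     "urn:uuid:0f5a8c2e-example",  # stable, globally unique identifier
    "basisOfRecord":    "HumanObservation",           # from a controlled vocabulary
    "scientificName":   "Ursus americanus",
    "eventDate":        "2014-06-12",                 # ISO 8601 date
    "decimalLatitude":  45.50,
    "decimalLongitude": -73.57,
    "countryCode":      "CA",
    "recordedBy":       "J. Doe",
}
```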
Second, we need to think about standardizing as many types of ecological data as possible. The work on species occurrences is largely done. We did a lot of work on species interactions (and will expand the data format in the coming months). Functional trait data are the wild west (and the best projects have data access policies that are so bad I won’t even link to them). There is a lot of rote ecoinformatics work to be done, for sure. But this is also deeply interesting work: it requires us to define how we think about data.
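The interaction case shows why this is interesting: the moment you try to write a single record down, you have to decide on directionality, on evidence, and on provenance. A hypothetical sketch (the field names below are my invention, not an existing standard):

```python
# A hypothetical record for one pairwise species interaction; the field names
# are invented for illustration, not taken from an existing standard.
interaction = {
    "taxon_from": "Canis lupus",    # the species the interaction goes from (e.g. the consumer)
    "taxon_to":   "Alces alces",    # the species it goes to (e.g. the resource)
    "type":       "predation",      # ideally from a controlled vocabulary of interaction types
    "evidence":   "gut content",    # how the interaction was established
    "locality":   {"latitude": 47.6, "longitude": -71.2},
    "reference":  "doi:10.0000/example",  # provenance: the study reporting the interaction
}
```

Every one of these fields encodes a decision about what an interaction is, which is exactly the kind of thinking the standardization effort forces us to do.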
My opinion is that we are at a crossroads for the sharing of ecological data. Either we keep on doing business as usual, in which case sharing is unlikely to result in many noteworthy discoveries, and the synthesis effort will continue to give underwhelming results. Or we collectively step up, realize that ecological data are precious and relevant, and start implementing the strategies that will let them reach their full potential.