We can't see the forest for the bird.

A little while ago, I gave a talk about the promises and challenges of high-performance computing for biodiversity sciences. Because I wanted to go beyond “having more cores means we can run more model replicates”, I started by discussing the availability of data on Canada’s biodiversity, and how we can do data-driven research. Long story short, unless we like birds, we can’t.

As of May 2017, there were about 32 million unique, high-quality, georeferenced observations of Canada’s biodiversity in GBIF. Assuming we are interested in field observations, removing the museum specimens reduces the number of observations to about 24 million. We are already well below the big data threshold (and into the amount of data typical of simulation-driven ecology). But, as I mentioned, a lot of data come from citizens, and a lot of citizens are birders, and a lot of birders use eBird. So what happens if we remove these data?

Brace yourselves.

There are barely 250 thousand observations left, of which about 100 thousand are more recent than 2010 and have no known georeferencing issues.

This is really bad.
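To make the successive filters explicit, here is a minimal sketch in Python of the steps described above: drop museum specimens, drop eBird records, then keep only post-2010 records with no known georeferencing issue. Everything in it is an assumption made for illustration: the `Occurrence` class, its field names, the dataset labels, and the six toy records are hypothetical, and this is not the code used to produce the numbers quoted in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """A hypothetical, simplified occurrence record (not the GBIF schema)."""
    species: str
    basis_of_record: str        # e.g. "HUMAN_OBSERVATION" or "PRESERVED_SPECIMEN"
    dataset: str                # hypothetical label for the source dataset
    year: int
    has_geospatial_issue: bool


def filter_steps(records):
    """Count the records left after each successive filter."""
    counts = {"all": len(records)}

    # Step 1: keep field observations, i.e. drop museum specimens.
    kept = [r for r in records if r.basis_of_record != "PRESERVED_SPECIMEN"]
    counts["field_observations"] = len(kept)

    # Step 2: drop everything contributed through eBird.
    kept = [r for r in kept if r.dataset != "eBird"]
    counts["without_ebird"] = len(kept)

    # Step 3: keep records more recent than 2010 with clean georeferencing.
    kept = [r for r in kept if r.year > 2010 and not r.has_geospatial_issue]
    counts["recent_and_clean"] = len(kept)

    return counts


# Toy data, invented for this example only.
records = [
    Occurrence("Northern Cardinal", "HUMAN_OBSERVATION", "eBird", 2015, False),
    Occurrence("Northern Cardinal", "HUMAN_OBSERVATION", "eBird", 2016, False),
    Occurrence("Northern Cardinal", "HUMAN_OBSERVATION", "eBird", 2009, False),
    Occurrence("American Marten", "PRESERVED_SPECIMEN", "museum", 1978, False),
    Occurrence("American Marten", "HUMAN_OBSERVATION", "other", 2012, False),
    Occurrence("American Marten", "HUMAN_OBSERVATION", "other", 2004, True),
]

counts = filter_steps(records)
print(counts)
```

Even in this toy example, the shape of the problem is the same: most of what survives the first filter disappears as soon as the bird records are removed.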

I am not going to spoil the entire story, since a lab member is working on a paper looking at these trends in great detail, but the take-home message is that this is really, really bad. There are entire taxa for which we have no historical hindsight about what happened, and we do not have enough data to even describe what is happening. The issue is not even that our models to forecast or predict ranges are imperfect; it is that they risk being irrelevant unless we do something about the reporting of species occurrence data.

So what do we do?

First, citizen science will not save us, unless citizen science somehow magically scales up to unexplored areas, and broadens to include all taxa and not just birds. Unless this happens, pretending that it will help advance biodiversity science is a feel-good story.

Second, we (researchers) need to realize that in the overwhelming majority of situations, species occurrences are metadata. Making this information available will not allow anyone to scoop you, and it might even help other scientists with their work. This is a clear-cut case of data sharing being commensalism.

Another interesting consequence is that, for the overwhelming majority of species, every observation counts. Observing another Northern Cardinal (close to 5 million occurrences in GBIF) is going to have virtually no impact on the predictions of its range. Observing an American Marten (under 9,000 occurrences in GBIF) is likely to change the predictions. Maybe we should keep this in mind whenever we sell citizen science as a way to generate data on biodiversity: it is clear that it does generate an enormous amount of data; but if it’s just birds, do these data matter?