Science Bioblitz, data quality, and crowd-sourcing

Thanks to support from the Canadian Wildlife Federation, the Québec Centre for Biodiversity Sciences, and the Federation of Students Associations at the Université de Montréal, we organized a science bioblitz at the Laurentians biology field station operated by the Université de Montréal. Now that a good fraction of the data are online, I wanted to have a look at the results.

First, wow.

We had close to 70 amazing volunteer experts, and they have recorded (so far) 1174 observations of 341 species. To give a bit of context, there are 5850 observations in Québec on iNaturalist. This means that, within four days of work, we contributed about one in five of all observations for the province.

There are two reasons that made this possible. The first, obviously, is hard work by the experts. The second, sadly, is the severe lack of data for some regions, which I discussed last week. We are at a point where every data point counts, and all collection efforts should be encouraged. There is work to do in advertising that iNaturalist exists (it is orders of magnitude more active in the US), in addition to advocating for researchers to deposit their occurrence data.

Second, we got really good taxonomic coverage.

One of the challenges was to gather experts with sufficient breadth of expertise to get a broad picture of the overall biodiversity. Looking at the results, we got 444 plant observations (160 species), 290 birds (63 species), 133 fungi (69 species), and 62 amphibians (14 species); everything else was below 50 observations – but the insect data are not here yet.
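For the curious, here is (roughly) how these per-group tallies can be pulled straight from the iNaturalist API. This is a minimal sketch: the project slug is a placeholder, and the counting by `iconic_taxon_name` is an approximation of the groupings above.

```python
# Sketch: tally observations and species per broad taxonomic group
# for an iNaturalist project, using the public v1 API.
from collections import Counter
import requests

API = "https://api.inaturalist.org/v1/observations"
PROJECT = "bioblitz-sbl-2018"  # placeholder project slug, for illustration only

obs_per_group = Counter()
species_per_group = {}

page = 1
while True:
    r = requests.get(API, params={"project_id": PROJECT, "per_page": 200, "page": page})
    r.raise_for_status()
    results = r.json()["results"]
    if not results:
        break
    for obs in results:
        taxon = obs.get("taxon") or {}
        group = taxon.get("iconic_taxon_name") or "Unknown"
        obs_per_group[group] += 1
        # Counting unique taxon names is a rough proxy for species richness.
        species_per_group.setdefault(group, set()).add(taxon.get("name"))
    page += 1

for group, n in obs_per_group.most_common():
    print(f"{group}: {n} observations, {len(species_per_group[group])} species")
```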

Finally, generating research grade data is difficult.

In iNaturalist, there are three levels of quality: casual (a species name and an occurrence), needs ID (one additional piece of evidence, such as a sound or an image), and finally research grade (validated by several community members at the species level). Only the research grade data are sent to GBIF. Of course, research grade is what we would prefer, but it is very difficult to achieve.
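The split between these grades is easy to check, since quality grade is a standard filter of the v1 observations endpoint. A minimal sketch, again with a placeholder project slug:

```python
# Sketch: count observations per quality grade for an iNaturalist project.
import requests

API = "https://api.inaturalist.org/v1/observations"
PROJECT = "bioblitz-sbl-2018"  # placeholder project slug, for illustration only

# total_results in the response is the number of observations matching the
# filters, so one request per grade (with per_page=1) is enough.
for grade in ("casual", "needs_id", "research"):
    r = requests.get(API, params={"project_id": PROJECT, "quality_grade": grade, "per_page": 1})
    r.raise_for_status()
    print(f"{grade}: {r.json()['total_results']} observations")
```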

In practice, 217 plant observations currently qualify as research grade (219 still need identification), 45 for amphibians and fungi, and 36 for fishes. The lowest rate of research grade was for birds: almost no observations qualify, because no pictures were taken.

There is an interesting general discussion to have about identification, expertise, and quality. Relying on “auditable evidence”, i.e. every observation having an associated picture or sound, means that validation can be done by the community, instead of relying entirely on the expert. But on a more practical level, some things are really difficult to take pictures of. eBird data, for example, are uploaded to GBIF with virtually no such check.

There is an interesting tradeoff to explore. When the quantity of data is very low, any single point can have a very strong effect on the output of any analysis. In this situation, it is justifiable to sacrifice data volume for data quality. When the volume of data is large, there is (i) a lot of redundancy and (ii) a lower chance of a single data point having a very strong effect. In a way, the ways iNaturalist and eBird decide which data should be given to the research community are emblematic of two cultures: eBird blindly trusts the expert, and iNaturalist bets on crowd-sourcing.

There is no telling which model is the best one (my own inclinations go towards “trust no one, especially not yourself”). But when discussing the success of different initiatives by using the volume of data as a metric, we should always, always, keep in mind that there are different degrees of filtering applied.