Beware data science in ecology

I finished reading Everybody Lies, a book on how Big Data is changing the way we can understand how humans think (specifically the subset of humans using Google and Facebook, but that is beside the point). The book relies on a lot of illustrations of numerical experiments, and it was difficult for me (because I am preparing manuscripts and talks on this topic) not to have ecological research as a background task in my brain.

One thing I tell the students in the data science class I give in the winter is that working with large or small amounts of data is difficult for different reasons. In a small dataset, the challenge is finding signal, and then having the statistical power to discuss it. In a large dataset, the challenge is deciding which signal to ignore.

This is because data science (this weird interaction of statistics and machine learning, with the goal of extracting insights from data) is fantastic at three things: identifying signal, matching different signals, and sometimes (because of dimensionality issues and overfitting) creating signal where none exists. This is not very different from some methods we already use in ecology. PCA, for example, is good at placing similar things together, and we in turn are good at deriving meaning from the clustering (whether we do this by reading the figure, or by trying different algorithms until we find one whose output makes “Good Ecological Sense”, is not really important).

The problem with the usual data science algorithms is that, when given a sufficient amount of data, they can do this at scale. It becomes much easier to get dozens of correlations to sift through, and we then have to decide whether to care about each of them or not. If we do not make that decision carefully, we will end up in the situation occupied by evolutionary biology in the late 1970s, when @GoulLewo79, for example, criticized the adaptationists for providing more “just-so stories” than anything else.
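To make the "signal where none exists" point concrete, here is a minimal sketch, not taken from the book or from any real dataset. It simulates purely random "environmental variables" measured at a handful of sites and counts how many pairwise correlations would pass a naive significance threshold. The function name, the number of sites and variables, and the seed are all hypothetical choices for illustration. The threshold of 0.361 is the two-tailed critical value of Pearson's r at the 5% level for 30 observations, so it only applies to that sample size.

```python
import numpy as np


def count_spurious_correlations(n_sites=30, n_variables=50,
                                r_critical=0.361, seed=42):
    """Count pairwise correlations that look 'significant' in pure noise.

    All defaults are illustrative assumptions. r_critical = 0.361 is the
    two-tailed 5% critical value of Pearson's r for n_sites = 30 (28 degrees
    of freedom); it must be changed if n_sites changes.
    """
    rng = np.random.default_rng(seed)

    # Every column is independent Gaussian noise: there is no real signal.
    noise = rng.normal(size=(n_sites, n_variables))

    # Correlation matrix between variables (columns).
    corr = np.corrcoef(noise, rowvar=False)

    # Keep each pair of variables once (upper triangle, no diagonal).
    upper = np.triu_indices(n_variables, k=1)
    pairwise_r = corr[upper]

    n_pairs = pairwise_r.size
    n_flagged = int(np.sum(np.abs(pairwise_r) > r_critical))
    return n_pairs, n_flagged


n_pairs, n_flagged = count_spurious_correlations()
print(f"{n_flagged} of {n_pairs} pairs look 'significant' in pure noise")
```

With 50 variables there are 1225 pairs to test, so at a 5% threshold we should expect on the order of 60 "significant" correlations even though nothing is related to anything else. Each of them is a potential just-so story.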

In a sense, this is because data science is widely used in the business world, which (not shockingly) has both different priorities and different standards of evidence than research does. In fact, Everybody Lies is very explicit about the fact that data science often replaces understanding of the mechanisms, or is the starting point to weave a compelling narrative, often ultimately to sell products. So we can use the algorithms (I am very insistent that we should, and in fact there is a clear movement in that direction), but we need to be aware of their dangers.

Ultimately, this has implications for training. If we want to apply data science to ecology, we have two paths: either we train algorithmically competent students in ecology, or we train ecologists in machine learning. I am a firm believer in the second solution – applying these tools is not the difficult part (and everyone treats them as black boxes anyway…). The difficult part is deciding which ones to apply, based on specific hypotheses and intuitions about the mechanisms. A prediction, no matter how robust, is not going to get us very far if we do not understand the mechanisms involved. For this reason alone, I have more hope in training ecologists in these methods than in the other way around. Proceeding differently leaves us, as a field, wide open to post-hoc theorizing, and this is not a skill we should equip students with.