Advanced research computing for ecology

In a few weeks, I will be giving a talk at the Association Francophone pour le Savoir annual meeting at McGill University, about how advanced research computing (aka high performance computing) can accelerate discoveries in biodiversity sciences and ecology. Collecting data on any ecosystem, no matter how small, is painstaking. It is long. It is expensive. And as a result, we have a relatively small amount of data. So what could advanced research computing possibly deliver?

Synthesis. How about that?

Ecological synthesis is a concept with a lot of definitions, and so I would like to present mine: aggregating the maximum amount of evidence to generate novel knowledge about issues at a scale which is typically larger than the one at which evidence was collected. Or to put it more simply, it is about finding out whether synthetic datasets are more than the sum of their parts. When we suggested the use of synthetic datasets in ecology, the goal was very clearly to put together what we know, to find out things we do not know yet.

It is true that ecological data are small, but there are a lot of them. Not a Big Data lot, not by any standard, but enough that putting them all together can become a problem. In a sense, ecological data are difficult to deal with, not because there are a lot of them, but because the field paid so little attention to making them homogeneous. Understanding how different datasets fit together is a guaranteed exercise in frustration.

It is unlikely that throwing more computing power at existing data will make data synthesis easier. This will remain, for a while, work for the ecologist, sitting at the computer and looking at how data are organized. And once they fit together, there will be gaps. Since additional measurements or observations are usually not an option, filling these gaps will require applying predictive models. This is where advanced research computing will shine: because the nature of the data is varied, and because the gaps are many, the synthesis effort will need high-performance simulations.

And if we can let computers generate predictions based on data once, why not automate the process? If we affect biodiversity and ecosystems in real time, the least we can do is update our predictions at least as fast. Pulling in data in real time, and constantly updating models to generate up-to-date predictions, will require that we have a few virtual machines buzzing steadily somewhere. This is a job for advanced research computing.

And of course, coming up with these (possibly massive) synthetic datasets is only the first step. The second is analysis, and the situation is interesting. First, the volume of data we have is increasing, but not that rapidly. Second, the amount of work we do on these data is increasing, possibly faster than new data arrive. As a consequence, the questions we ask tend to be increasingly refined, to the point where they might require throwing massive computing resources at a problem.

In the last year, I used about 70 core-years on a food web problem (not counting the time it took to generate the dataset); it means it would have taken me 70 years to get the results on a single CPU core. It was about twice as much the year before. Both of these projects gave relatively simple answers to relatively straightforward problems, but in both cases I had to supercompute my way out of the shortage of suitable empirical data. These situations are only going to increase in frequency.
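To make the core-year figure concrete, here is a minimal sketch of the arithmetic in Python. The function name and the core counts are hypothetical, chosen only for illustration, and the calculation assumes perfect parallel scaling (no queue time, no communication overhead), which real jobs rarely achieve.

```python
DAYS_PER_YEAR = 365.25  # average year length, an assumption for this sketch


def wall_clock_days(core_years: float, n_cores: int) -> float:
    """Days of wall-clock time needed to spend `core_years` of compute
    on `n_cores` cores, assuming perfect parallel scaling."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return core_years * DAYS_PER_YEAR / n_cores


if __name__ == "__main__":
    # Hypothetical allocation sizes; 70 core-years is the figure from the text.
    for n_cores in (1, 64, 1024):
        days = wall_clock_days(70, n_cores)
        print(f"{n_cores:>5} cores: {days:10.1f} days")
```

Under these idealized assumptions, the same 70 core-years that would occupy a single core for 70 years fit into roughly 25 days on 1024 cores.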

To summarize, there are two domains of ecological research where advanced research computing will help immensely within the next few years. The first is large scale data synthesis, maybe coupled to real-time data restitution. The second is the growing computational demand of analyses, as the questions we ask become more refined faster than new data arrive. It is unlikely that ecologists will displace physicists, genome scientists, and climatologists as the heaviest users of advanced research computing resources. But we do have a niche to occupy here, and it will take some spotlight on current research, as well as changes in the computational skills we equip students with, to see these methods reach their full potential.