I currently spend a lot of time figuring out ways to predict ecological networks, and food webs and other antagonistic interactions in particular. But as it turns out, it is very easy to game this problem and obtain fantastic performance For your model. All you have to do is be spectacularly wrong. Don’t try this at home, and please don’t try this at work.
We will only need to use two packages, which if anything, goes to show that even
with the simplest of setups, we can do a lot of damage. We will also grab
Statistics from the standard library, and this should be enough.
using EcologicalNetworks using DecisionTree using Statistics
I apologize in advance for what follows.
Getting the data we have
We will attempt to forecast the host-parasite metaweb from Eurasia, which is my
usual benchmark dataset because it is spatially replicated, has many overlapping
species, and still displays some interesting variation. It comes with
EcologicalNetworks through the
web_of_life function, so we can get it, and
transform it to a bipartite metaweb.
good_dataset(ds) = contains("Hadfield")(ds.Reference) N = map(i -> web_of_life(i.ID), filter(good_dataset, web_of_life())) B = convert.(BipartiteNetwork, N) M = reduce(union, B)
206×121 bipartite ecological network (Bool, String) (L: 2131)
M network is a metaweb with all species and interactions across the 51
locations, and this is what we will use for the prediction. Because the metaweb
stores boolean interactions (yes/no), this is a classification problem. And for
this, we need features.
Faking the data we don’t have
What I would really like to do is predict interactions from species traits. I cannot do that because I do not have species traits. This looks like an unsolvable situation.
Wrong! What I do have is a series of ways to generate pseudo-random numbers. Let’s assume that both the hosts and the parasites have three traits each, and let’s also assume that globally, these traits follow a uniform distribution between 0 and 1, because this has, as you will see, literally no effect on the performance of the prediction.
host_traits = rand(Float64, (3, richness(M; dims=2))) parasite_traits = rand(Float64, (3, richness(M; dims=1)))
Well, that was easy. Now, we want to make predictions, so we will unfold the interaction matrix (this is our target for prediction), and generate a feature vector for every entry in the matrix, which will have 6 entries (the traits of both species). If the interactions are determined by traits, whatever predictive tool we use should figure out associations between both.
Generating features and labels
The next step is to build a matrix of features:
matrix_entries = prod(size(M)) labels = zeros(Bool, matrix_entries) features = zeros(Float64, (6, matrix_entries)) cursor = 0 for i in 1:richness(M; dims=1) for j in 1:richness(M; dims=2) global cursor += 1 labels[cursor] = M[i,j] features[1:3, cursor] = host_traits[:,j] features[4:6, cursor] = parasite_traits[:,i] end end
And we’re good to go. Notice how we haven’t made any effort to make the code efficient or nice? It’s because the problem is, as you will see, so trivial, that we don’t need to do this. Anything goes, my dudes.
Nailing the prediction
Let’s just drop all of this into a random forest. We will need to transpose the
features vectors, because
Flux have different opinions
about how these data should be laid out, and I like
Flux's convention more.
Anyways, we can get a sense of the performance of the model with n-fold cross-validation, using 20 folds, and 3 sub-features every time:
n_folds=20; n_subfeatures=3; accuracy = nfoldCV_forest(labels, features', n_folds, n_subfeatures);
With this done, we can look at the accuracy:
Wow! This is very good! Mission accomplished, we have estimated about 90% of the interactions correctly.
Except we didn’t.
Here’s a fun fact about networks - they are extremely sparse. They tend not to have a lot of interactions at all. We can actually measure the proportion of the matrix which is interactions, and in the case of our metaweb, this is
This means that a model that would predict all zeros would get an accuracy of about
1.0 - connectance(M)
You might notice that this is just about the same as the accuracy of our “model” above. I like this example, because it shows how easy it is to get fooled by the numbers that come out of the packages. An accuracy of 90% seems good (it’s really not that fantastic but ecological data are small and complicated, so it’s nothing to be ashamed of). But in this case, all that it reflects is that most of the dataset is zeroes. A direct consequence of this is: whatever happens to the ones doesn’t matter. By chance, we should get a lot of predicted zeroes, and we will also have a lot of true zeroes, and so the True Negative compartment is going to be so overwhelmingly large that accuracy is bound to look good.
For an in-progress paper, we are working on a worked version of this example where we show that it is possible to produce a model that reaches 95% accuracy on this dataset and that actually means something (I will post this here when the paper is more advanced).
Now is also a good time to remember that we are using random traits. This does not mean that interactions are random - it only means that models are really good at learning something from the data, even if this something is a bunch of nonsense. In this case, the nonsense that the trees learned is “pretty much any combination of values will give you a zero”. I like this example as a cautionary tale - defining a “good” fit for the model demands careful consideration.
There is a very easy way to see that the model is not doing really well - Cohen’s κ. To do it a little more properly, we can get 15000 samples in our training set, and the rest in the testing set.
using StatsBase train = sample(1:length(labels), 15000, replace=false) test = filter(i -> !(i in train), 1:length(labels)) model = build_forest(labels[train], features[:,train]') preds = apply_forest(model, features[:,test]')
Let’s look at the confusion matrix (which
DecisionTree kindly gives us with
the accurracy and κ scores):
The value of κ is around 0.05, which usually indicates “very low accord” between the prediction and reality. Notice that the accuracy on this model is still very high, but the confusion matrix makes it very clear that only 31 interactions were correctly predicted. The model is bad, and we should not use it.
In the next post (or series of posts?), I will show a few ways to get some actual predictions on this dataset, using a mix of good practices and insights about ecology. Stay tuned.