# Food web prediction is trivial

## if you do it wrong

I currently spend a lot of time figuring out ways to predict ecological networks, and food webs and other antagonistic interactions in particular. But as it turns out, it is very easy to game this problem and obtain fantastic performance for your model. All you have to do is be spectacularly wrong. Don’t try this at home, and please don’t try this at work.

We will only need two packages, which, if anything, goes to show that even with the simplest of setups, we can do a lot of damage. We will also grab `Statistics` from the standard library, and this should be enough.

```
using EcologicalNetworks
using DecisionTree
using Statistics
```

I apologize in advance for what follows.

## Getting the data we have

We will attempt to forecast the host-parasite metaweb from Eurasia, which is my usual benchmark dataset because it is spatially replicated, has many overlapping species, and still displays some interesting variation. It comes with `EcologicalNetworks` through the `web_of_life` function, so we can get it and transform it into a bipartite metaweb.

```
good_dataset(ds) = contains("Hadfield")(ds.Reference)
N = map(i -> web_of_life(i.ID), filter(good_dataset, web_of_life()))
B = convert.(BipartiteNetwork, N)
M = reduce(union, B)
```

```
206×121 bipartite ecological network (Bool, String) (L: 2131)
```

The `M` network is a metaweb with all species and interactions across the 51 locations, and this is what we will use for the prediction. Because the metaweb stores boolean interactions (yes/no), this is a classification problem. And for this, we need features.

## Faking the data we don’t have

What I would really like to do is predict interactions from species traits. I cannot do that because I do not have species traits. This looks like an unsolvable situation.

**Wrong**! What I do have is a series of ways to generate pseudo-random numbers.
Let’s assume that both the hosts and the parasites have three traits each, and
let’s also assume that globally, these traits follow a uniform distribution
between 0 and 1, because this has, as you will see, literally no effect on the
performance of the prediction.

```
host_traits = rand(Float64, (3, richness(M; dims=2)))      # one column of 3 traits per host
parasite_traits = rand(Float64, (3, richness(M; dims=1)))  # one column per parasite
```

Well, that was easy. Now, we want to make predictions, so we will unfold the interaction matrix (this is our target for prediction), and generate a feature vector for every entry in the matrix, which will have 6 entries (the traits of both species). If the interactions are determined by traits, whatever predictive tool we use should figure out associations between both.

## Generating features and labels

The next step is to build a matrix of features:

```
matrix_entries = prod(size(M))
labels = zeros(Bool, matrix_entries)
features = zeros(Float64, (6, matrix_entries))
cursor = 0
for i in 1:richness(M; dims=1)        # parasites
    for j in 1:richness(M; dims=2)    # hosts
        global cursor += 1
        labels[cursor] = M[i, j]
        features[1:3, cursor] = host_traits[:, j]
        features[4:6, cursor] = parasite_traits[:, i]
    end
end
```

And we’re good to go. Notice how we haven’t made any effort to make the code efficient or nice? It’s because the problem is, as you will see, so trivial, that we don’t need to do this. Anything goes, my dudes.

## Nailing the prediction

Let’s just drop all of this into a random forest. We will need to transpose the features matrix, because `DecisionTree` and `Flux` have different opinions about how these data should be laid out, and I like `Flux`’s convention more.

Anyways, we can get a sense of the performance of the model with n-fold cross-validation, using 20 folds, and 3 sub-features every time:

```
n_folds=20; n_subfeatures=3;
accuracy = nfoldCV_forest(labels, features', n_folds, n_subfeatures);
```

With this done, we can look at the accuracy:

```
mean(accuracy)
```

```
0.9132022471910112
```

Wow! This is very good! Mission accomplished, we have classified about 90% of the entries correctly.

Except we didn’t.

## Wait, what?

Here’s a fun fact about networks - they are *extremely* sparse. They tend not to
have a lot of interactions *at all*. We can actually measure the proportion of
the matrix which is interactions, and in the case of our metaweb, this is

```
connectance(M)
```

```
0.08549305945598973
```

This means that a model that would predict all zeros would get an accuracy of about

```
1.0 - connectance(M)
```

```
0.9145069405440103
```

You might notice that this is just about the same as the accuracy of our “model”
above. I like this example, because it shows how easy it is to get fooled by the
numbers that come out of the packages. An accuracy of 90% seems good (it’s
really not that fantastic but ecological data are small and complicated, so it’s
nothing to be ashamed of). But in this case, all that it reflects is that most
of the dataset is zeroes. A direct consequence of this is: whatever happens to
the ones doesn’t matter. By chance, we should get a lot of predicted zeroes, and
we will also have a lot of true zeroes, and so the True Negative compartment is
going to be so overwhelmingly large that accuracy is *bound* to look good.
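To see just how lopsided this gets, here is a toy calculation with assumed numbers (855 ones out of 10,000 entries, roughly the connectance we measured), pitting the labels against the laziest possible model, one which predicts no interaction anywhere:

```
using Statistics

# Assumed toy labels, not the metaweb itself: 855 interactions in 10_000 entries
y = vcat(trues(855), falses(9_145))
ŷ = falses(length(y))        # the "predict nothing, ever" model
mean(ŷ .== y)                # accuracy of doing nothing: 0.9145
```

An accuracy north of 91%, for a model with no information whatsoever: this is the baseline any honest evaluation of this dataset has to beat.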

For an in-progress paper, we are working on a fuller version of this example, where we show that it is possible to produce a model that reaches 95% accuracy on this dataset and actually means something (I will post it here when the paper is more advanced).

Now is also a good time to remember that we are using *random traits*. This does
not mean that interactions are random - it only means that models are *really
good* at learning something from the data, even if this something is a bunch of
nonsense. In this case, the nonsense that the trees learned is “pretty much any
combination of values will give you a zero”. I like this example as a cautionary
tale - defining a “good” fit for the model demands careful consideration.

There is a very easy way to see that the model is not actually doing well - Cohen’s κ. To do it a little more properly, we can put 15000 samples in our training set, and the rest in the testing set.

```
using StatsBase
train = sample(1:length(labels), 15000, replace=false)
test = filter(i -> !(i in train), 1:length(labels))
model = build_forest(labels[train], features[:,train]')
preds = apply_forest(model, features[:,test]')
```

Let’s look at the confusion matrix (which `DecisionTree` kindly gives us alongside the accuracy and κ scores):

```
confusion_matrix(labels[test], preds)
```

The value of κ is around 0.05, which usually indicates “very low agreement” between the prediction and reality. Notice that the accuracy of this model is still very high, but the confusion matrix makes it very clear that only 31 interactions were correctly predicted. The model is bad, and we should not use it.
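For intuition, κ compares the accuracy we observed to the accuracy we would expect by chance given the class sizes alone. Here is a minimal sketch (my own helper with illustrative names, not something from `DecisionTree`), taking the four cells of a 2×2 confusion matrix:

```
# Cohen's κ from the four cells of a 2×2 confusion matrix
# (tn/fp/fn/tp are illustrative names, not DecisionTree's API)
function cohen_kappa(tn, fp, fn, tp)
    n = tn + fp + fn + tp
    po = (tn + tp) / n                                           # observed accuracy
    pe = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n^2   # accuracy expected by chance
    return (po - pe) / (1 - pe)
end

# An all-zeros "model" on assumed counts matching our connectance:
cohen_kappa(9145, 0, 855, 0)    # accuracy 0.9145, yet κ = 0.0
```

With these counts, the observed and chance-expected accuracies are both 0.9145, so κ collapses to zero: high accuracy, no skill. This is why a κ of about 0.05 is so damning despite the shiny accuracy.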

In the next post (or series of posts?), I will show a few ways to get some
*actual* predictions on this dataset, using a mix of good practices and insights
about ecology. Stay tuned.