# Validation metrics for the prediction of species interactions

## accuracy is overrated, and so is everything else

I wrote (and subsequently cut) about 1500 words for a manuscript we are revising, and decided to turn them into this post - it ended up being longer, and a little more opinionated than I thought. In a nutshell, I want to discuss: how we can assess the performance of binary classifiers for species interaction networks, which measures are informative about the ecological constraints, and why you can safely, in a few situations, toss these measures aside and listen to your colleagues instead.

## Why do we validate?

Predicting species interactions is, from a machine learning standpoint,
a binary classification task. Or in other words, we ask a question with
two possible answers, which are in this case “the two species interact” or
“the two species do not interact”. This question is asked by presenting a
model (the classifier) with a series of information (the features), and
looking for what the outcome is. The precise way in which this question
is answered will, of course, vary as a function of the modelling strategy
used. With a random forest for classification, we get two output values
(`true`/`false`). With other methods, we get a single value, which can be
in $[0,1]$, or on another interval. We can also get weights associated to
`true` or `false` and run them through *argmax*.

Arguably the most important thing to do with these models is to validate
their output. This is even more important with ecological networks, because
they suffer from a severe imbalance: there are far fewer species pairs that
interact than there are species pairs that do not, *and* although we can
be confident in the *presence* of interactions, the *absence* of interactions
usually means nothing. Let’s put a pin in this idea; we will get back to it.

The net result of these properties of ecological networks is that it is very easy to do a very bad job, as I discussed with the shortcomings of accuracy; doing a good job is a little bit more difficult, and doing a good enough job that we can start thinking about forecasting is currently out of reach. But looking at multiple measures can actually guide our appraisal of the model; for this reason, I wanted to take a hypothetical network prediction, and guide the way we use the model outputs.

All validation measures rely on the *confusion table*, which is simply
a contingency table filled with the false/true positive/negatives,
based on the testing (or validation) dataset. I will not go into the
detail of how each measure is calculated, because these are easy to
find. Broadly speaking, there are four families of measures we can apply
to a binary classifier. For
each of the measures, I will indicate what is a sign of “success”, and a
way to think about them in ecological terms.
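To make the confusion table concrete, here is a minimal sketch that counts the four outcomes from paired observations and predictions. The `Confusion` type, the `confusion_table` function, and the six-pair testing set are all hypothetical names and data made up for illustration.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    tp: int  # interactions predicted as interactions
    fp: int  # non-interactions predicted as interactions
    fn: int  # interactions predicted as non-interactions
    tn: int  # non-interactions predicted as non-interactions


def confusion_table(observed: Sequence[bool], predicted: Sequence[bool]) -> Confusion:
    """Count the four outcomes over paired observations and predictions."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if pred and obs:
            tp += 1
        elif pred and not obs:
            fp += 1
        elif obs:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


# Hypothetical testing set of six species pairs (True = "the two species interact")
observed = [True, True, False, False, False, True]
predicted = [True, False, False, True, False, True]
table = confusion_table(observed, predicted)
print(table)
```

Every measure discussed in this post is a function of these four counts only.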

## Does it work?

The first family of measures will give us information about the overall skill of the model, in the most basic sense: is it making good predictions.

Measure | Success | Description |
---|---|---|
Random accuracy | | Fraction of correct predictions if the classifier is random |
Accuracy | $\rightarrow 1$ | Observed fraction of correct predictions |
Balanced accuracy | $\rightarrow 1$ | Average fraction of correct positive and negative predictions |

Random accuracy is an interesting case, as it is comparing the observed
accuracy to a model that would “refuse” to guess, and say `true` half the time,
and `false` the other half. This is your expected accuracy if the model were an
unbiased coin, and as such it tells you more about the data (and specifically
the effect of prevalence on accuracy) than about the model. Still, you would
expect that the observed accuracy is *larger* than the random one.

Balanced accuracy is also accounting for the predictions of positive and
negative outcomes. If the model looks good overall (*e.g.* it predicts only
0 because the connectance is low, thereby having an accuracy of 1-L/S²),
the balanced accuracy would penalize it a lot more by, essentially, saying
“yes, but you are missing all the positive links”.
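As a quick illustration of that penalty, the sketch below scores a classifier that always answers "no interaction" on a hypothetical network (the values of `S` and `L` are made up). Accuracy comes out at 1-L/S², while balanced accuracy drops to one half.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Average of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# Hypothetical network: 20 species and 40 interactions, so connectance is 0.1
S, L = 20, 40
pairs = S * S

# A classifier that predicts "no interaction" for every species pair:
# it misses all L interactions and gets every non-interaction right.
tp, fp, fn, tn = 0, 0, L, pairs - L

acc = accuracy(tp, fp, fn, tn)
bal = balanced_accuracy(tp, fp, fn, tn)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal:.2f}")
```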

It’s interesting to note that for most species interaction applications,
the coin-flip random accuracy is in fact a rather *lenient* baseline against
which to evaluate the accuracy of a classifier. If a classifier were to behave
at random under our ecological assumptions, it would recommend an interaction
L/S² of the time, and no interaction 1-L/S² of the time (because the most random
network model is that any species pair interacts with a probability equal to
connectance). Writing $c = L/S^2$ for connectance, the “ecologically random”
accuracy of such a classifier is $c^2 + (1-c)^2$, which is never lower than
the 0.5 of an unbiased coin flip, and gets close to 1 when connectance is low.

## Is the output biased?

The next family of measures are actually the core components of the confusion table: they give information about the biases of the model.

Measure | Success | Description |
---|---|---|
True Positive Rate | $\rightarrow 1$ | Fraction of interactions predicted |
True Negative Rate | $\rightarrow 1$ | Fraction of non-interactions predicted |
False Positive Rate | $\rightarrow 0$ | Fraction of non-interactions predicted as interactions |
False Negative Rate | $\rightarrow 0$ | Fraction of interactions predicted as non-interactions |

These are easy to make sense of: true positive and true negative are the proportion of network positions that were correctly predicted, and the false positive/negative rates are whatever is left. The shape of the confusion matrix is

$$\begin{bmatrix}TP & FP\\ FN & TN \end{bmatrix} ,$$

so if the diagonal sums to 1 (or to the number of predictions made), then the classifier is perfect. The more the trace of this matrix decreases, the worse the classification becomes.

The three confusion matrices below have the same trace, and the same accuracy, but they are making different types of mistakes when predicting interactions:

$$ \begin{bmatrix}90 & 10\\ 0 & 100 \end{bmatrix} \quad \begin{bmatrix}90 & 5\\ 5 & 100 \end{bmatrix} \quad \begin{bmatrix}90 & 0\\ 10 & 100 \end{bmatrix} $$

From left to right, they are predicting some interactions that don’t exist,
making as many false positives as false negatives, and not predicting
interactions that exist. Depending on the things you care about, these may
not be equivalent *at all* (but we will see other ways to look at these
biases a bit further down).
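The rates for these three matrices can be worked out directly. This is a small sketch in which the `rates` function and the dictionary labels are hypothetical names; the counts are the ones from the matrices above, read as `(TP, FP, FN, TN)`.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy and the four rates derived from a confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }


# (TP, FP, FN, TN) for the three matrices, from left to right
matrices = {
    "extra": (90, 10, 0, 100),     # only false positives
    "balanced": (90, 5, 5, 100),   # as many false positives as false negatives
    "missed": (90, 0, 10, 100),    # only false negatives
}

results = {name: rates(*counts) for name, counts in matrices.items()}
for name, values in results.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

The accuracy is 0.95 in all three cases, but the false positive and false negative rates tell the three classifiers apart.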

## Can we trust the output?

These three measures are probably the most important. Think of them as taking
a step back, and embracing the fact that there are multiple sources of error,
and they all depend on the prevalence in the data. Most of these do not have
a clear-cut ecological interpretation (beyond their statistical meaning,
that is), but they are *fundamentally important* to deciding if the model
is fit for purpose.

Measure | Success | Description |
---|---|---|
ROC-AUC | $\rightarrow 1$ | Proximity to a perfect prediction (ROC-AUC=1) |
Youden’s J | $\rightarrow 1$ | Informedness of predictions (trust in individual predictions) |
Cohen’s $\kappa$ | $\ge 0.5$ | Agreement between the predictions and the testing data |

The area under the ROC curve measures
how closely you can bring the classifier to being a *perfect* classifier,
based on adjustments to the prediction threshold. I like the explanation
of the ROC-AUC in terms of the overlap between the underlying
distributions of features for a positive and negative outcome. The easier it
is to distinguish between the two cases, the higher the ROC-AUC can be. This
is a good measure even for unbalanced data.

Youden’s J (or informedness) is, sort of, your odds of winning a bet against each model prediction. It is calculated as TPR + TNR - 1, which is the same as TPR - FPR. It performs superbly well for the thresholding of unbalanced datasets, and maximizing J is akin to picking the point on the ROC curve that sits farthest above the diagonal.
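Here is a minimal sketch of thresholding with Youden’s J, assuming the model returns one score per species pair and that a pair is called an interaction when its score is at or above the threshold. The scores, labels, and function names are hypothetical.

```python
from typing import Sequence


def youden_j(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """J = TPR + TNR - 1 when calling 'interaction' for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    return tp / (tp + fn) + tn / (tn + fp) - 1


def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Try every observed score as a threshold and keep the one maximizing J."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: youden_j(scores, labels, t))


# Hypothetical, unbalanced testing set: 3 interactions out of 10 species pairs
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [False, False, False, False, True, False, False, True, True, False]

threshold = best_threshold(scores, labels)
print(threshold, round(youden_j(scores, labels, threshold), 3))
```

With these made-up values, the selected threshold recovers all three interactions at the cost of three false positives, which is the kind of trade-off J makes on unbalanced data.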

Finally, Cohen’s $\kappa$ is a measure of the overall “agreement” between
the model predictions and the testing dataset. It takes values in $[-1,1]$,
where $\kappa = -1$ means that you should *always* do the opposite of what the
model says, $\kappa = 0$ means that the model is entirely random (this time
according to the prevalence of interactions, which is to say by respecting
the connectance of the network), and $\kappa = 1$ means that the predictions
are perfect. There is a rich oral history of the cutoffs for an “acceptable”
value of $\kappa$, and usually values of one half and above are assumed to
indicate “strong agreement”.
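For reference, this is a sketch of the usual calculation of $\kappa$ from the confusion matrix, where the agreement expected by chance is derived from the row and column totals. The function name and the counts are hypothetical.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: probability that prediction and data agree if they
    # were independent, given how often each says "yes" and "no".
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    # Note: kappa is undefined when chance == 1 (only one class everywhere)
    return (observed - chance) / (1 - chance)


# Hypothetical confusion matrix with 100 predictions
kappa = cohen_kappa(tp=20, fp=10, fn=5, tn=65)
print(round(kappa, 3))

# Always predicting "no interaction" gives no agreement beyond chance
kappa_all_negative = cohen_kappa(tp=0, fp=0, fn=10, tn=90)

# Getting every single prediction wrong gives the lowest possible value
kappa_inverted = cohen_kappa(tp=0, fp=50, fn=50, tn=0)
```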

## What are we under/over estimating?

Let’s move on to the final step: are we making consistent mistakes during the prediction? These measures are similar in meaning to the raw entries of the confusion matrix, but they incorporate some elements of prevalence in their calculation.

Measure | Success | Description |
---|---|---|
Positive Predictive Value | $\rightarrow 1$ | Confidence in predicted interactions |
Negative Predictive Value | $\rightarrow 1$ | Confidence in predicted non-interactions |
False Omission Rate | $\rightarrow 0$ | Expected proportion of missed interactions |
False Discovery Rate | $\rightarrow 0$ | Expected proportion of wrongly imputed interactions |

The positive and negative predictive values are telling you how much trust
you can have in (respectively) a prediction of “interaction” *vs.* “no
interaction”. In some cases, you may want to optimize one of these values
(as opposed to Youden’s J, but optimizing Youden’s J is essentially going
to balance these two).
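These four measures read the confusion matrix by prediction rather than by observation. A short sketch, with a hypothetical function name and made-up counts:

```python
def predictive_values(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Measures conditioned on what the model predicted."""
    return {
        "PPV": tp / (tp + fp),  # predicted interactions that are real
        "FDR": fp / (tp + fp),  # predicted interactions that are wrong
        "NPV": tn / (tn + fn),  # predicted non-interactions that are real
        "FOR": fn / (tn + fn),  # predicted non-interactions that are wrong
    }


# Hypothetical confusion matrix with 100 predictions
values = predictive_values(tp=20, fp=10, fn=5, tn=65)
print({name: round(value, 3) for name, value in values.items()})
```

PPV and FDR sum to 1, as do NPV and FOR, so each pair carries the same information seen from opposite sides.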

The false omission and discovery rates are a little more interesting, because they can tell you how many interactions you should expect your network to have. Let me explain why. FOR is the expectation that a prediction of “no interaction” is wrong, and FDR is the expectation that a prediction of “interaction” is wrong. If you make $n$ predictions, and get $p\times n$ predicted interactions, and therefore $(1-p)\times n$ predicted non-interactions, you can feed these values to the FOR and FDR.

Out of the $p\times n$ predicted interactions, we can assume that $\text{FDR}\times p\times n$ are wrongly imputed, and that $(1-\text{FDR})\times p\times n$ are correct. We can decompose the $(1-p)\times n$ predicted non interactions into $(1-\text{FOR})\times (1-p)\times n$ correct non-interactions, and $\text{FOR}\times (1-p)\times n$ false non-interactions. Go ahead, further assume that these are all independent Bernoulli events, and get the variance for these terms. No one can stop you.

But the net result of these decompositions is as follows. If we had an initial network with $L$ interactions and $S$ species, we made $n = S^2-L$ predictions – this is because we can most likely trust that the interactions that are documented are indeed right, and we suspect that the absences of interactions may be false negatives. The actual number of expected interactions is therefore

$$\hat{L} = n\times \left[ \text{FOR}\times (1-p) + (1-\text{FDR})\times p\right]$$

It’s a neat little trick (that has absolutely not been field tested by
anyone!), that can give an estimate of how many interactions you are
missing. This comes, of course, with a gigantic *caveat*: we have no idea
of what is a true negative in most species interaction data, and some of
these measures are going to suffer a poorer performance because of it.
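As a worked example of this estimate, the sketch below adds the predicted interactions expected to be correct to the predicted non-interactions expected to be wrong. The function name, the network size, and the value of `p` are hypothetical; the FOR and FDR are the ones from the mock prediction discussed in the conclusion.

```python
def expected_interactions(n: int, p: float, for_rate: float, fdr: float) -> float:
    """Expected number of interactions among n predictions.

    p is the fraction of predictions that are 'interaction', for_rate is the
    false omission rate, and fdr is the false discovery rate.
    """
    correct_positives = (1 - fdr) * p * n      # predicted interactions that hold
    missed_positives = for_rate * (1 - p) * n  # predicted absences that are wrong
    return correct_positives + missed_positives


# Hypothetical network: 20 species, 40 documented interactions
S, L = 20, 40
n = S**2 - L  # predictions are made for the undocumented pairs only

estimate = expected_interactions(n=n, p=0.2, for_rate=0.10, fdr=0.33)
print(round(estimate, 2))
```

With these values, roughly 77 of the 360 undocumented pairs would be expected to interact, although only 72 were predicted as interactions.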

## Conclusion: dissecting a mock prediction

And now, it is time for an example. Let’s move through values for a mock prediction, and see whether we want to use the predicted network:

Measure | Value |
---|---|
Random accuracy | 0.56 |
Accuracy | 0.81 |
Balanced accuracy | 0.80 |
True Positive Rate | 0.77 |
True Negative Rate | 0.83 |
False Positive Rate | 0.16 |
False Negative Rate | 0.22 |
ROC-AUC | 0.86 |
Youden’s J | 0.60 |
Cohen’s $\kappa$ | 0.58 |
Positive Predictive Value | 0.66 |
Negative Predictive Value | 0.89 |
False Omission Rate | 0.10 |
False Discovery Rate | 0.33 |

The first thing of note is the relatively high accuracy: 81% of predictions (overall) are correct. This is a somewhat low value for most classification tasks, but remember that working on interaction data means working with very uncertain negatives, in addition to the low prevalence. This should lead us towards more leniency when determining whether a model is “good”.

The confusion matrix components are relatively good, with 77% of interactions correctly predicted, and 83% of non-interactions correctly predicted. This model is performing far better than expected at random. In fact, the values of Youden’s J and Cohen’s $\kappa$ are both around 0.6, which indicates a “good enough” informedness/agreement. Not great, not terrible.

The negative predictive value is high: the model isn’t making too many spurious non-interaction recommendations. The positive predictive value is a little lower, but still acceptable.

All in all, I would be comfortable using this model as a basis to extrapolate what the actual network would look like.

## Conclusion 2.0: there is more to biology than numbers

Bzzt, wrongo, there absolutely isn’t.

But let’s take several steps back. We may want to predict networks for two very different purposes. One is getting a detailed map of species interactions; in that case, we want to minimize the wrong predictions as much as possible, because we want to discuss interactions, not “the network”. Another is to discuss “the network”, by putting the predictions through some network measures. In that case, it is likely OK to make some mistakes about where specifically interactions fall.

There is yet a higher level of validation, which is the realism of the network
structure, or its feasibility. Is accuracy really high, but your predicted
network has squirrels eating eagles, or *Homo sapiens* as an endoparasite
of some helminth? Then the model is technically correct (the best kind),
but also worthless. “Biological validation” of predictions is even more
difficult than statistical validation, because we do not have guidelines
to hide behind anymore. In a brilliant post about “evolutionary surprise” and
AI over on the VERENA blog, Colin Carlson gives a detailed explanation of what
it takes, from the infrastructure to the knowledge of cellular mechanisms,
to correctly frame a prediction of “H5N8 has made the jump to humans” -
long story short, this is an incredibly difficult task, and it will not
scale to entire networks.

But this is where predictions are going to be important; they can serve for triage. They can be thresholded. They can be sent to colleagues who know their biology really well for a vibe check. Investing more into interaction prediction is not a branching process, it is the establishment of a cycle. Knowing when to trust the measures I discussed is important (it is seriously foundational, and there are others still that I have not discussed because we are already over 2000 words in). But knowing when to stop trusting them, or when to use them to contextualize actual biological and ecological knowledge, is far more important. The cycle should not be measure-predict-validate. It should be measure-predict-validate-discuss, because the models will never, should never, have the last word on the predictions they make.