Validation metrics for the prediction of species interactions

accuracy is overrated, and so is everything else

I wrote (and subsequently cut) about 1500 words for a manuscript we are revising, and decided to turn them into this post - it ended up being longer, and a little more opinionated, than I thought. In a nutshell, I want to discuss: how we can assess the performance of binary classifiers for species interaction networks, which measures are informative about the ecological constraints, and why you can safely, in a few situations, toss these measures aside and listen to your colleagues instead.

Why do we validate?

Predicting species interactions is, from a machine learning standpoint, a binary classification task. In other words, we ask a question with two possible answers, which are in this case “the two species interact” or “the two species do not interact”. This question is asked by presenting a model (the classifier) with a set of information (the features), and looking at the outcome. The precise way in which this question is answered will, of course, vary as a function of the modelling strategy used. With a random forest used for classification, we get one of two output values (true/false). With other methods, we get a single value, which can be in $[0,1]$, or on another interval. We can also get weights associated with the true and false outcomes, and run them through an argmax.
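As a minimal sketch (in Python, with made-up values for three species pairs), here is how these different output formats all reduce to the same binary answer; the arrays and the 0.5 threshold are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical outputs for three species pairs, in three common formats
hard_labels = np.array([True, False, True])            # direct true/false output
scores = np.array([0.82, 0.11, 0.64])                  # a single value in [0, 1]
class_weights = np.array([[0.2, 0.8],                  # weights for [false, true]
                          [0.9, 0.1],
                          [0.4, 0.6]])

# All formats reduce to a binary decision per species pair
from_scores = scores >= 0.5                            # 0.5 is an arbitrary threshold
from_weights = np.argmax(class_weights, axis=1) == 1   # argmax over the two weights

print(hard_labels, from_scores, from_weights)
```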

Arguably the most important thing to do with these models is to validate their output. This is even more important with ecological networks, because they suffer from a severe imbalance: there are far fewer species pairs that interact than there are species pairs that do not, and although we can be confident in the presence of interactions, the absence of an interaction usually means nothing. Let’s put a pin in this idea; we will get back to it.

The net result of these properties of ecological networks is that it is very easy to do a very bad job, as I discussed with the shortcomings of accuracy; doing a good job is a little bit more difficult, and doing a good enough job that we can start thinking about forecasting is currently out of reach. But looking at multiple measures can actually guide our appraisal of the model; for this reason, I wanted to take a hypothetical network prediction, and walk through the way we use the model outputs.

All validation measures rely on the confusion table, which is simply a contingency table filled with the false/true positives/negatives, based on the testing (or validation) dataset. I will not go into the details of how each measure is calculated, because they are easy to find. Broadly speaking, there are four families of measures we can apply to a binary classifier. For each of the measures, I will indicate what is a sign of “success”, and a way to think about them in ecological terms.
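For concreteness, here is a minimal sketch (in Python, with made-up predictions and observations) of how the four entries of the confusion table are counted:

```python
import numpy as np

def confusion_table(predicted, observed):
    """Return (TP, FP, FN, TN) from two boolean arrays of the same length."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    tp = np.sum(predicted & observed)    # interactions predicted as interactions
    fp = np.sum(predicted & ~observed)   # non-interactions predicted as interactions
    fn = np.sum(~predicted & observed)   # interactions predicted as non-interactions
    tn = np.sum(~predicted & ~observed)  # non-interactions predicted as non-interactions
    return tp, fp, fn, tn

# Hypothetical testing set: observed interactions vs. model predictions
observed  = np.array([True, True, False, False, False, True])
predicted = np.array([True, False, False, True, False, True])
print(confusion_table(predicted, observed))  # (2, 1, 1, 2)
```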

Does it work?

The first family of measures will give us information about the overall skill of the model, in the most basic sense: is it making good predictions?

| Measure | Success | Description |
|:---|:---|:---|
| Random accuracy | | Fraction of correct predictions if the classifier is random |
| Accuracy | $\rightarrow 1$ | Observed fraction of correct predictions |
| Balanced accuracy | $\rightarrow 1$ | Average fraction of correct positive and negative predictions |

Random accuracy is an interesting case, as it compares the observed accuracy to that of a model that would “refuse” to guess, saying true half the time, and false the other half. This is your expected accuracy if the model were an unbiased coin, and as such it tells you more about the data (and specifically about the effect of prevalence on accuracy) than about the model. Still, you would expect the observed accuracy to be larger than the random one.

Balanced accuracy also accounts for how well the positive and the negative outcomes are predicted, separately. If the model only looks good overall because it exploits the imbalance (e.g. it predicts only 0 because the connectance is low, thereby reaching an accuracy of $1-L/S^2$), the balanced accuracy would penalize it a lot more by, essentially, saying “yes, but you are missing all the positive links”.

It’s interesting to note that for most species interaction applications, the random accuracy is in fact a very stringent baseline against which to evaluate the accuracy of a classifier. If a classifier were to behave at random under our ecological assumptions, it would recommend an interaction $L/S^2$ of the time, and no interaction $1-L/S^2$ of the time (because the most random network model is that any species pair interacts with a probability equal to connectance). The accuracy of such an “ecologically random” classifier would, in fact, be much higher than the random accuracy assuming an unbiased coin flip.
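A quick back-of-the-envelope illustration (assuming a hypothetical connectance of 0.1, and a testing set with the same prevalence):

```python
# Hypothetical connectance: rho = L / S^2
rho = 0.1

# An unbiased coin flip is right on half of the interactions
# and half of the non-interactions, so its accuracy is always 0.5
coin_accuracy = 0.5 * rho + 0.5 * (1 - rho)

# An "ecologically random" classifier predicts an interaction with
# probability rho, and is right whenever it agrees with the data
eco_random_accuracy = rho * rho + (1 - rho) * (1 - rho)

print(coin_accuracy, eco_random_accuracy)  # 0.5 vs. 0.82
```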

Is the output biased?

The next family of measures is made of the core components of the confusion table: these give information about the biases of the model.

| Measure | Success | Description |
|:---|:---|:---|
| True Positive Rate | $\rightarrow 1$ | Fraction of interactions predicted |
| True Negative Rate | $\rightarrow 1$ | Fraction of non-interactions predicted |
| False Positive Rate | $\rightarrow 0$ | Fraction of non-interactions predicted as interactions |
| False Negative Rate | $\rightarrow 0$ | Fraction of interactions predicted as non-interactions |

These are easy to make sense of: the true positive and true negative rates are the proportions of interactions and non-interactions that were correctly predicted, and the false positive/negative rates are whatever is left. The shape of the confusion matrix is

$$\begin{bmatrix}TP & FP\\ FN & TN \end{bmatrix} ,$$

so if the diagonal sums to 1 (or to the number of predictions made), then the classifier is perfect. The more the trace of this matrix decreases, the worse the classification becomes.

The three confusion matrices below have the same trace, and the same accuracy, but they are making different types of mistakes when predicting interactions:

$$ \begin{bmatrix}90 & 10\\ 0 & 100 \end{bmatrix} \begin{bmatrix}90 & 5\\ 5 & 100 \end{bmatrix} \begin{bmatrix}90 & 0\\ 10 & 100 \end{bmatrix} $$

From left to right, they are predicting some interactions that don’t exist, making as many mistakes in each direction, and not predicting interactions that exist. Depending on the things you care about, these may not be equivalent at all (but we will see other ways to look at these biases a bit further down).
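To make this concrete, here is a sketch computing the four rates for each of these matrices (using the same TP/FP/FN/TN layout as above):

```python
import numpy as np

def rates(m):
    """TPR, TNR, FPR, FNR for a confusion matrix laid out as [[TP, FP], [FN, TN]]."""
    tp, fp, fn, tn = m[0, 0], m[0, 1], m[1, 0], m[1, 1]
    return tp / (tp + fn), tn / (tn + fp), fp / (fp + tn), fn / (fn + tp)

matrices = [
    np.array([[90, 10], [0, 100]]),  # predicts interactions that do not exist
    np.array([[90, 5], [5, 100]]),   # equal mistakes in both directions
    np.array([[90, 0], [10, 100]]),  # misses interactions that exist
]

# Same trace (190) and accuracy (0.95), but different error structures
for m in matrices:
    print(rates(m))
```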

Can we trust the output?

These three measures are probably the most important. Think of them as taking a step back, and embracing the fact that there are multiple sources of error, all of which depend on the prevalence in the data. Most of these measures do not have a clear-cut ecological interpretation (beyond their statistical meaning, that is), but they are fundamentally important in deciding whether the model is fit for purpose.

| Measure | Success | Description |
|:---|:---|:---|
| ROC-AUC | $\rightarrow 1$ | Proximity to a perfect prediction (ROC-AUC=1) |
| Youden’s J | $\rightarrow 1$ | Informedness of predictions (trust in individual predictions) |
| Cohen’s $\kappa$ | $\ge 0.5$ | Agreement between predictions and observations |

The area under the ROC curve measures how closely you can bring the classifier to being a perfect classifier, based on adjustments to the prediction threshold. I like the explanation in the link, which explains the ROC-AUC as the overlap between the underlying distributions for positive and negative outcomes. The easier it is to distinguish between the two cases, the higher the ROC-AUC can be. This is a good measure even for imbalanced data.

Youden’s J (or informedness) is, sort of, your odds of winning a bet placed on each model prediction. It performs superbly well for the thresholding of imbalanced datasets, and maximizing J is akin to picking the point on the ROC curve that is farthest from the no-skill diagonal.

Finally, Cohen’s $\kappa$ is a measure of the overall “agreement” between the model predictions and the testing dataset. It takes values in $[-1,1]$, where $\kappa = -1$ means that you should always do the opposite of what the model says, $\kappa = 0$ means that the model is entirely random (this time according to the prevalence of interactions, which is to say by respecting the connectance of the network), and $\kappa = 1$ means that the predictions are perfect. There is a rich oral history of the cutoffs for an “acceptable” value of $\kappa$, and values above one half are usually assumed to indicate “strong agreement”.
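A minimal sketch of these last two measures, computed from confusion table counts under their standard definitions (the counts below are the ones from the first example matrix above):

```python
def informedness_and_kappa(tp, fp, fn, tn):
    """Youden's J and Cohen's kappa from confusion table counts."""
    n = tp + fp + fn + tn
    tpr = tp / (tp + fn)          # true positive rate (sensitivity)
    tnr = tn / (tn + fp)          # true negative rate (specificity)
    j = tpr + tnr - 1             # Youden's J (informedness)
    p_observed = (tp + tn) / n    # observed agreement (accuracy)
    # expected agreement of a random classifier with the same marginals
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return j, kappa

print(informedness_and_kappa(tp=90, fp=10, fn=0, tn=100))  # approximately (0.91, 0.90)
```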

What are we under/over estimating?

Let’s move on to the final step: are we making consistent mistakes during the prediction? These measures are similar in meaning to the raw entries of the confusion matrix, but they incorporate some elements of prevalence in their calculation.

| Measure | Success | Description |
|:---|:---|:---|
| Positive Predictive Value | $\rightarrow 1$ | Confidence in predicted interactions |
| Negative Predictive Value | $\rightarrow 1$ | Confidence in predicted non-interactions |
| False Omission Rate | $\rightarrow 0$ | Expected proportion of missed interactions |
| False Discovery Rate | $\rightarrow 0$ | Expected proportion of wrongly imputed interactions |

The positive and negative predictive values tell you how much trust you can have in (respectively) a prediction of “interaction” vs. “no interaction”. In some cases, you may want to optimize one of these values specifically (as opposed to Youden’s J; optimizing J is essentially going to balance these two).

The false omission and discovery rates are a little more interesting, because they can tell you how many interactions you should expect your network to have. Let me explain why. FOR is the expectation that a prediction of “no interaction” is wrong, and FDR is the expectation that a prediction of “interaction” is wrong. If you make $n$ predictions, and get $p\times n$ predicted interactions, and therefore $(1-p)\times n$ predicted non-interactions, you can feed these values to the FOR and FDR.

Out of the $p\times n$ predicted interactions, we can assume that $\text{FDR}\times p\times n$ are wrongly imputed, and that $(1-\text{FDR})\times p\times n$ are correct. We can decompose the $(1-p)\times n$ predicted non-interactions into $(1-\text{FOR})\times (1-p)\times n$ correct non-interactions, and $\text{FOR}\times (1-p)\times n$ false non-interactions. Go ahead, further assume that these are all independent Bernoulli events, and get the variance for these terms. No one can stop you.

But the net result of these decompositions is as follows. If we had an initial network with $L$ interactions and $S$ species, we made $n = S^2-L$ predictions – this is because we can most likely trust that the interactions that are documented are indeed right, and we suspect that some of the recorded absences of interactions may in fact be interactions we have not documented. The expected number of interactions hiding among these $n$ predictions is therefore

$$\hat{L} = n\times \left[ \text{FOR}\times (1-p) + (1-\text{FDR})\times p\right]$$

It’s a neat little trick (that has absolutely not been field tested by anyone!), which can give an estimate of how many interactions you are missing. This comes, of course, with a gigantic caveat: we have no idea of what is a true negative in most species interaction data, and some of these measures are going to perform more poorly because of it.
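Under these (untested!) assumptions, the bookkeeping is easy to write down; the values below are purely hypothetical, with FOR and FDR taken from whatever the validation step reported:

```python
def expected_missing_interactions(S, L, p, FOR, FDR):
    """Expected number of interactions hiding among the pairs currently
    recorded as non-interacting, following the decomposition above."""
    n = S**2 - L                       # number of predictions made
    predicted_interactions = p * n     # a fraction p are predicted as interactions
    correct_positives = (1 - FDR) * predicted_interactions
    missed_interactions = FOR * (1 - p) * n
    return correct_positives + missed_interactions

# Hypothetical example: 100 species, 800 documented interactions,
# 10% of predictions are "interaction"
print(expected_missing_interactions(S=100, L=800, p=0.10, FOR=0.10, FDR=0.33))
```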

Conclusion: dissecting a mock prediction

And now, it is time for an example. Let’s move through values for a mock prediction, and see whether we want to use the predicted network:

| Measure | Value |
|:---|:---|
| Random accuracy | 0.56 |
| Accuracy | 0.81 |
| Balanced accuracy | 0.80 |
| True Positive Rate | 0.77 |
| True Negative Rate | 0.83 |
| False Positive Rate | 0.16 |
| False Negative Rate | 0.22 |
| ROC-AUC | 0.86 |
| Youden’s J | 0.60 |
| Cohen’s $\kappa$ | 0.58 |
| Positive Predictive Value | 0.66 |
| Negative Predictive Value | 0.89 |
| False Omission Rate | 0.10 |
| False Discovery Rate | 0.33 |
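As a quick sanity check (using the standard definitions of these measures), some of these values can be recovered from one another:

```python
tpr, tnr = 0.77, 0.83  # true positive and true negative rates from the table

balanced_accuracy = (tpr + tnr) / 2  # 0.80, matches the table
youdens_j = tpr + tnr - 1            # 0.60, matches the table

print(balanced_accuracy, youdens_j)
```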

The first thing of note is the relatively high accuracy: 81% of predictions (overall) are correct. This is a somewhat low value for most classification tasks, but remember that working on interaction data means working with very uncertain negatives, in addition to the low prevalence. This should lead us towards more leniency when determining whether a model is “good”.

The confusion matrix components are relatively good, with 77% of interactions correctly predicted, and 83% of non-interactions correctly predicted. This model is performing far better than expected at random. In fact, the values of Youden’s J and Cohen’s $\kappa$ are both around 0.6, which indicate a “good enough” informedness/agreement. Not bad, not terrible.

The negative predictive value is high: the model isn’t making too many spurious non-interaction recommendations. The positive predictive value is a little lower, but still acceptable.

All in all, I would be comfortable using this model as a basis to extrapolate what the actual network would look like.

Conclusion 2.0: there is more to biology than numbers

Bzzt, wrongo, there absolutely isn’t.

But let’s take several steps back. We may want to predict networks for two very different purposes. One is getting a detailed map of species interactions; in that case, we want to minimize the wrong predictions as much as possible, because we want to discuss interactions, not “the network”. Another is to discuss “the network”, by putting the predictions through some network measures. In that case, it is likely OK to make some mistakes about where specifically interactions fall.

There is yet a higher level of validation, which is the realism of the network structure, or its feasibility. Is accuracy really high, but your predicted network has squirrels eating eagles, or Homo sapiens as an endoparasite of some helminth? Then the model is technically correct (the best kind), but also worthless. “Biological validation” of predictions is even more difficult than statistical validation, because we do not have guidelines to hide behind anymore. In a brilliant post about “evolutionary surprise” and AI over on the VERENA blog, Colin Carlson gives a detailed explanation of what it takes, from the infrastructure to the knowledge of cellular mechanisms, to correctly frame a prediction of “H5N8 has made the jump to humans” - long story short, this is an incredibly difficult task, and it will not scale to entire networks.

But this is where predictions are going to be important; they can serve for triage. They can be thresholded. They can be sent to colleagues who know their biology really well for a vibe check. Investing more into interaction prediction is not a branching process, it is the establishment of a cycle. Knowing when to trust the measures I discussed is important (it is seriously foundational, and there are others still I have not discussed because we are already over 2000 words in). But knowing when to stop trusting them, or when to use them to contextualize actual biological and ecological knowledge, is far more important. The cycle should not be measure-predict-validate. It should be measure-predict-validate-discuss, because the models will never, should never, have the last word on the predictions they make.