A recommender for species interactions

What are we having for dinner?

Ecological networks are not that complicated. To illustrate this, we will see how easily we can use Julia to build a recommender system to suggest suitable preys to predators, using the k nearest neighbors algorithm. This post is largely based on @DesjLaig17, which I suggest you take a brief look at (it’s free as in we paid the APC).

The data we have is a list of preys eaten by various predators, which we can get from GitHub and store locally:

# Prepare a folder to store the data
ispath("data") || mkdir("data")
data_folder = joinpath("data", "knn-interactions")
ispath(data_folder) || mkdir(data_folder)

# Download the data
_project_root = "https://raw.githubusercontent.com/PhDP/EcoInter/master/KNN/data/"
int_file = joinpath(data_folder, "interactions.csv")
isfile(int_file) || download(_project_root*"mercure-interactions.csv", int_file)

Every line is a comma separated list of preys, so we can simply turn it into a set, after filtering out the lines with no preys on them:

preys = Set.(split.(filter(l -> length(l)>0, readlines(int_file)), ", "));

Let’s check the diet of the first species:

first(preys)
Set(SubString{String}["405", "432", "392", "391"])

In order to recommend suitable preys, we need to have a measure of distance. We will use one based on the Tanimoto similarity, which is

$$ \text{Tanimoto}(x,y) = \frac{|x\cap y|}{|x\cup y|} $$

Note how close the function is to the mathematical notation, thanks to the unicode goodness:

tanimoto(x::Set{T}, y::Set{T}) where {T} = length(x∩y)/length(x∪y)
tanimoto (generic function with 1 method)

As this is a similarity measure, we will turn it into a distance by substracting it from 1:

D(x::Set{T}, y::Set{T}) where {T} = 1.0 - tanimoto(x,y)
D (generic function with 1 method)

We now have all we need to perform our analysis! Specifically, we want to suggest suitable preys to an organism, based on what organisms with similar diets are eating. To do this, we will define “similar” as being the k closest neighbors, and then we will recommend preys based on how common they are in the diet of these k neighbors. If we need a small number of neighbors to get a good result, then networks are not actually that complicated. Let’s start.

First, because we will use random removal of a species to test our approach, we need to limit ourselves to species with more than 2 preys:

filter!(d -> length(d)>2, preys);

For each of the remaining species, we want to pick one of its preys, remove it, and then look for the closest neighbors absed on the incomplete diet. We can do this using sample, so let’s load StatsBase:

import StatsBase

Let’s define a function to remove a species from the diet. We could do this by adding a method to StatsBase.sample, which would be:

StatsBase.sample(s::T, i::Integer; k...) where {T <: Set} = Set(sample(collect(s), i; k...))

But in tour case, we want to keep track of the identity of the species that was removed, so as to see if our recommendation makes sense, and so we will use something slightly more (but barely) complex:

function leave_one_prey_out(x::T) where {T <: Set}
  kept_preys = StatsBase.sample(collect(x), (length(x)-1); replace=false)
  lost_prey = filter(x -> !(x in kept_preys), collect(x))
  return Set(kept_preys), first(lost_prey)
end
leave_one_prey_out(preys[1])
(Set(SubString{String}["405", "392", "391"]), "432")

We can now get the distances between all other predators and the remaining preys in the diet:

kept, lost = leave_one_prey_out(preys[1])
distances = [D(kept, p) for p in preys[2:end]];

The distances array has the distance between the incomplete diet, and the other (complete) ones. We now need to get the positions that correspond to the closest k neighbors; let’s assume we use $k=5$:

k = 5
closest = partialsortperm(distances, 1:k)
5-element view(::Array{Int64,1}, 1:5) with eltype Int64:
  1
  6
  8
  9
 12

At this point, were we doing a real research project, we could shuffle diets to avoid re-selecting the earliest ones, etc, but this will do for now. The closest neighbors diets are:

closest_diets = preys[2:end][closest]
5-element Array{Set{SubString{String}},1}:
 Set(["405", "432", "392", "391"])
 Set(["405", "432", "392", "391"])
 Set(["405", "432", "392", "391"])
 Set(["405", "432", "392", "391"])
 Set(["405", "432", "392", "391"])

So, this is easy! Our species of interest has 5 closest neighbors with the same diet. To see which one we would recommend, let’s still write some code to count their frequencies:

recommendations = StatsBase.countmap(vcat(collect.(closest_diets)...))
Dict{SubString{String},Int64} with 4 entries:
  "405" => 5
  "432" => 5
  "392" => 5
  "391" => 5

At this point, a common question is How often would the correct species be recommended in the top $n$ choices? So let’s be crafty:

n = 5
picks = zeros(Bool, n)
for rec_number in 1:n
  if length(recommendations) > 0
    # We pick the group with the most votes
    cmax = filter(f -> f.second == maximum(values(recommendations)), recommendations)
    # Then we check if the species we removed is in it
    picks[rec_number] = lost in collect(keys(cmax))
    # Finally we remove all preys with the most vote
    filter!(f -> f.second < maximum(values(recommendations)), recommendations)
  end
end

If the correct prey was found, then the picks array will have a non-false value somewhere, so we can say that a success in prediction is when:

any(picks)
true

Let’s now wrap this up into a function, for the entire dataset:

function dinnertime(preys; k::Int64=3, n::Int64=5)
  @assert 1 ≤ k ≤ (length(preys)-1)
  @assert 1 ≤ n ≤ 20 # This is arbitrary
  success = zeros(Bool, length(preys))
  for (i,diet) in enumerate(preys)
    kept, lost = leave_one_prey_out(diet)
    other_preys = preys[filter(idx -> idx ≠ i, 1:length(preys))]
    distances = [D(kept, p) for p in other_preys]
    closest = partialsortperm(distances, 1:k)
    closest_diets = other_preys[closest]
    recommendations = StatsBase.countmap(vcat(collect.(closest_diets)...))
    picks = zeros(Bool, n)
    for rec_number in 1:n
      if length(recommendations) > 0
        # We pick the group with the most votes
        cmax = filter(f -> f.second == maximum(values(recommendations)), recommendations)
        # Then we check if the species we removed is in it
        picks[rec_number] = lost in collect(keys(cmax))
        # Finally we remove all preys with the most vote
        filter!(f -> f.second < maximum(values(recommendations)), recommendations)
      end
    end
    success[i] = any(picks)
  end
  return sum(success)/length(success)
end
dinnertime (generic function with 1 method)

Let’s now try this, with the default of $k=3$ and the top 5 votes:

dinnertime(preys)
0.8688046647230321

And that’s it! There are a bunch of things we would want to fix to actually *use this - yet, with only three neighbors, we can recommend the correct preys *amongst the first five picks about 85% of the time. As we discuss in *@DesjLaig17, things get even better when we use some information on traits, but *there is an impressive amount of information stored in the networks themselves. *And extracting it takes about a dozen lines of code.