Ecological networks, minus the species

Calculus is a preferable alternative to natural history

Do you know what is a problem when dealing with species interaction networks? Species. Always with the confusing names, and the uncertainty, and the detectability, and all of the associated issues. Life would be so much simpler if we could just get rid of them and focus on the coll stuff, which is to say, the structure of the network. Astute readers might notice how this is a problem, as species interaction networks are literally defined as species, linked together by interactions. But fear not, we are going to mathemagically make the species disappear!

In this post, I will explore an idea I have been playing with for many more years that I am comfortable being explicit about (it’s 8, at least): can we approximate the structure of ecological networks in the situations where we have no information, or no knowledge, on species that are found locally? The answer is “yes, under a limiting set of hypotheses”, and I will present some clues as to why (what I will not do is cite the dozens or so papers that support this entire argument, because this is not a draft, merely a set of notes that are maybe leading up to a draft). Because this is a complex topic, I will divide it in three sections: “lies”, where I will make some simplifying assumptions about networks; “damned lies”, where I will make gross oversimplifcations abouts interactions and how they happen; and finally, “statistics”, where I will borrow the credibility of actual sciences to make these assumptions seem plausible.

Lies

We will simplify the problem by assuming that an ecological network can be represented by its adjacency matrix $\mathbf{A}$, in which $A_{ij}$ is 1 if there is an interaction from species $i$ to species $j$, and 0 otherwise. We know that observing this interaction is the outcome of a probabilistic event, which we should actually express as $\text{Pr}(A_{ij}|C_{ij})$, which is the probability of seeing these two species interact assuming that they are together at the same place ($C_{ij}$).

The first assumption we will make is therefore that we are working within one local pool of species where there is no constraint on co-occurrence. This allows us to assume that $C_{ij} = 1$ for all pairs of species, and so we can start questioning why species interact.

This part is easy. Species interact when they encounter (both in space, which we solved by assuming they do) and locally, which is often summarized as the “neutral” component of network structure. If there are 10000 individuals of species $i$, and 4 of species $j$, an interaction involving $i$ is much more likely to happen and/or be observed. When they encounter, species interact if their traits (and I won’t define this word here because it’s not really important) allow the interaction to happen. This can be a match between gape size and body size, proboscis length and flower depth, viral and host proteins, but it might as well be the projection of various information on species in a multivariate space. Nothing in here is conceptually new; Claude Combes has been writing on this for about as long as I can remember, and I used this idea in more than a few ecological networks papers.

Assuming that we can forget about the issue with encounters, interactions ($A_{ij}$) are simply the result of a function ($L_{ij}$), which tell us if two species are compatible. This is the unmovable, constraint free component of network structure. What does such a function look like? It would take some traits are argument, like $L(x_i, y_j)$, and return either 0, or 1. Easy peasy.

This concludes the “lies” section of this post. Now to make up some real nonsense.

Damned lies

Let’s assume that we have two groups of species (hosts and parasites), where the parasites have a trait whose value comes from a distribution $X$, and the hosts have a trait whose values comes from a distribution $Y$. And let us further assume that we care about parasite $i$, with trait value $x_i$, and host $j$, with trait value $y_j$. These two species have an entry in the matrix $L_{ij}$ which is given by $L(x_i, y_j)$. This is assuming, of course, that they will run into one another. How probable is this event? Well, as we know $X$ and $Y$, we can say that it happens with probability $\text{Pr}(X=x_j|Y=y_j)$, and as we have no reason to assume that these two event are linked by anything, we can write this as $\text{Pr}(X=x_i)\times\text{Pr}(Y=y_j)$.

This is nice! It means that we can rewrite our adjacency matrix $A_{ij}$ as something which is roughly equivalent to $L_{ij}\times\text{Pr}(X=x_i)\times\text{Pr}(Y=y_j)$. Before we move on, let’s simplify the notation a little bit, by noting $\text{Pr}(n_m) = \text{Pr}(N=n_m)$, which will be faster to type and more compact to read. This leaves us with the assumption that

$$A_{ij} \approx L(x_i, y_j)\times\text{Pr}(x_i)\times\text{Pr}(y_j)$$

At this point, we have not really solved the problem of species being a thing. But we are almost done. Because we have expressed the interaction itself as a function on traits, and traits as a distribution, we can move on to the next level of tomfoolery, “statistics”.

Statistics

The first question to ask about a network is “how many interactions does it have?". Because this is related to the number of species, we can phrase the exact same question as “what is its connectance?", or in other words, what is the probability that a pair of species can interact?

We do not need species, or pairs of species, to answer this question!

The expected connectance of a network reduced to a distribution of traits is

$$Co = \int_x \int_y L(x,y) f_X(x) f_Y(y) \cdot dy \cdot dx ,$$

where $f_N$ is the probability density function for a distribution $N$.

Let’s start with a little simulation, shall we? We will use the fantastic Distributions package in Julia to generate some fun results.

using Distributions
using StatsPlots

Now, let’s set up a biological example. We have a virus with an infectivity trait, and a cell with a resistance trait, and the interaction takes places if the resistance is within some range around the infectivity value. We also will make the assumption that these traits only take values in $[0,1]$. Our function $L(x,y)$ is therefore:

L(x::T, y::T; r::T=0.1) where {T<:Number} = (x-r/2.0) ≤ y ≤ (x+r/2.0)  ? one(T) : zero(T)
L (generic function with 1 method)

We need to define a distribution for the infectivity, and for the resistance, and we will set them both as Betas:

infectivity = Beta(6.0, 8.0)
resistance = Beta(2.0, 8.0)
Distributions.Beta{Float64}(α=2.0, β=8.0)

They have the same dispersal, but different modes, and specifically the resistance of the host is lower than the infectivity of the virus. We can make the hypothesis that the connectance in this system will therefore be large, i.e. larger than what would be expected if both traits would be uniformly distributed, which would be about one minus the range of values in which infection occurs.

We will need to integrate, and so a quick and dirty method that will work well enough is the trapezoidal rule, by which $\int_a^{b}f(x)dx \approx (b-a)(f(a)+f(b))(1/2)$, which can be done in a discrete way by picking a suitably small value for $(b-a)$.

function ∫(x::Array{T},y::Array{T}) where {T <: Number}
  @assert length(x) == length(y)
  S = zero(Float64)
  for i in 2:length(x)
    S += (x[i]-x[i-1])*(y[i]+y[i-1])*0.5
  end
  return S
end
∫ (generic function with 1 method)

We can run a quick check that this works, by checking that the integral of the PDF of the distribution of resistance over the entire interval where it is defined is very close to 1:

τ = collect(0.0:0.001:1.0)
y = pdf.(resistance, τ)
∫(τ, y)
0.999994000012601

At this point, we can measure the connectance of our network. The first question to adress is the distribution of encounters as a function of trait values, and maybe represent it as a matrix:

P = pdf.(infectivity, τ) * pdf.(resistance, τ)'
heatmap(τ, τ, P,
  c=:YlGn,
  aspectratio=1, frame=:box,
  xlim=(0,1), ylim=(0,1),
  xlabel="Resistance", ylabel="Infectivity"
  )

So, what is the expected connectance? Well, we would need to add the information about the actual outcome of the interaction while the two species meet:

A = L.(τ, τ').*P

At this point, we could do something fancy to integrate on both dimensions of this thing, like using the operator that Julia has, but hey, the connectance is simply:

mean(A)
0.09242403318672693

This is relatively large, because the resistance values of hosts are lower, on average, than the infectivity values of the virus. We will confirm this result in a moment, but first I would like to talk about degree distribution.

One interesting question we can now ask is, assuming we get a virus with trait $x$, what proportion of hosts will it be able to infect? This is pleasantly straightforward to calculate, as

$$k|_x = \int_y L(x,y) f_Y(y) \cdot dy$$

Similarly, the proportion of viruses able to infect an host with trait value $y$ is

$$k|_y = \int_x L(x,y) f_X(x) \cdot dx$$

So what if $x = 0.1$?

x = 0.1
kx = ∫(τ, L.(x, τ).*pdf.(resistance, τ))
0.3322944732231605

We can even look at the predicted generality (number of hosts) and vulnerability (number of viruses) by repeating this over multiple values of $x$ and $y$:

Kx = [∫(τ, L.(x, τ).*pdf.(resistance, τ)) for x in τ]
Ky = [∫(τ, L.(τ, y).*pdf.(infectivity, τ)) for y in τ]
plot(
  τ, Kx,
  lab = "Generality",
  xlim = (0,1), ylim = (0,1),
  xlab = "Trait value",
  frame=:box, c=:teal, lw=2.0
  )
plot!(
  τ, Ky,
  lw = 2.0,
  c=:purple, lab="Vulnerability"
  )

All things considered, we expect that viruses with a trait value closer to the average of the host trait distribution will have more interactions, and vice-versa, which is very neatly reflected in this figure.

Simulations

That’s right; time for the fourth circle of hell, wherein we will verify that everything discussed above works. Simulating a network like this is not too difficult - we need to draw a series of traits, and then apply the $L$ function, and then this should be it.

S = (93, 177)
t_virus = sort(rand(infectivity, S[1]))
t_host = sort(rand(resistance, S[2]))
Asim = L.(t_virus, t_host')
heatmap(Asim,
  c=:Greys, leg=false, aspectratio=1,
  xlab="Hosts", ylab="Viruses", frame=:box,
  xlim=(1,S[2]), ylim=(1,S[1]))

The connectance of this network is

sum(Asim)/prod(size(Asim))
0.08924123686288804

You might recognize this value as being generally very close to what we expected the connectance of the network to be for any number of species: 0.092. In fact, let’s play a fun game, and see what the effect of the number of species is on our estimate - by plotting the mean and standard deviation, courtesy of this horrendous code:

x = []
ym = []
ys = []
for s in Int.(round.(10.0.^collect(1:0.2:3); digits=0))
  cos = zeros(Float64, 50)
  for i in 1:length(cos)
    t_virus = sort(rand(infectivity, s))
    t_host = sort(rand(resistance, s))
    Arep = L.(t_virus, t_host')
    cos[i] = mean(Arep)
  end
  push!(x, s)
  push!(ym, mean(cos))
  push!(ys, std(cos))
end
scatter(x, ym, yerr=ys, c=:black, lab="", xlab="Number of species", ylab="Connectance")
xaxis!(:log10)
hline!([mean(A)], c=:grey, ls=:dot, lab="")

Even though the standard deviation is large in small networks, the average estimate is never far off (notice the $y$ axis values). How do we fare on the degree distribution? The prediction is the line, and the observed values are the dots:

plot(τ, Kx.*S[2], c=:teal, lw=2.0, lab="", ylab="Expected degree", xlab="Virus trait value")
scatter!(t_virus, sum(Asim; dims=2), c=:teal, msw=0.0, lab="")
yaxis!((0, S[2]))
xaxis!((0,1))

This is not too bad!

A conclusion

What was the point?

Well, there are a few instances where getting data about species is difficult, but getting data about trait distribution is not. In these cases, we might still want to look at food web properties, and it seems that this is possible even in the absence of species knowledge. If we have models that are really good at predicting traits, then we might get a few interactions to infer the $L$ function, and start making predictions about the network. Coming up with scenarios where this approach is useful is not difficult.

There are a few remaining obstacles (and that’s putting it mildly, and I won’t spoil the ending because this will be discussed in the paper), but it is at least encouraging to see that we do not need perfect knowledge about the species within a network to say something about the network itself.