Measuring the modularity of ecological networks

Every so often, I receive an email with questions about “the way” to measure modularity in ecological networks. This is something which confused me greatly when I started, and so I thought it could be useful to archive this in a blog post. There are a few different ways of tackling the problem of finding the modular structure, so what follows is how I do it.

There are up to four pieces of information we are after when we search for the modular structure of a network: how modular is it overall, how many modules are there, which species is assigned to which module, and is the overall modular structure different from the chance expectation. Let’s take them in reverse order.

Testing a deviation from the chance expectation requires to generate pseudo-random networks, which is a topic I will get back to soon. Assigning each species to a module is what the analysis really is about, and so that’s what we will focus on. Picking the number of modules and the overall modularity is what we do a posteriori.

Let’s start with an example:

using EcologicalNetwork
using StatPlots

N = convert(BinaryNetwork, web_of_life("M_PL_001"))

I like to think of a modularity analysis as having two parts: first, we generate an initial assignment of species to modules. It doesn’t need to be very good. We can for example use a specified number of random modules:

_, m = n_random_modules(10)(N)
m
Dict{String,Int64} with 185 entries:
  "Anarthrophyllum cumingii"   => 3
  "Unidentified sp4 M_PL_001"  => 2
  "Megachile semirufa"         => 9
  "Mesonychium gayi"           => 1
  "Calceolaria arac"           => 1
  "Chaetodemoticus chilensis"  => 10
  "Panur sp1 M_PL_001"         => 8
  "Mustisia sinuata"           => 8
  "Hypsochila wagenknechti"    => 3
  "Unidentified sp16 M_PL_001" => 2
  "Megachile sp2 M_PL_001"     => 2
  "Lupinus microcarpus"        => 7
  "Salpiglossus sp1 M_PL_001"  => 6
  "Unidentified sp5 M_PL_001"  => 10
  "Leucocoryne pauciflora"     => 7
  "Tapinotaspis caerulea"      => 7
  "Spathipalpus philippii"     => 8
  "Mustisia acerosa"           => 2
  "Quinchamalium chilensis"    => 8
  ⋮                            => ⋮

The _, m syntax I use here is not very pretty, and it’s because the modularity functions return the network and its partition. But in this situation, we can get rid of the network. We can measure the overall modular value of this partition (i.e. the assignment of species to modules), and it should be fairly low:

Q(N,m)

-0.0006445622731562846

We need to refine this, and this is the second step of the analysis. One efficient method for bipartite networks is BRIM:

_, m2 = brim(N,m)

We can check that this actually improved the modular structure:

Q(N,m2)

0.46314868670436843

So, done?

Not quite. The initial assignment was random, and the refining step is therefore constrained by it. Some initial assignment schemes also have a random component to them (label propagation, for example), so it makes sense to run them a lot, until we are satisfied that we have identified the maximally modular structure.

One popular approach is to start with a varying number of random modules, and then look at the output. For example, we can pick between 1 and 50 initial modules, and generate 500 random starting partitions, then refine these with BRIM.

random_modules = rand(1:50, 500)
partitions = [brim(n_random_modules(n)(N)...) for n in random_modules]

We can have a look at the distribution of overall modularity values:

histogram(map(p -> Q.(p...), partitions), c=:lightgrey, lc=:grey, bins=50, frame=:box, leg=false)
xaxis!((0,1), "Modularity")
yaxis!((0,100))

The most modular partition would be at the right end of this distribution. We can also check the relationship between number of modules and modularity:

nmod = map(p -> length(unique(collect(values(p[2])))), partitions)
qmod = map(p -> Q.(p...), partitions)
scatter(nmod, qmod, frame=:box, c=:grey, msw=0, alpha=0.4, leg=false)
xaxis!((0,50), "Number of modules")
yaxis!((0,1), "Modularity")

This is a good diagnostic plot: in this situation, it shows that partitions with 10 to 40 modules have approximately the same modularity. Picking the “very best” one may require to bump up the number of simulations!

Let’s look at the very best:

best_idx = filter(i -> qmod[i] == maximum(qmod), eachindex(qmod))
nmod[best_idx]
1-element Array{Int64,1}:
 24

For this particular network, the best partition has 24 modules. But have a look the plot again: this result is difficult to trust, because there are a large number of possible configurations and number of modules that give very similar (high) modularity values. In fact, in the process of writing this post, I got different values every time (it eventually converged when I bumped the number of replicates to 50000, and I of course forgot to save this result).


To summarize:

Detecting the modular structure can be decomposed in three steps. First, we generate a random assingment using either random modules, or a heuristic like LP. Second, we optimize this assignment using a method like BRIM (there are others, but in my experience, BRIM works extremely well for ecology-sized networks). Finally, we compare many runs of this process, and pick the partition which maximizes modularity. If the maximal modularity and number of modules varies in subsequent runs, then the number of replicates used is not high enough.