Notes on the Gillespie algorithm - part 1

These past few weeks, I have been working on things related to mutations and whatnot. In the course of this exploratory work, I had to revisit some things I last read about during my PhD: how do we decide when a mutation will happen, and who will mutate? This post is an attempt at organizing some notes I took about the Gillespie algorithm.

The @Gill77 algorithm is a way to calculate how far in the future ($\delta_t$) a new event will take place, given the number of possible events ($k$) and the rate at which the event we are looking for happens ($\mu$). With the example of mutations, this becomes relatively straightforward: $k$ is the number of offspring, and $\mu$ is the probability that a given offspring has a mutation in the trait of interest. The product of $k$ and $\mu$ is the rate ($r$). If there are several types of events, each with their own rate, then $r = \sum_i k_i \mu_i$.
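As a quick sketch (the numbers here are made up for illustration), computing the total rate for several event types is a one-liner:

```julia
# Hypothetical example: two event types, each with its own number of
# possible events (k) and per-event probability (μ)
k = [100, 250]
μ = [0.01, 0.002]

# Total rate r = Σᵢ kᵢμᵢ = 100 × 0.01 + 250 × 0.002 = 1.5
r = sum(k .* μ)
```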

In the Gillespie algorithm, the next timestep $\delta_t$ is drawn at random from an exponential distribution. We want the value of $\delta_t$ to become shorter when $k$ or $\mu$ increases (i.e. more potential events or a higher chance of each event happening). One property of the exponential distribution is that $\text{E}[X] = 1/\lambda$, where $\lambda$ is the parameter of the distribution.

If we assume that each offspring will have a 20% chance of mutation ($k=1$, $\mu = 0.2$, $r=0.2$), then we can expect to have a mutation every five generations ($E[X]=1/r=5$). If we double the number of offspring ($k=2$, $r=0.4$), then we should expect to have a mutation every 2.5 generations: $E[X]=1/r=2.5$. In short, when we know the rate $r$, we draw $\delta_t$ from an exponential distribution of parameter $\lambda = r$. Depending on the software used, the exponential distribution function will use either the actual parameter $\lambda$, or the scale ($1/\lambda$). Be careful.
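To make the rate-versus-scale pitfall concrete, here is a minimal sketch using Distributions.jl, which parameterizes `Exponential` by the *scale* $\theta = 1/\lambda$ rather than the rate:

```julia
using Distributions
using Statistics

k, μ = 1, 0.2
r = k * μ            # r = 0.2, so E[δt] = 1/r = 5 generations

# Distributions.jl's Exponential takes the scale (1/λ), not the rate,
# so we pass 1/r — passing r directly would give a mean of 0.2 instead of 5
δt = rand(Exponential(1 / r), 100_000)
mean(δt)             # close to 5.0
```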

Finding $\delta_t$ is the first step. When there are multiple populations involved, the next question is to decide which one will mutate. The Gillespie algorithm assumes that all events are completely random and completely independent, and so all we need to do is to draw one event to happen this timestep. Assuming we have several populations with their own $k_i$ (and the same mutation probability $\mu$), the probability that population $i$ is the one to mutate is

$$p_i = \frac{k_i\mu}{\sum_j k_j\mu}$$

We can then sample the population $i$ that mutates by applying the weights $p_i$.
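With StatsBase.jl, this weighted draw can be sketched as follows (the population sizes are made up for illustration):

```julia
using StatsBase

# Hypothetical populations with different numbers of offspring,
# all with the same mutation probability μ
k = [10, 30, 60]
μ = 0.2

# pᵢ = kᵢμ / Σⱼ kⱼμ — the μ cancels out here, but we keep it for clarity
p = (k .* μ) ./ sum(k .* μ)   # [0.1, 0.3, 0.6]

# sample() accepts unnormalized weights, so we can pass kᵢμ directly
i = sample(1:length(k), Weights(k .* μ))
```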

I guess it’s time for an illustration! Because setting up the code for a proper example with mutations is a little bit overwhelming, I will use something simpler: the growth of two populations of bacteria. The event we want to see happen is a cellular division, and it happens at a rate $\mu$ roughly equal to the number of divisions per hour (now is a good time to mention that the values of everything in the Gillespie algorithm depend on the units in which the model is expressed). This example is not the most biologically appropriate (bacterial division happens after the cell has grown enough, and so events are not random), but it will do for an overview.

If a bacterium doubles on average every 20 minutes, then its rate of doubling is $3\ \text{hr}^{-1}$. Starting from a single cell at $t_0=0.0$, we expect to have a division (and therefore two cells) at $t_1 = t_0+\delta_t \approx 20\ \text{min}$. The next division, at which point we will have three cells, will take place at $t_2 = t_1 + \delta_t \approx 30\ \text{min}$ – at $t_1$, we have $k=2$, $\mu=3$, and therefore $r=6$, which means that $\delta_t$ is $r^{-1}=6^{-1}$ hours. The event at $t_1$ is the division of the initial cell. The event at $t_2$ is the division of either daughter cell. We then expect a second doubling of the population when there are four cells, which happens at $t_2+\delta_t$, where $\delta_t$ is $9^{-1}$ hours – this is approximately 7 minutes. The population will double first after 20 minutes, then after another 17 minutes have passed.
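Summing the expected waiting times at each step gives the expected time until the population reaches four cells:

$$\text{E}[t_3] = \frac{1}{3} + \frac{1}{6} + \frac{1}{9} = \frac{11}{18}\ \text{hr} \approx 36.7\ \text{min}$$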

Let’s write this up (in Julia). We will need a few packages:

```julia
using Distributions
using StatsBase
using StatPlots
```

Next, let’s write a function to calculate $\delta_t$:

```julia
function gillespie(k, μ)
    r = sum(k .* μ)
    return rand(Exponential(1 / r))
end
```

And a function to simulate the growth of populations:

```julia
function simulate(n, g; nmax=1000)
    # Will store the time of events
    t = [0.0]
    # Will store the number of individuals
    N = [copy(n)]
    while sum(last(N)) < nmax
        # When is the next step?
        δt = gillespie(n, g)
        # Which population should have a division?
        double = sample(1:length(n), weights(n .* g))
        # We increase the size of the dividing population, and push!
        n[double] += 1
        push!(N, copy(n))
        push!(t, last(t) + δt)
    end
    return (t, N)
end
```

Let’s test our little prediction that the population will have doubled twice ($n=4$) after about 37 minutes (even with the awfully non-optimal code, this all runs within a second):

```julia
time_until_4 = [last(simulate([1], [3], nmax=4)[1]) * 60 for i in 1:200000]
```

Gillespie sampling{: .center-image}

The vertical line is 37 minutes (the average time in this run was 36.9999… minutes).

Let’s summarize – the Gillespie algorithm works by using the number of possible events, and the rate at which events happen, to randomly generate the time until the next event. Events are picked with a probability proportional to their rate, and are assumed to be independent and entirely random.

One of the issues I ran into when using this algorithm is that the rate of events must be expressed in units that are consistent with both the population sizes and the time. In a surprising number of ecological models, both of these quantities are dimensionless. For example, in the case of mutations, the $\mu$ parameter is the rate of mutation per offspring; but in, for example, the predator-prey Lotka-Volterra model, the number of offspring of a predator $y$ consuming a prey $x$ at rate $\delta$ is not simply $\delta\times x\times y$ (or rather it is, but this is not a quantity expressed in number of individuals). This can become problematic when we need to be relatively confident that the tempo of evolutionary changes matches the tempo of population dynamics. But this is a story [for another day][part-2]…

[part-2]: {% link _posts/ %}