Picking the most efficient flower-picking route with simulated annealing

I spent last week working on a new projet, which involved generating networks with fixed structural properties. I experimented with some approaches, but simulated annealing gave by far the best results. I thought it would be a nice opportunity to aggregate some notes and code here. We will see the way temperature decreases to constrain optimization, how to decide on accepting a change of state as a function of temperature, and then an illustration on species occurrence data.

Simulated annealing is one of these methods which look simple on paper, but is a rabbit hole of details when comes the time to get started. So let us start with the most abstract view possible:

The purpose of simulated annealing (SA) is to find a state of our problem which is optimal, with regard to some value or criteria. We want the “score” of our solution to increase, but because there are local optima, we want to allow sub-optimal moves at the beginning. SA uses the notion of temperature as a notion to indicate whether bad moves are possible. We will define “moves” much further down - just think of them as changing in the thing we want to optimize for now; a good move brings us closer to the optimum, a bad move doesn’t The system is initially “hot”, and it cools down as time passes. When the system is cold, only “good” moves are allowed. How rapidly we move from hot to cool is the “schedule” of the system, and this will make or break the opitmization process. This is a key concept in SA, and it is interesting to start here, because I think it makes understanding how the different notions fit together.

What is the cooling schedule of our system? There are a number of possible answers. We need to set an initial temperature (let’s call it $T_0$), and use this to get a value of $T_k$ at every timestep $k$. The simplest schedule (possibly the most widely used) is $T_k = \lambda^k T_0$, where λ is a value very close but inferior to 1 (for slow cooling). Another schedule is the exponential, where $T_k = e^{-\lambda k}\times T_0$. Finally, a much slower cooling schedule is $T_k = T_0/\text{log}(k)$. We can visualize these three, starting from a temperature of $T_0 = 10.0$:

There are interesting trade-offs with the three schedules: inverse-log will end up asymptotically close to 0, but will take approximately forever and a half to get there. Exponential decreases really fast (so we can get stuck in a local optimum), and the common geometric decay is in between.

Temperature is important to measure, because it lets us decide whether to accept bad moves temporarily, or reject them. Keeping the system too hot will allow bad moves even after a long time. Keeping the system very cold will only allow good moves early on, possibly trapping us in a local optimum. The cooling schedule needs to be “just right”, and of course there is no objective definition of what “just right” is.

The usual routine to decide on accepting or rejecting a move is as follows: a “move” is determined by its change in distance to the optimum. When we start, we can measure the initial distance between out system, and the optimum: Δ0. We now visit a “neighboring state” (we change the smallest possible thing in the thing we want to optimize), and we measure the distance again: Δi. If the move brings us closer to the optimum, Δm = Δi - Δ0 is a negative number. A good move should always be accepted, but we would like some bad moves to be accepted, especially early on – i.e. when the temperature is high, moves for which Δi ≥ Δ0 should sometimes be allowed.

This is usually done by deciding on a probability to accept the move:

$$\text{exp}\left(-\frac{\Delta_i - \Delta_0}{T_k}\right)$$

The probability of accepting a bad move will decline when the systems cools down. We can see this on the following figure (note that the x axis is reversed, the system cools downs as we move right):

Neutral or good moves have a “probability” (it’s not quite that) above unity, so will always be accepted. The probability of accepting bad moves plunges sharply when the temperature decreases. Because some values are larger than unity, the actual routine to decide on whether to pick a move is

$$\text{rand}() \leq \text{exp}\left(-\frac{\Delta_i - \Delta_0}{T_k}\right),.$$

We now have an intuition of what the cooling schedule will look like, and a heuristic to decide whether we accept a move. It is time to actually generate these moves. At this point, a concrete example helps immensely. So let’s come up with one. Let’s say we want to go pick some flowers, specifically Iris versicolor. We want to visit the last 300 sites in which they have been observed in Canada most recently, according to GBIF. Every point along the route is a pair of longitude and latitude coordinates, and a route is an ordering of visits to the points.

For example, these are the first five stops along the route.

5-element Array{Tuple{Float64,Float64},1}:
 (-75.8341, 45.0109)
 (-63.9, 44.491)    
 (-95.1744, 49.7422)
 (-79.8823, 43.2945)
 (-77.0985, 44.2844)

The quantity we want to optimize is the distance travelled (yes, this is a botanical version of the travelling salesperson problem) – the distance is measured by going back to where we started.

function distance(route)
  d = 0.0
  for i in 1:(length(route)-1)
    di = route[i].-route[(i+1)]
    d += sqrt(sum(di.^2))
  dend = route[end].-route[1]
  d += sqrt(sum(dend.^2))
  return d
distance (generic function with 1 method)

Because we picked the points at random, we start with a route which is not really optimized:

From the initial state of our route, we can take a neighboring state by picking two stops in the route, and swapping them. This can also be done by swapping two consecutive stops, but in my attempts with this problem, it gave worse results. With this, we can pick a cooling schedule (geometric, λ=0.999…), and a number of steps (10⁶, which runs in about 15 seconds on my laptop).

λ = 0.999999
t0 = 5000.0
Δbest = distance(route)
out = [(Δbest, t0)]
for step in 1:10_000_000
  t = λ^(step-1)*t0
  i, j = rand(1:length(route), 2)
  route[i], route[j] = route[j], route[i]
  Δstate = distance(route)
  Δ = Δstate - Δbest
  P = exp(-Δ/t)
  if (rand() ≤ P)
    Δbest = Δstate
    route[i], route[j] = route[j], route[i]
  push!(out, (Δbest, t))

This gives us the following route:

This is most likely not the best one, but since the goal of this post is to describe the concept broadly, this will suffice. Note that the distance has been divided by about 8. Getting a better route would require to tweak the cooling schedule, the routine to explore neighboring states, etc., which is far beyond the scope of an illustration.

To summarize, simulated annealing is an interesting method to optimize a problem. It gives reasonably good results in a reasonable amount of time, and is an interesting exercise in programming in any way. One thorny problem is to decide how to pick the initial temperature for the system. One solution I started playing with was deciding on an initial probability of accepting a bad state (and we don’t really have a rule to set this either), like P=0.7, and find the initial temperature that would give this value, on average, over 10³ neighbors of the initial state. I also like the fact that, like Naive Bayesian Classifiers, this can be written as about 20 lines of code.