I spent last week working on a new project, which involved generating networks with fixed structural properties. I experimented with a few approaches, but simulated annealing gave by far the best results. This seemed like a nice opportunity to aggregate some notes and code here. We will see how temperature decreases to constrain the optimization, how to decide on accepting a change of state as a function of temperature, and then an illustration on species occurrence data.

Simulated annealing is one of these methods that look simple on paper, but turn into a rabbit hole of details when the time comes to get started. So let us start with the most abstract view possible:

The purpose of simulated annealing (SA) is to find a state of our problem which is optimal with regard to some criterion. We want the “score” of our solution to improve, but because there are local optima, we want to allow sub-optimal moves at the beginning. SA uses temperature to indicate whether bad moves are possible. We will define “moves” much further down - just think of them as changes in the thing we want to optimize for now; a good move brings us closer to the optimum, a bad move doesn’t. The system is initially “hot”, and it cools down as time passes. When the system is cold, only “good” moves are allowed. How rapidly we move from hot to cold is the “schedule” of the system, and this will make or break the optimization process. This is a key concept in SA, and it is interesting to start here, because I think it makes it easier to understand how the different notions fit together.

What is the cooling schedule of our system? There are a number of possible answers. We need to set an initial temperature (let’s call it $T_0$), and use it to get a value of $T_k$ at every timestep $k$. The simplest schedule (possibly the most widely used) is the geometric one, $T_k = \lambda^k T_0$, where λ is very close to, but less than, 1 (for slow cooling). Another schedule is the exponential, where $T_k = e^{-\lambda k}\times T_0$. Finally, a much slower cooling schedule is the inverse-log, $T_k = T_0/\text{log}(k)$. We can visualize these three, starting from a temperature of $T_0 = 10.0$:

There are interesting trade-offs with the three schedules: inverse-log will end up asymptotically close to 0, but will take approximately forever and a half to get there. Exponential decreases really fast (so we can get stuck in a local optimum), and the common geometric decay is in between.
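As a sketch, the three schedules are one-liners in Julia (the function names, and the values of λ, are mine, picked for illustration):

```
# Three cooling schedules, each mapping a step k to a temperature
geometric(k; T0=10.0, λ=0.999) = λ^k * T0
exponential(k; T0=10.0, λ=0.01) = exp(-λ * k) * T0
inverselog(k; T0=10.0) = T0 / log(k)

# Temperatures after 1000 steps, starting from T0 = 10.0
geometric(1000), exponential(1000), inverselog(1000)
```

After 1000 steps, the exponential schedule is already close to zero, the inverse-log is still above 1, and the geometric sits in between, which matches the trade-offs described above.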

Temperature is important to track, because it lets us decide whether to accept
bad moves temporarily, or reject them. Keeping the system *too hot* will still
allow bad moves after a long time. Keeping the system very cold will only allow
good moves, even early on, possibly trapping us in a local optimum. The cooling
schedule needs to be “just right”, and of course there is no objective
definition of what “just right” is.

The usual routine to decide on accepting or rejecting a move is as follows: a
“move” is determined by its change in distance to the optimum. When we start, we
can measure the initial distance between our system and the optimum: Δ0. We now
visit a “neighboring state” (we change the smallest possible thing in the thing
we want to optimize), and we measure the distance again: Δi. If the move brings
us closer to the optimum, Δm = Δi - Δ0 is a negative number. A good move should
*always* be accepted, but we would like some bad moves to be accepted,
especially early on – *i.e.* when the temperature is high, moves for which Δi ≥
Δ0 should sometimes be allowed.

This is usually done by deciding on a probability to accept the move:

$$\text{exp}\left(-\frac{\Delta_i - \Delta_0}{T_k}\right)$$

The probability of accepting a bad move will *decline* as the system cools
down. We can see this on the following figure (note that the x axis is
*reversed*, the system cools down as we move right):

Neutral or good moves have a “probability” (it’s not quite that) at or above
unity, so they will always be accepted. The probability of accepting bad moves
plunges sharply when the temperature decreases. Because some values are larger
than unity, the *actual* routine to decide on whether to accept a move is

$$\text{rand}() \leq \text{exp}\left(-\frac{\Delta_i - \Delta_0}{T_k}\right)\,.$$
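Put as code, this acceptance rule is a one-liner (a sketch; `Δi`, `Δ0`, and `Tk` follow the notation above, and the function name is mine):

```
# Accept a move with probability exp(-(Δi - Δ0)/Tk); good moves (Δi ≤ Δ0)
# give a value ≥ 1, so the comparison with rand() always succeeds
accept(Δi, Δ0, Tk) = rand() ≤ exp(-(Δi - Δ0) / Tk)

# A good move is accepted regardless of temperature
accept(1.0, 2.0, 0.001)
```

Note that for a bad move, the same call returns `true` only occasionally, and less and less often as `Tk` shrinks.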

We now have an intuition of what the cooling schedule will look like, and a
heuristic to decide whether we accept a move. It is time to actually generate
these moves. At this point, a concrete example helps *immensely*, so let’s come
up with one. Let’s say we want to go pick some flowers, specifically *Iris
versicolor*. We want to visit the 300 sites in Canada where they have most
recently been observed, according to GBIF. Every point along the route
is a pair of longitude and latitude coordinates, and a route is an ordering of
visits to the points.

For example, these are the first five stops along the route.

```
route[1:5]
```

```
5-element Array{Tuple{Float64,Float64},1}:
(-75.8341, 45.0109)
(-63.9, 44.491)
(-95.1744, 49.7422)
(-79.8823, 43.2945)
(-77.0985, 44.2844)
```

The quantity we want to optimize is the distance travelled (yes, this is a
botanical version of the travelling salesperson problem) – the distance is
measured by going *back to where we started*.

```
function distance(route)
    # Total distance along the route, pairwise between consecutive stops
    d = 0.0
    for i in 1:(length(route)-1)
        di = route[i] .- route[i+1]
        d += sqrt(sum(di.^2))
    end
    # Close the loop: from the last stop back to the first
    dend = route[end] .- route[1]
    d += sqrt(sum(dend.^2))
    return d
end
```

```
distance (generic function with 1 method)
```

Because we picked the points at random, we start with a route which is not really optimized:

From the initial state of our route, we can take a neighboring state by picking
two stops in the route, and swapping them. This can also be done by swapping two
*consecutive* stops, but in my attempts with this problem, it gave worse
results. With this, we can pick a cooling schedule (geometric, λ=0.999…), and
a number of steps (10⁷, which runs in about 15 seconds on my laptop).

```
λ = 0.999999
t0 = 5000.0
Δbest = distance(route)
out = [(Δbest, t0)]
for step in 1:10_000_000
    global Δbest  # needed when running at the top level of a script
    # Geometric cooling schedule
    t = λ^(step-1)*t0
    # Neighboring state: swap two random stops
    i, j = rand(1:length(route), 2)
    route[i], route[j] = route[j], route[i]
    Δstate = distance(route)
    Δ = Δstate - Δbest
    # Accept the move with probability exp(-Δ/t)
    P = exp(-Δ/t)
    if (rand() ≤ P)
        Δbest = Δstate
    else
        # Reject: undo the swap
        route[i], route[j] = route[j], route[i]
    end
    push!(out, (Δbest, t))
end
```

This gives us the following route:

This is most likely not the best route, but since the goal of this post is to describe the concept broadly, it will suffice. Note that the distance has been divided by about 8. Getting a better route would require tweaking the cooling schedule, the routine to explore neighboring states, etc., which is far beyond the scope of an illustration.

**To summarize**, simulated annealing is an interesting method to optimize a
problem. It gives reasonably good results in a reasonable amount of time, and
is an interesting programming exercise in any case. One thorny problem is
deciding how to pick the initial temperature for the system. One solution I
started playing with was to decide on an initial probability of accepting a bad
state (and we don’t really have a rule to set this either), like P=0.7, and
find the initial temperature that would give this value, on average, over 10³
neighbors of the initial state. I also like the fact that, like Naive Bayesian
Classifiers, this can be written in about 20 lines of code.
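That calibration idea could be sketched as follows (everything here is my own assumption: `score` is any scoring function such as `distance`, neighbors are random swaps as above, and the average Δ of the *bad* moves is plugged into $P_0 = e^{-\bar{\Delta}/T_0}$, solved for $T_0$):

```
# Estimate T0 so that the average bad move is initially accepted with
# probability P0, by sampling n random swap neighbors of the starting state
function initial_temperature(score, state; P0=0.7, n=1_000)
    Δ0 = score(state)
    Δbad = Float64[]
    for _ in 1:n
        i, j = rand(1:length(state), 2)
        state[i], state[j] = state[j], state[i]   # visit a neighbor
        Δ = score(state) - Δ0
        state[i], state[j] = state[j], state[i]   # undo the swap
        Δ > 0 && push!(Δbad, Δ)                   # keep only the bad moves
    end
    # Solving P0 = exp(-Δ̄/T0) for T0 gives T0 = -Δ̄ / log(P0)
    return -(sum(Δbad) / length(Δbad)) / log(P0)
end
```

With `initial_temperature(distance, route)`, the returned value could replace the hand-picked `t0 = 5000.0` from the loop above.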