SDM with bioclim in three lines of code

Or why API matters

In this post, we will walk through the process of implementing the bioclim species distribution model in Julia, and use it to make a simple prediction based on bioclimatic data. Most of the code is based on SimpleSDMLayers.jl, an in-development package.

Getting the data

To start with, let’s get some data about amphibians in Europe:

# Prepare a folder to store the data
ispath("data") || mkdir("data")
data_folder = joinpath("data", "bioclim-sdm")
ispath(data_folder) || mkdir(data_folder)

# Download the data
_project_root = "https://raw.githubusercontent.com/EcoJulia/SpatialEcology.jl/master/"
amph_file = joinpath(data_folder, "amphibians.csv")
isfile(amph_file) || download(_project_root*"data/amph_Europe.csv", amph_file)

Our next step will be to read these data, and get the occurrences for the first species. We can do the next step with readlines and split and a few converts, but since this is a csv file, let’s load it properly using CSV and DataFrames:

import CSV
using DataFrames
amphibians = CSV.read(amph_file);

We will focus on the first species (Salamandra salamandra), and we specifically want to get the latitudes and longitudes at which it has been observed:

occupied = amphibians[:Salamandra_salamandra] .> 0.0
salamandra = amphibians[occupied,[:Lat, :Long]]
rename!(salamandra, (:Lat => :latitude), (:Long => :longitude))
salamandra[1:10,:]

10 rows × 2 columns

latitudelongitude
Float64Float64
146.517.5
247.517.5
337.524.5
438.524.5
538.523.5
638.522.5
739.523.5
843.516.5
944.514.5
1045.514.5

Note that in this example, we would have data about absences, but let’s disregard this for this illustration. Now, we will need to download some bioclimatic data, so let’s get started. As an illustration, we will get only the temperature and precipitation from the worldclim 2.0 database, which can get downloaded directly using SimpleSDMLayers:

using SimpleSDMLayers
using Plots
assets_path = joinpath("data", "assets")
ispath(assets_path) || mkdir(assets_path)
temperature, precipitation = worldclim([1,12]; path=assets_path)
heatmap(temperature, c=:Oranges, frame=:box, size=(800,400), dpi=120)

Map of temperatures from the worldclim data, with a resolution of 10 arc minutes.

You might notice that this map is way too large compared to our data, so let’s define a bounding box.

adj = 2.0
bbox = (
  (minimum(salamandra.longitude)-adj,maximum(salamandra.longitude)+adj),
  (minimum(salamandra.latitude)-adj,maximum(salamandra.latitude)+adj)
  )
((-11.5, 29.5), (34.5, 55.5))

And now let’s clip our data again:

temperature_europe = temperature[bbox...]
precipitation_europe = precipitation[bbox...]
heatmap(precipitation_europe, c=:Blues, frame=:box, size=(800,400), dpi=120)

And just like this, we have all we need to start making predictions. So let us move to…

Building the model

The bioclim model is really, really straightforward. We need to extract the value of a variable where a species occur, and then transform the quantile into a score, so that the score is between 0 and 1, and a species occuring at the median value will get a score of one. This is done for all variables, and the results are reduced by taking the minimum score across all variables, so that the occurrence score is limited by the most limiting factor.

Let us write a method to extract the values from a DataFrame and a SimpleSDMLayer. To facilitate the work, we will split this into a method for the DataFrame, which will also ensure that NaN are removed from the result, and a method for the DataFrameRow, which we really should modify to check for the presence of the required columns :latitude and :longitude, for example:

function Base.getindex(layer::T, df::D) where {T <: SimpleSDMLayer, D <: DataFrame}
  return filter(!isnan, [layer[row] for row in eachrow(df)])
end

function Base.getindex(layer::T, row::R) where {T <: SimpleSDMLayer, R <: DataFrameRow}
  return layer[row.longitude, row.latitude]
end

histogram(temperature_europe[salamandra], size=(800,400), dpi=120, leg=false)

Distribution of the temperatures in which Salamandra salamandra has been observed in Europe.

From this, we need to get the quantile function, and transform the quantile into a score.

using StatsBase
temperature_qf = ecdf(temperature_europe[salamandra])
score(x) = isnan(x) ? NaN : (x > 0.5 ? 1-x : 2x)
score (generic function with 1 method)

Based on this, we can transform the map of temperatures into a map of prediction:

temperature_prediction = similar(temperature_europe);
temperature_prediction.grid = score.(temperature_qf.(temperature_europe.grid));

One thing we need to do is to replace the positions with NaN originally by NaN, because ecdf does not play nicely with these (for now):

for idx in findall(isnan, temperature_europe.grid)
  temperature_prediction.grid[idx] = NaN
end
heatmap(temperature_prediction, c=:viridis, clim=(0,1), frame=:box, size=(800,400), dpi=120)

Predicted occurrence based on temperature.

And here is a map. We can now write a function to make a prediction based on a layer, which will look a lot like this:

function bioclimpred(layer::T, df::D) where {T <: SimpleSDMLayer, D <: DataFrame}
  pred = similar(layer)
  qf = ecdf(layer[df])
  pred.grid = score.(qf.(layer.grid))
  for idx in findall(isnan, layer.grid)
    pred.grid[idx] = NaN
  end
  return pred
end
bioclimpred (generic function with 1 method)

Note how most of the lines are either to prepare the objects, or to fix the behavior of ecdf. Pretty neat! This is bioclim, in about three lines of relevant code.

Bringing it all together

The final step is to take these predictions over multiple layers, and take the minimum. To make our lives simpler, we will write a min method for pairs of layers:

function Base.min(x::TX, y::TY) where {TX <: SimpleSDMLayer, TY <: SimpleSDMLayer}
  SimpleSDMLayers.are_compatible(x,y)
  m = similar(x)
  for i in eachindex(m)
    if !isnan(m[i])
      m[i] = min(x[i], y[i])
    end
  end
  return m
end

We can now formulate the predictions for temperature and precipitation, and take their minimum:

pred = minimum(
  bioclimpred(temperature_europe, salamandra),
  bioclimpred(precipitation_europe, salamandra)
  )
plot(pred, c=:plasma, frame=:box, clims=(0,1), size=(800,400), dpi=120)

Predicted occurrences based on two bioclimatic variables.

Oh, and it’s rather fast – generating a prediction for the entire world is under 0.2 second, before we even thought about possible optimizations.

Good enough. Now, we will need to download some bioclimatic data, so let’s get started. As an illustration, we will get only the temperature and precipitation from the worldclim 2.0 database:

@time bioclimpred(temperature, salamandra);
0.159803 seconds (2.52 k allocations: 77.434 MiB, 8.71% gc time)

What have we learned?

Sometimes writing a little bit of glue code (all our Base.xxx methods) pays off immensely when creating more complex pipelines. One of my favorite part of Julia is that it allows, and even encourages, this approach, where it is possible to make any new structure mimic the syntax we would have to use with the core of the language. It just flows. This improves the readability of the code, and also the potential for re-use.