Training a neural network on the seeds dataset

Using Flux.jl

I have spent some time in the past two weeks playing with the Flux package, a wonderful solution for machine learning and differentiable programming in Julia. If you want to read a broad overview, the team has written a very clear (and illustrated) blog post. I have been impressed by Flux’s elegance, performance, and overall design philosophy.

Today, I would like to work on the seeds dataset, which is in my opinion of vastly superior version of iris. It has more instances, more features, no connection to eugenics research that I know of, and the reference data file is so poorly formatted that using it is an exerice in data cleaning in and of itself. The point of today’s exercise is to use Flux to train a neural network, to show how a semi-complex analysis can be done in just about 20 lines of code.

There are a bunch of packages required by this analysis:

using Flux
using CSV
using DataFrames
using Random
using Statistics
using StatsPlots

We can download the seeds dataset from the UCI website. It has a strange issue where the columns are separated by tabulation, but some columns are separated by more than one even though there are no missing values, so the dropmissing function from DataFrames will help get rid of these:

tmp = tempname()
download("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", tmp)
seeds = dropmissing(CSV.read(tmp; header=0, delim='\t'));

Let’s rename the columns according to what they actually measure:

rename!(seeds,
  (:Column1 => :area, :Column2 => :perimeter,
  :Column3 => :compactness, :Column4 => :kernel_length,
  :Column5 => :kernel_width, :Column6 => :asymmetry,
  :Column7 => :kernel_groove, :Column8 => :cultivar)
  )

The cultivar column is what we want to predict. In the original dataset, it is coded as a discete value, identifying the cultivar by 1, 2, or 3. Because a sample can only have a single cultivar, this column should be “one-hot encoded”, i.e. transformed into three columns, where a sample will have as single non-zero value, in column $i$, representing that it belongs to the $i$-th cultivar.

Flux allows to do this with onehot (for single samples), and onehotbatch for multiple samples:

Flux.onehotbatch(seeds[:cultivar], unique(seeds[:cultivar]))
3×199 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
  true   true   true   true   true  …  false  false  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false  false  false  false  false      true   true   true   true   true

Note the matrix dimensions: 3 rows (for cultivars), and 199 columns (for the samples). We may think of data as row-based, but Julia stores matrices column-wise, so this format gives the most performance. Flux is very performant, and the code in this post is not at all optimal. The online documentation has an entire chapter about optimization, which should be mandatory reading.

I will use a simple leave-p-out cross-fold validation, which is to say that I will keep a fraction of the dataset for testing, and the rest for training.

Random.seed!(42)
n_train = convert(Int64, round(0.7*size(seeds, 1); digits=0))
training_indices = unique(rand(1:size(seeds, 1), n_train))
testing_indices = filter(x -> !(x in training_indices), 1:size(seeds,1))

This gives a total of 105 samples for the training step, and 94 samples for testing. With this information, we can build our data pile for the actual fun to begin:

trn_features = convert(Array, seeds[training_indices, 1:(end-1)])'
tst_features = convert(Array, seeds[testing_indices, 1:(end-1)])'
trn_cultivar = seeds[training_indices, end]
tst_cultivar = seeds[testing_indices, end]
trn_labels = Flux.onehotbatch(trn_cultivar, sort(unique(trn_cultivar)))
tst_labels = Flux.onehotbatch(tst_cultivar, sort(unique(tst_cultivar)))

As an aside, sorting the labels for one-hot batch encoding is very important – it ensures that the two labels matrices have the same “frame of reference” for the order of cultivars. If you do not sort the labels, you will have some trouble.

Now, let’s define the model. We have 7 features, and 3 possible outputs, so we can know that our model will take 7 inputs, and give three outputs. Let’s try with the simplest possible version:

one_layer = Chain(Dense(7, 3), softmax)
Chain(Dense(7, 3), NNlib.softmax)

That’s it. This model will have a fully connected layer, and the result will be given by the identity of the output node with the highest score (this is what softmax does).

Our network is currently untrained, so its performance should be pretty bad. How bad? Let’s figure it out.

untrained_prediction = Flux.onecold(one_layer(tst_features))
untrained_accuracy = mean(untrained_prediction .== Flux.onecold(tst_labels))
0.40425531914893614

Not great! Now, let’s train the model. We will use a gradient descent optimiser, and set a lowish learning rate:

optimizer = Descent(0.01)
Flux.Optimise.Descent(0.01)

We also need a loss function, and we will use cross entropy:

loss(x, y) = Flux.crossentropy(one_layer(x), y)
loss (generic function with 1 method)

Notice that the loss function refers to one_layer, which is the actual model we want to train. Training the model in Flux is now very simple:

Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)

The [(trn_features, trn_labels)] bit is a representation of the data we use for training. As it stands, we are asking for a single epoch of training, so we don’t expect a massive change in accuracy:

epoch_1_prediction = Flux.onecold(one_layer(tst_features))
epoch_1_accuracy = mean(epoch_1_prediction .== Flux.onecold(tst_labels))
0.44680851063829785

Let’s repeat this a little bit longer. Flux has support for data iterators, but we will go the quick and easy route of using a loop:

e = Int64[]
loss_on_train = Float32[]
loss_on_test = Float32[]
for epoch in 1:2000
  Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)
  push!(e, epoch)
  push!(loss_on_test, loss(tst_features, tst_labels).data)
  push!(loss_on_train, loss(trn_features, trn_labels).data)
end

Of course, this is just about the worse way (in terms of performance) to train this model – but this will do for an explanation. We can plot the loss on the training and testing dataset over training:

plot(e, loss_on_train, lab="Training", c=:black, lw=2);
plot!(e, loss_on_test, lab="Testing", c=:teal, ls=:dot);
yaxis!("Loss", :log);
xaxis!("Training epoch");

Loss function over training epochs for the training and testing dataset. We are decreasing the loss in both datasets, without reaching the point where the loss in the testing dataset would increase again.

We can now look at the accuracy of the trained model, i.e. the proportion of correctly identified cases:

mean(Flux.onecold(one_layer(trn_features)) .== Flux.onecold(trn_labels))
0.8952380952380953

Not bad at all! The next step would be to look at the confusion table:

function confusion_matrix(ft, lb)
  plb = Flux.onehotbatch(Flux.onecold(one_layer(ft)), 1:3)
  lb * plb'
end
confusion_matrix(tst_features, tst_labels)
3×3 Array{Int64,2}:
 27   1   2
  1  34   0
  0   0  29

As expected after training, we see that most of the values fall on the diagonal: our model is really good at classifying cultivars based on their morphometrics.

Now what if we wanted to use a more complex model? Let’s say that we want to add one hidden layer, with 14 nodes, and add a sigmoid activation to the input layer (just because). The required steps are to specificy the new model architecture:

hidden_size = 14
model = Chain(
  Dense(7, hidden_size, σ),
  Dense(hidden_size, 3),
  softmax
  )
Chain(Dense(7, 14, NNlib.σ), Dense(14, 3), NNlib.softmax)

Then we need to define the loss function again (because it makes reference to a specific model):

v2_loss(x, y) = Flux.crossentropy(model(x), y)
v2_loss (generic function with 1 method)

Rather than looping, we will use a data iterator to handle the training epochs:

data = Iterators.repeated((trn_features, trn_labels), 2000)

And finally, we can perform the entire training in a single line, because every element in data will represent one epoch:

Flux.train!(v2_loss, params(model), data, optimizer)

Our revised accuracy with this model is:

mean(Flux.onecold(model(trn_features)) .== Flux.onecold(trn_labels))
0.7714285714285715

And the new confusion matrix is given by:

function v2_confusion_matrix(ft, lb)
  plb = Flux.onehotbatch(Flux.onecold(model(ft)), 1:3)
  lb * plb'
end
v2_confusion_matrix(tst_features, tst_labels)
3×3 Array{Int64,2}:
 17  10   3
  8  27   0
  0   0  29

This is worse than the previous model (because, well, we don’t really need to move past the previous model seeing how well-performing it is), but serves at least as an illustration of how more complex models are built.

To summarize; Flux is a very expressive way of specifying models in Julia, can deliver very good performance, and code that is easy to write, read, and maintain. It’s an impressive piece of software, and I encourage you to check it out.