Training a neural network on the seeds dataset
Using Flux.jl

I have spent some time in the past two weeks playing with the Flux package, a wonderful solution for machine learning and differentiable programming in Julia. If you want to read a broad overview, the team has written a very clear (and illustrated) blog post. I have been impressed by Flux’s elegance, performance, and overall design philosophy.
Today, I would like to work on the seeds dataset, which is in my opinion a vastly superior version of iris. It has more instances, more features, no connection to eugenics research that I know of, and the reference data file is so poorly formatted that using it is an exercise in data cleaning in and of itself. The point of today’s exercise is to use Flux to train a neural network, to show how a semi-complex analysis can be done in just about 20 lines of code.
There are a bunch of packages required by this analysis:
using Flux
using CSV
using DataFrames
using Random
using Statistics
using StatsPlots
We can download the seeds dataset from the UCI website. It has a strange issue: the columns are separated by tabs, but some are separated by more than one tab even though there are no missing values, so the dropmissing function from DataFrames will help get rid of the rows that this quirk breaks:
tmp = tempname()
download("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", tmp)
seeds = dropmissing(CSV.read(tmp; header=0, delim='\t'));
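As a quick sanity check (a small addition, not in the original analysis), the cleaned table should contain the 7 morphometric features plus the cultivar column, and 199 complete rows:
size(seeds)   # expected to be (199, 8) with the versions used here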
Let’s rename the columns according to what they actually measure:
rename!(seeds,
(:Column1 => :area, :Column2 => :perimeter,
:Column3 => :compactness, :Column4 => :kernel_length,
:Column5 => :kernel_width, :Column6 => :asymmetry,
:Column7 => :kernel_groove, :Column8 => :cultivar)
)
The cultivar column is what we want to predict. In the original dataset, it is coded as a discrete value, identifying the cultivar as 1, 2, or 3. Because a sample can only have a single cultivar, this column should be “one-hot encoded”, i.e. transformed into three columns, where a sample has a single non-zero value, in column $i$, indicating that it belongs to the $i$-th cultivar.
Flux lets us do this with onehot (for single samples) and onehotbatch (for multiple samples):
Flux.onehotbatch(seeds[:cultivar], unique(seeds[:cultivar]))
3×199 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
true true true true true … false false false false false
false false false false false false false false false false
false false false false false true true true true true
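For a single sample, onehot works the same way; as a quick illustration (not in the original post), encoding the label 2 against the possible labels 1:3 gives a vector with a single true entry:
Flux.onehot(2, 1:3)   # 3-element one-hot vector, true in the second position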
Note the matrix dimensions: 3 rows (for the cultivars), and 199 columns (for the samples). We may think of data as row-based, but Julia stores matrices column-major, so a column-per-sample layout is the most performant. Flux is very performant, and the code in this post is not at all optimal. The online documentation has an entire chapter about optimization, which should be mandatory reading.
I will use a simple holdout validation, which is to say that I will keep a fraction of the dataset for testing, and use the rest for training.
Random.seed!(42)
# Aim for roughly 70% of the samples in the training set
n_train = convert(Int64, round(0.7*size(seeds, 1); digits=0))
# Sampling with replacement and keeping the unique indices, so we end up with fewer than n_train samples
training_indices = unique(rand(1:size(seeds, 1), n_train))
# Everything that is not used for training is kept for testing
testing_indices = filter(x -> !(x in training_indices), 1:size(seeds,1))
This gives a total of 105 samples for the training step, and 94 samples for testing. With this information, we can build our data pile for the actual fun to begin:
# Transpose the features so that every sample is a column (matching the column-major layout discussed above)
trn_features = convert(Array, seeds[training_indices, 1:(end-1)])'
tst_features = convert(Array, seeds[testing_indices, 1:(end-1)])'
trn_cultivar = seeds[training_indices, end]
tst_cultivar = seeds[testing_indices, end]
# One-hot encode the labels, sorted so that both matrices share the same row order
trn_labels = Flux.onehotbatch(trn_cultivar, sort(unique(trn_cultivar)))
tst_labels = Flux.onehotbatch(tst_cultivar, sort(unique(tst_cultivar)))
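A quick look at the dimensions confirms the column-per-sample layout discussed earlier (the exact counts depend on the random split; with the seed used here, there are 105 training samples):
size(trn_features)   # (7, 105): seven features, one column per training sample
size(trn_labels)     # (3, 105): one one-hot cultivar per training sample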
As an aside, sorting the labels for one-hot batch encoding is very important: it ensures that the two label matrices have the same “frame of reference” for the order of cultivars. If you do not sort the labels, you will run into trouble.
Now, let’s define the model. We have 7 features and 3 possible classes, so we know that our model will take 7 inputs and return 3 outputs. Let’s try the simplest possible version:
one_layer = Chain(Dense(7, 3), softmax)
Chain(Dense(7, 3), NNlib.softmax)
That’s it. This model has a single fully connected layer, whose raw scores are turned into probabilities by softmax; the predicted class is the output node with the highest probability.
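As a quick illustration (not part of the original post), pushing a single sample through the untrained chain returns three values that sum to one, since softmax normalizes the raw scores into probabilities:
p = one_layer(tst_features[:, 1])   # three probabilities, one per cultivar
sum(p)                              # ≈ 1 by construction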
Our network is currently untrained, so its performance should be pretty bad. How bad? Let’s figure it out.
untrained_prediction = Flux.onecold(one_layer(tst_features))
untrained_accuracy = mean(untrained_prediction .== Flux.onecold(tst_labels))
0.40425531914893614
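Since we will compute this kind of accuracy a few more times, we could wrap it in a small helper (a hypothetical convenience function, not used in the rest of this post):
# Hypothetical helper: proportion of samples whose predicted cultivar matches the true one
accuracy(m, features, labels) = mean(Flux.onecold(m(features)) .== Flux.onecold(labels))
accuracy(one_layer, tst_features, tst_labels)   # same value as above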
Not great! Now, let’s train the model. We will use a gradient descent optimiser, and set a lowish learning rate:
optimizer = Descent(0.01)
Flux.Optimise.Descent(0.01)
We also need a loss function, and we will use cross entropy:
loss(x, y) = Flux.crossentropy(one_layer(x), y)
loss (generic function with 1 method)
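Before any training happens, we can evaluate this loss once on the test set to get a baseline; the exact value depends on the random initialization of the weights:
loss(tst_features, tst_labels)   # cross entropy of the untrained model; training should drive this down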
Notice that the loss function refers to one_layer, which is the actual model we want to train. Training the model in Flux is now very simple:
Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)
The [(trn_features, trn_labels)] bit is a representation of the data we use for training. As it stands, we are asking for a single epoch of training, so we don’t expect a massive change in accuracy:
epoch_1_prediction = Flux.onecold(one_layer(tst_features))
epoch_1_accuracy = mean(epoch_1_prediction .== Flux.onecold(tst_labels))
0.44680851063829785
Let’s keep training for a while longer. Flux has support for data iterators, but we will go the quick and easy route of using a loop:
e = Int64[]
loss_on_train = Float32[]
loss_on_test = Float32[]
for epoch in 1:2000
    Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)
    push!(e, epoch)
    # Record the loss on both datasets at every epoch (.data extracts the value from the tracked scalar)
    push!(loss_on_test, loss(tst_features, tst_labels).data)
    push!(loss_on_train, loss(trn_features, trn_labels).data)
end
Of course, this is just about the worst way (in terms of performance) to train this model, but it will do for an explanation. We can plot the loss on the training and testing dataset over the course of training:
plot(e, loss_on_train, lab="Training", c=:black, lw=2);
plot!(e, loss_on_test, lab="Testing", c=:teal, ls=:dot);
yaxis!("Loss", :log);
xaxis!("Training epoch");

Loss function over training epochs for the training and testing datasets. The loss decreases in both datasets, without reaching the point where the loss on the testing dataset would increase again.
We can now look at the accuracy of the trained model on the training set, i.e. the proportion of correctly identified cases:
mean(Flux.onecold(one_layer(trn_features)) .== Flux.onecold(trn_labels))
0.8952380952380953
Not bad at all! The next step would be to look at the confusion table:
function confusion_matrix(ft, lb)
    # Predicted labels, one-hot encoded against the three possible cultivars
    plb = Flux.onehotbatch(Flux.onecold(one_layer(ft)), 1:3)
    # Cross-tabulate true labels (rows) against predicted labels (columns)
    lb * plb'
end
confusion_matrix(tst_features, tst_labels)
3×3 Array{Int64,2}:
27 1 2
1 34 0
0 0 29
As expected after training, we see that most of the values fall on the diagonal: our model is really good at classifying cultivars based on their morphometrics.
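From this matrix, we could also read off a per-cultivar recall by dividing the diagonal by the row totals (a small sketch, not in the original post; it uses diag from the LinearAlgebra standard library):
using LinearAlgebra
M = confusion_matrix(tst_features, tst_labels)
# Proportion of samples of each cultivar that were correctly identified
diag(M) ./ vec(sum(M; dims=2))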
Now what if we wanted to use a more complex model? Let’s say that we want to add one hidden layer with 14 nodes, and add a sigmoid activation to the input layer (just because). The first required step is to specify the new model architecture:
hidden_size = 14
model = Chain(
Dense(7, hidden_size, σ),
Dense(hidden_size, 3),
softmax
)
Chain(Dense(7, 14, NNlib.σ), Dense(14, 3), NNlib.softmax)
Then we need to define the loss function again (because it makes reference to a specific model):
v2_loss(x, y) = Flux.crossentropy(model(x), y)
v2_loss (generic function with 1 method)
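If we expect to compare several architectures, one way to avoid redefining the loss every time is a small factory that closes over an arbitrary model (a hypothetical refactor; this post keeps the explicit definition above):
# Hypothetical: return a loss function bound to the model m
loss_for(m) = (x, y) -> Flux.crossentropy(m(x), y)
v2_loss_alt = loss_for(model)   # equivalent to v2_loss above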
Rather than looping, we will use a data iterator to handle the training epochs:
data = Iterators.repeated((trn_features, trn_labels), 2000)
And finally, we can perform the entire training in a single line, because every
element in data
will represent one epoch:
Flux.train!(v2_loss, params(model), data, optimizer)
Our revised accuracy on the training set with this model is:
mean(Flux.onecold(model(trn_features)) .== Flux.onecold(trn_labels))
0.7714285714285715
And the new confusion matrix is given by:
function v2_confusion_matrix(ft, lb)
plb = Flux.onehotbatch(Flux.onecold(model(ft)), 1:3)
lb * plb'
end
v2_confusion_matrix(tst_features, tst_labels)
3×3 Array{Int64,2}:
17 10 3
8 27 0
0 0 29
This is worse than the previous model (not that we really needed to move past the previous model, given how well it performed), but it serves at least as an illustration of how more complex models are built.
To summarize: Flux is a very expressive way of specifying models in Julia, it can deliver very good performance, and the resulting code is easy to write, read, and maintain. It’s an impressive piece of software, and I encourage you to check it out.