# Training a neural network on the seeds dataset

Using Flux.jlI have spent some time in the past two weeks playing with the `Flux`

package, a wonderful solution for machine
learning and differentiable programming in Julia. If you want to read a broad
overview, the team has written a very clear (and illustrated) blog
post. I have been impressed by
`Flux`

’s elegance, performance, and overall design philosophy.

Today, I would like to work on the
`seeds`

dataset, which is in my
opinion of vastly superior version of `iris`

. It has more instances, more
features, no connection to eugenics research that I know of, and the reference
data file is so poorly formatted that using it is an exerice in data cleaning in
and of itself. The point of today’s exercise is to use `Flux`

to train a neural
network, to show how a semi-complex analysis can be done in just about 20 lines
of code.

There are a bunch of packages required by this analysis:

```
using Flux
using CSV
using DataFrames
using Random
using Statistics
using StatsPlots
```

We can download the `seeds`

dataset from the UCI website. It has a strange issue
where the columns are separated by tabulation, but some columns are separated by
more than one even though there are no missing values, so the `dropmissing`

function from `DataFrames`

will help get rid of these:

```
tmp = tempname()
download("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", tmp)
seeds = dropmissing(CSV.read(tmp; header=0, delim='\t'));
```

Let’s rename the columns according to what they actually measure:

```
rename!(seeds,
(:Column1 => :area, :Column2 => :perimeter,
:Column3 => :compactness, :Column4 => :kernel_length,
:Column5 => :kernel_width, :Column6 => :asymmetry,
:Column7 => :kernel_groove, :Column8 => :cultivar)
)
```

The `cultivar`

column is what we want to predict. In the original dataset, it is
coded as a discete value, identifying the cultivar by 1, 2, or 3. Because a
sample can only have a single cultivar, this column should be “one-hot encoded”,
*i.e.* transformed into three columns, where a sample will have as single
non-zero value, in column $i$, representing that it belongs to the $i$-th
cultivar.

`Flux`

allows to do this with `onehot`

(for single samples), and `onehotbatch`

for multiple samples:

```
Flux.onehotbatch(seeds[:cultivar], unique(seeds[:cultivar]))
```

```
3×199 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
true true true true true … false false false false false
false false false false false false false false false false
false false false false false true true true true true
```

Note the matrix dimensions: 3 rows (for cultivars), and 199 columns (for the
samples). We may think of data as row-based, but Julia stores matrices
column-wise, so this format gives the most performance. `Flux`

is *very*
performant, and the code in this post is not at all optimal. The online
documentation has an entire chapter about optimization, which should be
mandatory reading.

I will use a simple leave-p-out cross-fold validation, which is to say that I
will keep a fraction of the dataset for *testing*, and the rest for *training*.

```
Random.seed!(42)
n_train = convert(Int64, round(0.7*size(seeds, 1); digits=0))
training_indices = unique(rand(1:size(seeds, 1), n_train))
testing_indices = filter(x -> !(x in training_indices), 1:size(seeds,1))
```

This gives a total of 105 samples for the training step, and 94 samples for testing. With this information, we can build our data pile for the actual fun to begin:

```
trn_features = convert(Array, seeds[training_indices, 1:(end-1)])'
tst_features = convert(Array, seeds[testing_indices, 1:(end-1)])'
trn_cultivar = seeds[training_indices, end]
tst_cultivar = seeds[testing_indices, end]
trn_labels = Flux.onehotbatch(trn_cultivar, sort(unique(trn_cultivar)))
tst_labels = Flux.onehotbatch(tst_cultivar, sort(unique(tst_cultivar)))
```

As an aside, `sort`

ing the labels for one-hot batch encoding is *very important*
– it ensures that the two labels matrices have the same “frame of reference”
for the order of cultivars. If you do not sort the labels, you **will** have
some trouble.

Now, let’s define the model. We have 7 features, and 3 possible outputs, so we can know that our model will take 7 inputs, and give three outputs. Let’s try with the simplest possible version:

```
one_layer = Chain(Dense(7, 3), softmax)
```

```
Chain(Dense(7, 3), NNlib.softmax)
```

That’s it. This model will have a fully connected layer, and the result will be
given by the identity of the output node with the highest score (this is what
`softmax`

does).

Our network is currently untrained, so its performance should be pretty bad. How bad? Let’s figure it out.

```
untrained_prediction = Flux.onecold(one_layer(tst_features))
untrained_accuracy = mean(untrained_prediction .== Flux.onecold(tst_labels))
```

```
0.40425531914893614
```

Not great! Now, let’s train the model. We will use a gradient descent optimiser, and set a lowish learning rate:

```
optimizer = Descent(0.01)
```

```
Flux.Optimise.Descent(0.01)
```

We also need a loss function, and we will use cross entropy:

```
loss(x, y) = Flux.crossentropy(one_layer(x), y)
```

```
loss (generic function with 1 method)
```

Notice that the `loss`

function refers to `one_layer`

, which is the actual model
we want to train. Training the model in `Flux`

is now very simple:

```
Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)
```

The `[(trn_features, trn_labels)]`

bit is a representation of the data we use
for training. As it stands, we are asking for a single epoch of training, so we
don’t expect a massive change in accuracy:

```
epoch_1_prediction = Flux.onecold(one_layer(tst_features))
epoch_1_accuracy = mean(epoch_1_prediction .== Flux.onecold(tst_labels))
```

```
0.44680851063829785
```

Let’s repeat this a little bit longer. `Flux`

has support for data iterators,
but we will go the quick and easy route of using a loop:

```
e = Int64[]
loss_on_train = Float32[]
loss_on_test = Float32[]
for epoch in 1:2000
Flux.train!(loss, params(one_layer), [(trn_features, trn_labels)], optimizer)
push!(e, epoch)
push!(loss_on_test, loss(tst_features, tst_labels).data)
push!(loss_on_train, loss(trn_features, trn_labels).data)
end
```

Of course, this is just about the worse way (in terms of performance) to train this model – but this will do for an explanation. We can plot the loss on the training and testing dataset over training:

```
plot(e, loss_on_train, lab="Training", c=:black, lw=2);
plot!(e, loss_on_test, lab="Testing", c=:teal, ls=:dot);
yaxis!("Loss", :log);
xaxis!("Training epoch");
```

We can now look at the accuracy of the trained model, *i.e.* the proportion of
correctly identified cases:

```
mean(Flux.onecold(one_layer(trn_features)) .== Flux.onecold(trn_labels))
```

```
0.8952380952380953
```

Not bad at all! The next step would be to look at the confusion table:

```
function confusion_matrix(ft, lb)
plb = Flux.onehotbatch(Flux.onecold(one_layer(ft)), 1:3)
lb * plb'
end
confusion_matrix(tst_features, tst_labels)
```

```
3×3 Array{Int64,2}:
27 1 2
1 34 0
0 0 29
```

As expected after training, we see that most of the values fall on the diagonal:
our model is *really good* at classifying cultivars based on their
morphometrics.

Now what if we wanted to use a more complex model? Let’s say that we want to add one hidden layer, with 14 nodes, and add a sigmoid activation to the input layer (just because). The required steps are to specificy the new model architecture:

```
hidden_size = 14
model = Chain(
Dense(7, hidden_size, σ),
Dense(hidden_size, 3),
softmax
)
```

```
Chain(Dense(7, 14, NNlib.σ), Dense(14, 3), NNlib.softmax)
```

Then we need to define the loss function again (because it makes reference to a specific model):

```
v2_loss(x, y) = Flux.crossentropy(model(x), y)
```

```
v2_loss (generic function with 1 method)
```

Rather than looping, we will use a data iterator to handle the training epochs:

```
data = Iterators.repeated((trn_features, trn_labels), 2000)
```

And finally, we can perform the entire training in a single line, because every
element in `data`

will represent one epoch:

```
Flux.train!(v2_loss, params(model), data, optimizer)
```

Our revised accuracy with this model is:

```
mean(Flux.onecold(model(trn_features)) .== Flux.onecold(trn_labels))
```

```
0.7714285714285715
```

And the new confusion matrix is given by:

```
function v2_confusion_matrix(ft, lb)
plb = Flux.onehotbatch(Flux.onecold(model(ft)), 1:3)
lb * plb'
end
v2_confusion_matrix(tst_features, tst_labels)
```

```
3×3 Array{Int64,2}:
17 10 3
8 27 0
0 0 29
```

This is worse than the previous model (because, well, we don’t really need to move past the previous model seeing how well-performing it is), but serves at least as an illustration of how more complex models are built.

To summarize; `Flux`

is a very expressive way of specifying models in Julia, can
deliver very good performance, and code that is easy to write, read, and
maintain. It’s an impressive piece of software, and I encourage you to check it
out.