# Naive Bayesian Classification of parasite data

Classification is the task where, knowing some values or measurements of a thing, we decide which category we should put this thing into. People that are more serious about their nomenclature than I am would talk about labels, instances, and features. Since I am updating my lecture notes on the technique, I will share an explanation and an example using parasite data.

During my Master’s, I worked on the morphometry of fish parasites from the Monopisthocotylea subclass. The way to identify most of these species is to put them into the microscope, look very closely as the few solid parts of the hooks (used to attached to the gills or skin) and reproductive organs, and in some cases, take measures of the solid parts to compare against reference measurements.

This is what we can call a
“classification” task. Given information about a sample (a parasite), we want to
find the category (the species) to which it belongs. Naive Bayesian classifiers
(NBC, which I’ll use both for the classifiers themselves and for the method) use
some assumptions to simplify the data structure to provide a prediction about
the most likely class. The name of the method gives all the information we need.
It is naive (but it’s ok, because data are unreasonably effective @HaleNorv09
anyways), because it makes simplications and assumptions. It is Bayesian (but
not *that* much Bayesian), because it relies on conditional probabilities. And
it does classification. **Let’s dig in**.

In the case of assigning parasites to their species based on measurements, what
we want is the probability that the individual (our *problem instance*) belongs
to each species (the *labels* in our classification problem), given the
measurements (the *features*) we know about. Or in other words, $p(C_k|x_1,
\dots, x_n)$., where $C_k$ is a species, and $x_1, \dots, x_n$ are the
measurements for the sample. **This is where the NBC becomes Bayesian**: we can
rewrite this using Bayes' theorem,

$$p(C_k|\mathbf{x})=\frac{p(C_k)p(\mathbf{x}|C_k)}{p(\mathbf{x})} ,.$$

If we unpack this formula, $p(C_k)$ is the prior probability of class $k$, and
$p(\textbf{x}|C_k)$ is the likelihood. Let’s talk about $p(\textbf{x})$ for a
minute. This terms represents the probability of data. Because we know the
values of the features we have measured, this term will be a constant for all
classes. And because we are interested in the *most likely* class, we can remove
it from the expression. We end up with

$$p(C_k|\mathbf{x}) \propto p(C_k)p(\mathbf{x}|C_k),.$$

At this point, **NBC becomes Naive**: we will assume the conditional
independance of all features. Of course nothing in biology is conditionally
independant, and especially not morphomological features. But we can assume that
they are, and see where this leads us. If all features are independant, then we
can rewrite $p(\textbf{x}|C_k)$ as the much easier to deal with $\prod_i
p(x_i|C_k)$. We plug back this expressing in the previous formula, and end up
with

$$p(C_k|\mathbf{x}) \propto p(C_k)\prod_i p(x_i|C_k),.$$

The final step is to apply the *maximum a posteriori* decision rule, which is a
fancy way of saying that we take the class with the higher probability. Because
we got rid of $p(\mathbf{x})$, this is only *proportional to* the actual
probability. In any cas, we assign this sample to the class $\hat y$, where

$$\hat y = \text{argmax}\left(p(C_k)\prod_i p(x_i|C_k)\right) ,.$$

I will now use the data from *Lamellodiscus* on sparid fishes in the
Banyuls-sur-Mer bay @PoisDesd10 to showcase how NBC works. These data describe the
lengths between landmarks on the hooks and copulatory organs of gill parasites.
We will attempt to retrieve host and the parasite for a smaple, knowing only its
measurements. The code is available at the end of the post if you are
curious.

In this dataset, the number of parasites retrieved from each host varies, so we
can use the frequency of the host-parasite association as a value for $p(C_k)$:
if about one third of our datapoints are *Lamellodiscus elegans* from the gills
of *Diplodus vulgaris*, then it may not be an unreasonable assumption to think
that unknown samples will have a high probability of also belonging to this
class (of course we may also be sampling my relative ability to spearfish these
species):

Host | Parasite | $n$ |
---|---|---|

D. vulgaris |
L. elegans |
33 |

D. vulgaris |
L. kechemirae |
16 |

D. sargus |
L. ignoratus |
15 |

D. sargus |
L. ergensis |
12 |

D. vulgaris |
L. elegans |
10 |

L. mormyrus |
L. ignoratus |
10 |

When we pick an unkown sample, the next step is to calculate $p(x_1, \dots, x_n|C_k)$, which is the products of the $p(x_i | C_k)$. For the sake of the illustration, we will assume that all values are normally distributed within one host-parasite association, and so the mean and standard deviation are sufficient. Assuming a normal distribution, we get the probability that a value $x$ is drawn from the distribution using its probability density function:

$$p(x|C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\text{exp}\left[-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right],.$$

Once this is done, we can multiply the values for every measurement, then
multiply by the combination relative frequency. After doing this for all
combinations, we pick the one with the highest output. **How did we do**?

This heatmap represents the predictions (rows) versus the real value (columns). The shade of blue represents the number of predicted/observed events. In a perfect world, that is in a world where our classifierd would never make mistakes, the diagonal would be all blue, and the rest would be all white. In this example, about 55% of the instances were correctly classified.

Believe it or not, this is actually not too bad. For one thing, these
morphometric features are a little fuzzy between species [@PoisVern11]. This can
be due to both strong adaptive pressure between hosts, as well as growth and
phenotypic plasticity. In any case, NBC is easy to implement. In datasets like
*Iris*, this gives crazy good results.

**Want to try**? This section is a (terse) walk through the code. The data can
be downloaded from figshare:

```
const url = "http://files.figshare.com/143154/lamellodiscus.txt"
temp_file = download(url)
data_array = readdlm(temp_file, '\t', '\r')
writedlm("lamellodiscus.csv", data_array)
```

The next step is to load the usual group of packages:

```
using DataFrames, CSV, Query
using Distributions
using Plots, StatPlots
```

Then we can import the data in a `DataFrame`

.

```
# Read the file as a DataFrame using missing values
data_array = readdlm("lamellodiscus.csv")
ctypes = Any[Union{Missing,Float64} for i in 1:size(data_array,2)]
ctypes[1:3] = fill(String, 3)
data = CSV.read("lamellodiscus.csv"; header=1, delim="\t", types=ctypes, null="")
# We can remove the sample unique ID
delete!(data, :para)
```

Before starting, we will keep only the parasites with all known features:

```
compl = data[completecases(data),:]
```

And keep only the measurements for individuals belonging to host/parasite associations with more than 10 records.

```
# This line counts the number of host/parasite combinations
cases = by(compl, [:sphote, :sppar], df -> DataFrame(n=size(df, 1)))
# We keep the combinations with more than ten cases
retained = @from i in cases begin
@where i.n >= 10
@select {combination=i.sphote*i.sppar, i.n}
@collect DataFrame
end
# We select the lines from the combinations retained at the previous step
dataset = @from i in compl begin
@where i.sphote*i.sppar in retained[:combination]
@select i
@collect DataFrame
end
```

At this point, what we need to apply NBC is to estimate the distribution of
every morphological feature, for every host/parasite association. We will assume
that these data are normally distributed. We could store the mean and standard
deviation, but this is not really necessary: `julia`

’s `DataFrames`

can store
object of any types, so we will store the distributions themselves:

```
function build_distribution_table(ds)
melted = melt(ds)
distribs = by(melted, [:sphote, :sppar, :variable],
df -> DataFrame(d=Normal(mean(df[:value]),std(df[:value])))
)
out = unstack(distribs, :variable, :d)
return out
end
```

With this code written, we will do a simple leave one out experiment:

```
function leave_one_out(idx)
train = dataset[find(x -> x != idx, 1:size(dataset, 1)),:]
test = dataset[idx,:]
train_d = build_distribution_table(train);
features = melt(test)
delete!(features, [:sphote, :sppar])
prediction = join(features, train_d; on=[:variable])
prediction[:combination] = prediction[:sphote].*prediction[:sppar]
prediction = join(prediction, retained; on=[:combination])
prediction[:p_x] = pdf.(prediction[:d], prediction[:value])
prediction[:p_c] = prediction[:n]./sum(prediction[:n])
prediction = by(prediction, [:sphote, :sppar], df -> DataFrame(
px = prod(df[:p_x]),
pc = unique(df[:p_c]),
pxpc = unique(df[:p_c])*prod(df[:p_x])
))
return prediction
end
```