Naive Bayesian Classification of parasite data

Classification is the task where, knowing some values or measurements of a thing, we decide which category we should put this thing into. People that are more serious about their nomenclature than I am would talk about labels, instances, and features. Since I am updating my lecture notes on the technique, I will share an explanation and an example using parasite data.

During my Master’s, I worked on the morphometry of fish parasites from the Monopisthocotylea subclass. The way to identify most of these species is to put them into the microscope, look very closely as the few solid parts of the hooks (used to attached to the gills or skin) and reproductive organs, and in some cases, take measures of the solid parts to compare against reference measurements.

Dactylogyrus This is what we can call a “classification” task. Given information about a sample (a parasite), we want to find the category (the species) to which it belongs. Naive Bayesian classifiers (NBC, which I’ll use both for the classifiers themselves and for the method) use some assumptions to simplify the data structure to provide a prediction about the most likely class. The name of the method gives all the information we need. It is naive (but it’s ok, because data are unreasonably effective @HaleNorv09 anyways), because it makes simplications and assumptions. It is Bayesian (but not that much Bayesian), because it relies on conditional probabilities. And it does classification. Let’s dig in.

In the case of assigning parasites to their species based on measurements, what we want is the probability that the individual (our problem instance) belongs to each species (the labels in our classification problem), given the measurements (the features) we know about. Or in other words, $p(C_k|x_1, \dots, x_n)$., where $C_k$ is a species, and $x_1, \dots, x_n$ are the measurements for the sample. This is where the NBC becomes Bayesian: we can rewrite this using Bayes' theorem,

$$p(C_k|\mathbf{x})=\frac{p(C_k)p(\mathbf{x}|C_k)}{p(\mathbf{x})} ,.$$

If we unpack this formula, $p(C_k)$ is the prior probability of class $k$, and $p(\textbf{x}|C_k)$ is the likelihood. Let’s talk about $p(\textbf{x})$ for a minute. This terms represents the probability of data. Because we know the values of the features we have measured, this term will be a constant for all classes. And because we are interested in the most likely class, we can remove it from the expression. We end up with

$$p(C_k|\mathbf{x}) \propto p(C_k)p(\mathbf{x}|C_k),.$$

At this point, NBC becomes Naive: we will assume the conditional independance of all features. Of course nothing in biology is conditionally independant, and especially not morphomological features. But we can assume that they are, and see where this leads us. If all features are independant, then we can rewrite $p(\textbf{x}|C_k)$ as the much easier to deal with $\prod_i p(x_i|C_k)$. We plug back this expressing in the previous formula, and end up with

$$p(C_k|\mathbf{x}) \propto p(C_k)\prod_i p(x_i|C_k),.$$

The final step is to apply the maximum a posteriori decision rule, which is a fancy way of saying that we take the class with the higher probability. Because we got rid of $p(\mathbf{x})$, this is only proportional to the actual probability. In any cas, we assign this sample to the class $\hat y$, where

$$\hat y = \text{argmax}\left(p(C_k)\prod_i p(x_i|C_k)\right) ,.$$

I will now use the data from Lamellodiscus on sparid fishes in the Banyuls-sur-Mer bay @PoisDesd10 to showcase how NBC works. These data describe the lengths between landmarks on the hooks and copulatory organs of gill parasites. We will attempt to retrieve host and the parasite for a smaple, knowing only its measurements. The code is available at the end of the post if you are curious.

In this dataset, the number of parasites retrieved from each host varies, so we can use the frequency of the host-parasite association as a value for $p(C_k)$: if about one third of our datapoints are Lamellodiscus elegans from the gills of Diplodus vulgaris, then it may not be an unreasonable assumption to think that unknown samples will have a high probability of also belonging to this class (of course we may also be sampling my relative ability to spearfish these species):

Host Parasite $n$
D. vulgaris L. elegans 33
D. vulgaris L. kechemirae 16
D. sargus L. ignoratus 15
D. sargus L. ergensis 12
D. vulgaris L. elegans 10
L. mormyrus L. ignoratus 10

When we pick an unkown sample, the next step is to calculate $p(x_1, \dots, x_n|C_k)$, which is the products of the $p(x_i | C_k)$. For the sake of the illustration, we will assume that all values are normally distributed within one host-parasite association, and so the mean and standard deviation are sufficient. Assuming a normal distribution, we get the probability that a value $x$ is drawn from the distribution using its probability density function:

$$p(x|C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\text{exp}\left[-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right],.$$

Once this is done, we can multiply the values for every measurement, then multiply by the combination relative frequency. After doing this for all combinations, we pick the one with the highest output. How did we do?

Confusion matrix

This heatmap represents the predictions (rows) versus the real value (columns). The shade of blue represents the number of predicted/observed events. In a perfect world, that is in a world where our classifierd would never make mistakes, the diagonal would be all blue, and the rest would be all white. In this example, about 55% of the instances were correctly classified.

Believe it or not, this is actually not too bad. For one thing, these morphometric features are a little fuzzy between species [@PoisVern11]. This can be due to both strong adaptive pressure between hosts, as well as growth and phenotypic plasticity. In any case, NBC is easy to implement. In datasets like Iris, this gives crazy good results.

Want to try? This section is a (terse) walk through the code. The data can be downloaded from figshare:

const url = ""
temp_file = download(url)
data_array = readdlm(temp_file, '\t', '\r')
writedlm("lamellodiscus.csv", data_array)

The next step is to load the usual group of packages:

using DataFrames, CSV, Query
using Distributions
using Plots, StatPlots

Then we can import the data in a DataFrame.

# Read the file as a DataFrame using missing values
data_array = readdlm("lamellodiscus.csv")
ctypes = Any[Union{Missing,Float64} for i in 1:size(data_array,2)]
ctypes[1:3] = fill(String, 3)
data ="lamellodiscus.csv"; header=1, delim="\t", types=ctypes, null="")

# We can remove the sample unique ID
delete!(data, :para)

Before starting, we will keep only the parasites with all known features:

compl = data[completecases(data),:]

And keep only the measurements for individuals belonging to host/parasite associations with more than 10 records.

# This line counts the number of host/parasite combinations
cases = by(compl, [:sphote, :sppar], df -> DataFrame(n=size(df, 1)))

# We keep the combinations with more than ten cases
retained = @from i in cases begin
  @where i.n >= 10
  @select {combination=i.sphote*i.sppar, i.n}
  @collect DataFrame

# We select the lines from the combinations retained at the previous step
dataset = @from i in compl begin
  @where i.sphote*i.sppar in retained[:combination]
  @select i
  @collect DataFrame

At this point, what we need to apply NBC is to estimate the distribution of every morphological feature, for every host/parasite association. We will assume that these data are normally distributed. We could store the mean and standard deviation, but this is not really necessary: julia’s DataFrames can store object of any types, so we will store the distributions themselves:

function build_distribution_table(ds)
  melted = melt(ds)
  distribs = by(melted, [:sphote, :sppar, :variable],
    df -> DataFrame(d=Normal(mean(df[:value]),std(df[:value])))
  out = unstack(distribs, :variable, :d)
  return out

With this code written, we will do a simple leave one out experiment:

function leave_one_out(idx)
    train = dataset[find(x -> x != idx, 1:size(dataset, 1)),:]
    test = dataset[idx,:]
    train_d = build_distribution_table(train);

    features = melt(test)
    delete!(features, [:sphote, :sppar])
    prediction = join(features, train_d; on=[:variable])
    prediction[:combination] = prediction[:sphote].*prediction[:sppar]
    prediction = join(prediction, retained; on=[:combination])
    prediction[:p_x] = pdf.(prediction[:d], prediction[:value])
    prediction[:p_c] = prediction[:n]./sum(prediction[:n])

    prediction = by(prediction, [:sphote, :sppar], df -> DataFrame(
            px = prod(df[:p_x]),
            pc = unique(df[:p_c]),
            pxpc = unique(df[:p_c])*prod(df[:p_x])

    return prediction