Explaining algorithms is not easy

Last month, HBR published an article whose central point is that making algorithms more transparent may eventually “backfire”. The example they use is interesting: a professor used a method to correct for the statistical effects of having several TAs grade the same assignment, explained it to the students, and witnessed what was apparently a small-scale riot. From there, the authors jump to the conclusion that transparency of algorithmic processes is not necessarily desirable. This conclusion follows only if you are willing to conflate two different ways in which models can be made transparent.
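The article does not spell out the professor's exact method, so the sketch below is purely an assumption for illustration: a common correction for grader severity is to standardize each TA's grades to z-scores, then rescale them to a shared mean and spread.

```python
# Hypothetical grader-severity correction (NOT the method from the
# HBR article, which is unspecified): standardize each TA's grades,
# then map them onto a common mean and standard deviation.
from statistics import mean, stdev

def adjust(grades_by_ta, target_mean=75, target_sd=10):
    """Rescale every TA's grades onto one shared scale."""
    adjusted = {}
    for ta, grades in grades_by_ta.items():
        m, s = mean(grades), stdev(grades)
        adjusted[ta] = [target_mean + target_sd * (g - m) / s for g in grades]
    return adjusted

# TA2 grades the same work more leniently than TA1.
raw = {"TA1": [60, 70, 80], "TA2": [80, 90, 100]}
adjusted = adjust(raw)
# After adjustment, both TAs' grades share the same mean and spread,
# so a student's rank no longer depends on which TA graded them.
```

The irony, of course, is that this is exactly the kind of simple, explainable procedure that should survive being explained.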

I am always impressed by how much students love decision trees. Among the different machine learning techniques I give an overview of during one class, this is the one students use most often in their projects, and the one they seem to grasp most rapidly. This is surprising, because the underlying mathematics are arguably more complex than those of k-means, kNN, or other techniques. But decision trees are transparent in that their output is easy to reason about. If you ask a decision tree to recommend a label based on features, you can follow along at every node to understand how the recommendation was reached. This is the first component of model transparency: the ability to reason about the output.
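This "following along at every node" can be made literal. Here is a minimal sketch (the tree, its features, and its thresholds are all invented for illustration) of a hand-rolled decision tree whose every prediction comes with a readable trace of the tests it made:

```python
# A toy decision tree stored as nested dicts: internal nodes hold a
# feature name and a threshold, leaves hold a label. Walking the tree
# records every comparison, so the recommendation explains itself.

def predict_with_trace(tree, features):
    """Return (label, trace): the prediction and the path taken."""
    trace = []
    node = tree
    while "label" not in node:
        value = features[node["feature"]]
        went_left = value <= node["threshold"]
        trace.append(f"{node['feature']} = {value} "
                     f"{'<=' if went_left else '>'} {node['threshold']}")
        node = node["left"] if went_left else node["right"]
    return node["label"], trace

# An invented tree deciding whether to water a plant.
tree = {
    "feature": "soil_moisture", "threshold": 0.3,
    "left": {"label": "water"},
    "right": {
        "feature": "temperature", "threshold": 30,
        "left": {"label": "wait"},
        "right": {"label": "water"},
    },
}

label, trace = predict_with_trace(tree, {"soil_moisture": 0.5, "temperature": 35})
# Every line of `trace` can be read back to a student.
```

Each entry in the trace is a human-readable sentence about one feature; this is precisely the property that makes the output easy to reason about.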

An interesting counterpoint is neural networks (especially the very basic ones, with a single hidden layer). If you give me 45 minutes to get you up to speed on partial derivatives and matrix multiplication, I can probably do an adequate job of conveying how they work. Give me a full hour and I will instead tell you to watch Grant Sanderson’s four-part series on the topic, because it is a work of art. But even with this understanding, the output of even the simplest neural network is very difficult to think about. This is the second component of model transparency: the ability to reason about the inner workings leading to the output.
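The contrast is easy to make concrete. The forward pass of a one-hidden-layer network is nothing but weighted sums and a squashing function, as the sketch below shows (the weights are arbitrary, chosen only for illustration); yet nothing in the numbers it produces tells you *why* the output took that value.

```python
# Forward pass of a one-hidden-layer network, in plain Python.
# The machinery is transparent: two matrix-vector products and a
# sigmoid. The weights below are arbitrary illustration values.
import math

def forward(x, W1, b1, W2, b2):
    """inputs -> sigmoid hidden layer -> linear output layer."""
    hidden = [
        1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(W1, b1)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(W2, b2)
    ]

# 2 inputs, 3 hidden units, 1 output (all weights made up).
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
b2 = [0.05]

y = forward([1.0, 2.0], W1, b1, W2, b2)
# The arithmetic is fully auditable; the meaning of y is not.
```

Every operation here can be checked by hand, but the hidden activations have no interpretation a student could follow the way they follow a decision tree's path.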

The boundary between the two components is of course very fuzzy; but being transparent in one way does not guarantee transparency in the other. The kNN algorithm is crystal clear in both its inner workings and its output. Decision trees have transparent outputs but opaque internals. Simple neural networks have opaque outputs but transparent machinery to produce them.
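kNN earns its "crystal clear" status because the whole algorithm fits in a few lines, and any single prediction can be inspected directly. A minimal sketch (the points and labels are made up):

```python
# Minimal k-nearest-neighbours: compute distances, keep the k
# closest, take a majority vote. Both the mechanism and any one
# prediction are trivially inspectable.
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Majority label among the k points closest to `query`."""
    distances = sorted(
        (sum((p - q) ** 2 for p, q in zip(point, query)), label)
        for point, label in zip(points, labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Two invented clusters: "a" near the origin, "b" near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
pred = knn_predict(points, labels, (0.5, 0.5), k=3)
# The three nearest points are all labelled "a", so pred is "a".
```

There is no step here that cannot be explained with a ruler and a show of hands, which is why kNN sits in the transparent corner on both axes.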

One of these components (is the output easy to reason about?) is essentially immovable. All we can change is the transparency of the other one: how does it work, internally? My favorite approach (with students) is chalk, but carefully chosen words, or better yet, well-commented code, can make a complex routine limpid. Which brings me back to the original point, of students complaining about an algorithm, simple as it is, being involved in their grading: the best way to make an algorithm opaque is to do a bad job of explaining it (not that I am implying this was the case here).

When using algorithms, we should always strive for as much transparency as we can achieve, precisely because the epistemic opacity of some methods is probably an unbreakable constraint. If increased transparency results in opposition (either in the form of refusing the outcome, or doubting the suitability of the method), this is information to use. In ecology, machine learning techniques will become increasingly important, especially in decision making. Continuing to describe them as black boxes that solve all problems is peddling snake oil. Demanding that they remain black boxes in which no one should look, lest some people start questioning how they work, is irresponsible. The best way to make sure future biologists will be able to peer into these algorithms is to train them to understand how they work.