Almost everyone who has been interested in machine learning has had to work with the
iris dataset, and I have been thinking about it more than usual in the weeks
leading up to a very cool data science thing I’m not supposed to talk about yet.
It’s time we stop using iris.
The iris dataset represents four measurements of floral morphology on 150
plants, 50 individuals for each of three species (I. versicolor, I. setosa,
and I. virginica). It was measured near Gaspé, in a strikingly beautiful part
of Québec, at some point in the 1930s. With the exception of one or two points,
the classes are linearly separable, and so classification algorithms reach
almost perfect accuracy.
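To see why linear separability makes classification nearly trivial, without reaching for iris itself, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the two-dimensional points, the margin `gap`, and the function names `make_separable`, `train_perceptron`, and `accuracy` are all hypothetical. It uses a plain perceptron, which is guaranteed to find a separating line when one exists.

```python
import random


def make_separable(n_per_class=50, gap=1.0, seed=0):
    """Build two synthetic 2-D classes lying on either side of the line x + y = 0.

    Every point satisfies label * (x + y) >= gap, so the classes are
    linearly separable by construction.
    """
    rng = random.Random(seed)
    data = []
    for label in (-1, 1):
        for _ in range(n_per_class):
            x = rng.uniform(-3.0, 3.0)
            offset = rng.uniform(gap, 3.0)
            y = -x + label * offset
            data.append(((x, y), label))
    rng.shuffle(data)
    return data


def train_perceptron(data, max_epochs=1000):
    """Classic perceptron updates; stops at the first epoch with no mistakes."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in data:
            if label * (w[0] * x + w[1] * y + b) <= 0:
                w[0] += label * x
                w[1] += label * y
                b += label
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(data, w, b):
    """Fraction of points that fall on the correct side of the learned line."""
    correct = sum(
        1 for (x, y), label in data if label * (w[0] * x + w[1] * y + b) > 0
    )
    return correct / len(data)


data = make_separable()
w, b = train_perceptron(data)
print("training accuracy:", accuracy(data, w, b))
```

Because the data are separable with a margin, the perceptron stops making mistakes after a bounded number of updates and classifies every training point correctly. That is about all there is to learn from such a problem.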
Let’s get this issue out of the way first:
iris is boring, as the real world
is ever so rarely linearly separable. It has too few variables to talk about
feature engineering or variable selection, and everything is linearly correlated
(because of allometric scaling), and so the amount of information brought in by
extra variables is limited. There is nothing to learn from
iris, at least not
that has practical relevance, besides how to write code to apply a given model.
I do not like it very much.
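The point about allometric scaling can be illustrated with a small simulation. This is only a sketch under stated assumptions: the hidden `size` variable, the scaling coefficients, and the noise levels are invented for the example and are not estimated from any real plants. When two measurements both scale with the same underlying size, they end up strongly correlated, so the second one adds little information beyond the first.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(42)

# Hypothetical overall size of each of 150 simulated individuals.
size = [rng.uniform(1.0, 5.0) for _ in range(150)]

# Two measurements that both scale linearly with size, plus a little noise.
length = [2.0 * s + rng.gauss(0.0, 0.2) for s in size]
width = [0.8 * s + rng.gauss(0.0, 0.2) for s in size]

r = pearson(length, width)
print("correlation between length and width:", round(r, 3))
```

With these made-up parameters the correlation comes out very close to 1, which is the situation described above: once one measurement is known, the other is largely predictable.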
Edit: “collected”, in the next paragraph, is to be taken as “gathered for the purpose of analysis”, and not “sampled in the field”. The original dataset was sampled by Edgar Anderson, and there is no indication of any particular intent other than botany work behind the original sampling.
That it is almost offensively quaint is not my main gripe with this dataset.
It would bore me to tears, but I could possibly spend an entire term teaching
with it. But I will not; for you see,
iris was collected and first published
with the express intent to advance the science of eugenics, and parading it
around in 2020 is an unacceptable endorsement of a repulsive (but still very
much alive) way to subsume science under ideology. For this reason, iris
should not be appearing in teaching material with such ubiquity, unless it is to
remind students that white supremacy has always tried to use quantitative
methods to push its agenda, and that quantitative sciences have a foundation in
providing arguments to scientific racism and classism.
The iris paper was published in the Annals of Eugenics, in 1936, by
one Ronald Fisher. Fisher was a vocal proponent of eugenics, involved in learned
societies on eugenics. One of the points of the paper (and of the journal, and
of Fisher’s leading role in developing biometry and biostatistics) was to
propose a methodological framework to delineate desirable traits, in support of
eugenics programs. One does not publish in the Annals of Eugenics in 1936 on a
By using this dataset in 2020, we are sending a very strong message. Maybe we do
not care about the role of science in creating and reinforcing inequalities and
structures of oppression. Maybe we are at peace with the fact that a lot of
early quantitative techniques in the biological sciences have been designed to
support eugenics and grant it legitimacy, as phrenology did before. Maybe we are
eager to pardon the racism of pioneers of the field if their contributions are
important enough. Maybe we just don’t care about the social consequences of our
science. In any case, just as much as one does not publish in the Annals of
Eugenics by accident, the decision to keep using this dataset is an
endorsement, albeit an implicit one, of being able to draw a straight line from
mainstream academic science to white supremacy. Every time we use iris as
foundational in data science education, we are re-drawing this straight line,
over and over again. Whether we intend to do so matters little, if at all.
We have alternatives to
iris that do not share the same problem. The wheat
seeds dataset, for example, is
far more interesting, and does not have the same ideological red flags attached
to it. Every time we decide to use
iris in the classroom, we are putting our
habits before our duty to consider the ethical implications of our work. We
should do better, and immediately dropping
iris is one easy way to start this
work. This is not to say that we should sweep
iris under the rug. It would be
an equally grave mistake; we must discuss it, we must acknowledge the
ideology that produced it, and we must decide if this is the ideology we want
to bring into data science education.