It's time to retire the iris dataset

Data Science meh, Eugenics bad

Almost everyone who has been interested in machine learning has had to work with the iris dataset, and I have been thinking about it more than usual in the weeks leading up to a very cool data science thing I’m not supposed to talk about yet. It’s time we stop using iris entirely.

The iris dataset represents four measurements of floral morphology on 150 plants, 50 individuals for each of three species (I. versicolor, I. setosa, and I. virginica). It was measured near Gaspé, in a strikingly beautiful part of Québec, at some point in the 1930s. With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.
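To make “linearly separable” concrete, here is a minimal sketch using a perceptron, an algorithm that stops making mistakes only when a straight line can split the two classes. The measurements are made-up toy values (not iris, nor any real dataset), and `train_perceptron` is a hypothetical helper written for this illustration.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Fit a linear boundary w.x + b = 0 for labels in {-1, +1}.

    Returns (weights, bias, converged). `converged` is True only if a full
    pass over the data produced no misclassification.
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # point is on the wrong side (or on the line)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return weights, bias, True
    return weights, bias, False


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the boundary."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy (length, width) pairs in cm, invented for this example.
short_petals = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (1.7, 0.4)]
long_petals = [(4.5, 1.5), (5.1, 1.9), (4.0, 1.3), (5.9, 2.1)]
points = short_petals + long_petals
labels = [-1] * len(short_petals) + [1] * len(long_petals)

weights, bias, converged = train_perceptron(points, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(points, labels)) / len(points)
print(f"converged={converged}, accuracy={accuracy:.2f}")
```

When classes overlap, as they usually do in real data, no pass is ever mistake-free and the loop simply runs until `max_epochs`.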

Let’s get this issue out of the way first: iris is boring, as the real world is ever so rarely linearly separable. It has too few variables to talk about feature engineering or variable selection, and everything is linearly correlated (because of allometric scaling), so the amount of information brought in by extra variables is limited. There is nothing to learn from iris, at least nothing of practical relevance, besides how to write code to apply a given model. I do not like it very much.
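As a rough illustration of why strongly correlated variables add little information, the sketch below computes a Pearson correlation coefficient by hand. The two lists are invented toy values that mimic allometric scaling (one measurement growing roughly in proportion to the other); they are not taken from iris, and `pearson` is a helper written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Toy measurements in cm, invented for this example.
petal_length = [1.4, 1.5, 4.0, 4.5, 5.1, 5.9]
petal_width = [0.2, 0.3, 1.3, 1.5, 1.9, 2.1]

r = pearson(petal_length, petal_width)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 means the second variable is nearly a rescaled copy of the first, so a model gains very little from seeing both.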

Edit: “collected”, in the next paragraph, is to be taken as “gathered for the purpose of analysis”, and not “sampled in the field”. The original dataset was sampled by Edgar Anderson, and there is no indication of any particular intent other than botany work behind the original sampling.

That it is almost offensively quaint is not my main gripe with this dataset. It would bore me to tears, but I could possibly spend an entire term teaching with it. But I will not; for you see, iris was collected and first published with the express intent to advance the science of eugenics, and parading it around in 2020 is an unacceptable endorsement of a repulsive (but still very much alive) way to subsume science under ideology. For this reason, iris should not be appearing in teaching material with such ubiquity, unless it is to remind students that white supremacy has always tried to use quantitative methods to push its agenda, and that quantitative sciences have a foundation in providing arguments to scientific racism and classism.

The original iris paper was published in the Annals of Eugenics, in 1936, by one Ronald Fisher. Fisher was a vocal proponent of eugenics, involved in learned societies on eugenics. One of the points of the paper (and of the journal, and of Fisher’s leading role in developing biometry and biostatistics) was to propose a methodological framework to delineate desirable traits, in support of eugenics programs. One does not publish in the Annals of Eugenics in 1936 on a misunderstanding.

By using this dataset in 2020, we are sending a very strong message. Maybe we do not care about the role of science in creating and reinforcing inequalities and structures of oppression. Maybe we are at peace with the fact that a lot of early quantitative techniques in the biological sciences were designed to support eugenics and grant it legitimacy, as phrenology did before. Maybe we are eager to pardon the racism of pioneers of the field if their contributions are important enough. Maybe we just don’t care about the social consequences of our science. In any case, just as one does not publish in the Annals of Eugenics by accident, the decision to keep using this dataset is an endorsement, albeit an implicit one, of being able to draw a straight line from mainstream academic science to white supremacy. Every time we use iris as foundational in data science education, we are redrawing this straight line, over and over again. Whether we intend to do so matters little, if at all.

We have alternatives to iris that do not share the same problem. The wheat seeds dataset, for example, is far more interesting, and does not have the same ideological red flags attached to it. Every time we decide to use iris in the classroom, we are putting our habits before our duty to consider the ethical implications of our work. We should do better, and immediately dropping iris is one easy way to start this work. This is not to say that we should sweep iris under the rug. That would be an equally grave mistake; we must discuss it, we must acknowledge the ideology that produced it, and we must decide if this is the ideology we want to bring into data science education.