False negatives and the reproducibility crisis

Yesterday, Dominique Roche gave a seminar on open science, which he argues is a way to help with the current reproducibility crisis. I overwhelmingly agree with the message, but hearing some of the arguments brought me back to a point that is not addressed as frequently as the rest: what happens to false negatives?

The current response to the reproducibility crisis aims to check that significant results are indeed significant. In brief, we can classify all of science in the following matrix:

|                      | Really significant | Really null    |
|----------------------|--------------------|----------------|
| Measured significant | true positive      | false positive |
| Measured null        | false negative     | true negative  |

The ideal solution is to have all of our results on the “true” diagonal. Currently, reproducibility is mostly about assessing the status of results from the first row (they have been measured as being significant), and deciding whether they fall within the first column (we were right to measure them as significant) or in the second (it was a fluke).

But let’s look at the first column. In the second row are results that we measured as not being significant, even though they actually are. Reproducing them might change their status from false negatives to true positives.

There is nothing in this approach that would differ from the current workflows around reproducing published results. Except for one tiny little detail: we don’t publish the results from the second row. As a consequence of journal policies (or unchecked reviewer behavior), it is much easier to publish significant results (true or false) than negative ones (true or false).

In addition, we as researchers are quick to self-select. When I work on a project, what ends up in the paper is much less than one tenth of what ends up on the cutting room floor. There are a lot of trials that don’t pan out, numerical experiments that are not interesting, and yes, pilots during which I don’t see a strong enough effect.

Ideally, reproducibility in science should strike a delicate balance between keeping the pace of progress reasonable by checking for false positives, and ensuring that as many true positives as possible are published by re-assessing negative results. It is much less glamorous than running after the New Thing, but it seems to make good statistical sense.
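
To make the statistical argument a bit more concrete, here is a minimal sketch of the matrix above as a simulation. Everything in it is an assumption chosen for illustration, not something measured: the function name `simulate_studies`, the effect size, the sample size, the share of studies with a real effect, and the use of a simple two-sided z-test with known unit variance.

```python
import random
from statistics import NormalDist, mean


def simulate_studies(n_studies=2000, n_samples=20, effect=0.3,
                     prior_true=0.5, alpha=0.05, seed=1):
    """Sort simulated studies into the four cells of the matrix.

    All defaults are illustrative assumptions, not empirical values.
    """
    rng = random.Random(seed)
    # Two-sided critical value for a z-test with known unit variance.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    cells = {"true positive": 0, "false positive": 0,
             "false negative": 0, "true negative": 0}
    for _ in range(n_studies):
        # Column: is there really an effect in this study?
        really_significant = rng.random() < prior_true
        mu = effect if really_significant else 0.0
        sample = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
        # Row: did we measure the effect as significant?
        z = mean(sample) * n_samples ** 0.5
        measured_significant = abs(z) > z_crit
        if measured_significant:
            key = "true positive" if really_significant else "false positive"
        else:
            key = "false negative" if really_significant else "true negative"
        cells[key] += 1
    return cells


if __name__ == "__main__":
    for name, count in simulate_studies().items():
        print(f"{name:>15}: {count}")
```

With a small effect and small samples like these, the test has low power, so the false negative cell should end up much larger than the false positive one. Increasing `n_samples` shrinks it, which is roughly what re-assessing a negative result with more data would do.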