On the emptiness of failed replications: the best parts


I've had essentially no time for reading lately as I'm doing the SAHD thing, moving into a new house, and coding like a maniac whenever I get a spare hour, but there was so much buzz on Twitter around Jason Mitchell's essay about replication studies in social psychology that I had to take a half-hour to read it. And for all that is holy, let me just say: What. The. Fuck. That's 30 minutes of my life I can't get back. I will replicate the highlights below so that you don't have to read it yourself:

Although the notion that negative findings deserve equal treatment may hold intuitive appeal, the very foundation of science rests on a profound asymmetry between positive and negative claims. Suppose I assert the existence of some phenomenon, and you deny it; for example, I claim that some non-white swans exist, and you claim that none do (i.e., that no swans exist that are any color other than white). Whatever our a priori beliefs about the phenomenon, from an inductive standpoint, your negative claim (of nonexistence) is infinitely more tenuous than mine. A single positive example is sufficient to falsify the assertion that something does not exist; one colorful swan is all it takes to rule out the impossibility that swans come in more than one color. In contrast, negative examples can never establish the nonexistence of a phenomenon, because the next instance might always turn up a counterexample. Prior to the turn of the 17th century, Europeans did indeed assume that all swans were white. When European explorers observed black swans in Australia, this negative belief was instantly and permanently confuted. Note the striking asymmetry here: a single positive finding (of a non-white swan) had more evidentiary value than millennia of negative observations. What more, it is clear that the null claim cannot be reinstated by additional negative observations: rounding up trumpet after trumpet of white swans does not rescue the claim that no non-white swans exists [sic].

The graduate caucus of the ecology department at my undergrad institution published a handbook whose cover image was just the text "p = 0.06" in huge, friendly letters. Ecology data can take ages to collect, and the rite of passage for many graduate students was to spend months on collection, analyze the data, and get a (barely) non-significant result. Generally speaking, the scientific hierarchy does not reward non-significant results, and Mitchell tells us why: people who get them are just incompetent scientists.

Mitchell's implicit assumption, throughout the essay, is that all positive statistical effects are real, even though the very nature of frequentist statistics forces us to accept that some percentage of positive results are false. Mitchell does not seem to understand statistics or the scientific method. He is searching for swans.
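That frequentist guarantee is easy to demonstrate with a quick simulation (a minimal sketch in Python, assuming simple one-sample z-tests on pure noise): even when there is no effect at all, about 5% of experiments still come out "significant" at α = 0.05.

```python
import math
import random

random.seed(0)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

# 10,000 experiments on pure noise: the null hypothesis is true every time.
alpha = 0.05
n_experiments = 10_000
false_positives = sum(
    z_test_p([random.gauss(0, 1) for _ in range(30)]) < alpha
    for _ in range(n_experiments)
)
rate = false_positives / n_experiments
print(f"false positive rate: {rate:.3f}")  # hovers around alpha = 0.05
```

Every one of those "positive" findings is a false positive; by Mitchell's logic, each would count as a swan.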

[Other replication proponents agree that] any small number of failed experiments cannot dislodge a positive finding, but argues that we can nevertheless learn something important from the distribution of effect sizes obtained using similar methods. However, for most of the reasons above, such distributions will mainly describe the potency of our methods and those who use them, not the "realness" of an effect. To illustrate this, imagine that I ask a hundred people each to experimentally determine the relation between the temperature and pressure of a gas. Although the "actual" relation is perfectly linear, the group will generate a distribution of effect sizes, and this distribution will depend entirely on the experimental skill of the researchers: a hundred physics graduate students using state-of-the-art equipment will generate a different distribution than a group of seventh-graders working out of their kitchens. In other words, distributions of effect sizes are no less dependent on the experimenters generating them than are single-point estimates. Such distributions can, in a limited way, tell us something about the efficacy of experimenters and their methods, but they cannot be dispositive about whether a phenomenon "exists" or not. A repository of all attempted experiments might benefit our field, but only in the limited way of suggesting which methods—and experimenters—may be more or less robust, and not by bearing on the existence of a phenomenon.

The pharmaceutical industry will be very relieved. The lowest p-values are just the most "right".

And finally, the problem:

How do we identify replications that fail simply because of undetected experimenter error?

Uh, do it again?

I agree with Mitchell about one thing: one should not do replication studies in an attempt to smear the original authors. If a replication "fails" (a poor word, because it doesn't distinguish between an experiment done badly and one done well that finds no positive effect), something is fishy, and both groups should work together to figure out what happened. That advances science. If they can't, so be it. Working at the bleeding edge of knowledge means that some things "might" be true.

It's not even Mitchell's notion that experiments shouldn't be replicated that bugs me; it's the idea that null results have no value - the idea that negative results can be the result of error, but positive results can't. In my own scientific experience, that is simply not the case (admittedly in a different field, but from my limited knowledge, isn't social psychology more susceptible to systematic biases than fields where the research subject is not also a person?). Publication bias - the idea that positive results should be celebrated while negative ones are useless - has led to a host of problems across multiple fields, including incentivizing the fraud that Mitchell thinks is so rare.
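The effect-size inflation that publication bias produces is also easy to simulate (a sketch under assumed conditions: a small true effect, simple one-sample studies, and a "journal" that only accepts p < 0.05). The published subset of studies systematically overstates the effect.

```python
import math
import random

random.seed(1)

def experiment(true_mean, n=30):
    """One-sample study with sigma = 1: return (effect estimate, two-sided p-value)."""
    sample = [random.gauss(true_mean, 1) for _ in range(n)]
    est = sum(sample) / n
    p = math.erfc(abs(est) * math.sqrt(n) / math.sqrt(2))
    return est, p

# A small but real effect, studied 20,000 times...
results = [experiment(true_mean=0.2) for _ in range(20_000)]
all_estimates = [est for est, _ in results]
# ...but only significant results get published.
published = [est for est, p in results if p < 0.05]

mean_all = sum(all_estimates) / len(all_estimates)
mean_pub = sum(published) / len(published)
print(f"mean effect, all studies:       {mean_all:.2f}")  # near the true 0.2
print(f"mean effect, published studies: {mean_pub:.2f}")  # inflated well above 0.2
```

Averaging only the studies that cleared the significance bar roughly doubles the apparent effect - which is exactly why a repository of *all* attempted experiments, null results included, has value.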