Sunday, January 18, 2015

When Replicating Stapel is not an Exercise in Futility

Over 50 of Diederik Stapel’s papers have been retracted because of fraud. This means that his “findings” have now ceased to exist in the literature. But what does this mean for his hypotheses?*

Does the fact that Stapel has committed fraud count as evidence against his hypotheses? Our first inclination is perhaps to think yes. In theory, it is possible that Stapel ran a number of studies, never obtained the predicted results, and then decided to take matters into his own hands and tweak a few numbers here and there. If there were evidence of a suppressed string of null results, then yes, this would certainly count as evidence against the hypothesis; it would probably be a waste of time and effort to try to “replicate” the “finding.” Because the finding is not a real finding, the replication is not a real replication. However, by all accounts (including Stapel’s own), once he got going, Stapel didn’t bother to run the actual experiment. He just made up all the data.

This means that Stapel’s fraud has no bearing on his hypotheses. We simply have no empirical data that we can use to evaluate them. It is still possible that a hypothesis of his is supported in a proper experiment. Whether or not it makes sense to test that hypothesis is purely a matter of theoretical plausibility. And how do we evaluate replication attempts that were performed before the fraud had come to light? At the time the findings were probably seen as genuine; they were published, after all.

Prior to the exposure of Stapel’s fraudulent activities, Dutch social psychologist Hans IJzerman and some of his colleagues had embarked on a cross-cultural project, involving Brazilian subjects, that built on one of Stapel's findings. They then found out that another researcher in the Netherlands, Nina Regenberg, had already tried—and failed—to replicate these same findings in 9 direct and conceptual replications. As IJzerman and colleagues wryly observe:

At the time, these disconfirmatory findings were seen as ‘failed studies’ that were not worthy of publication. In hindsight, it seems painfully clear that discarding null effects in this manner has hindered scientific progress.

Ironically, the field that made it possible for Stapel to publish his made-up findings also made it impossible to publish failed replications of his work that involved actual findings. 

But the times they are a-changin’. IJzerman and Regenberg joined forces and together with their colleagues Justin Saddlemyer and Sander Koole they have written a paper, currently in press in Acta Psychologica, that reports 12 replications of a—now retracted—series of experiments published by Diederik Stapel and Gün Semin. Semin, of course, was unaware of Stapel's deception.**

Here is the hypothesis that was advanced by Stapel and Semin: priming with abstract linguistic categories (adjectives) should lead to a more abstract perceptual focus, whereas priming with concrete linguistic categories (action verbs) should lead to a more concrete perceptual focus. This linguistic category priming hypothesis is based on the uncontroversial observation that specific linguistic terms are recurrently paired with specific situations. As a result, Stapel and Semin hypothesized, linguistic terms may form associative links with cognitive processes. Because these associative links are stored in memory, they may be activated or “primed” whenever people encounter the relevant linguistic terms.

Stapel and Semin further hypothesized that verbs are associated with actions at a more concrete level than adjectives. A verb like hit is used in a context like Harry is hitting Peter, whereas an adjective like aggressive is used in a more abstract description of the situation, as in Harry is being aggressive toward Peter. Because abstract information is more general, it may become associated with global perceptions, whereas concrete information may become associated with local perceptions. So far so good; I bet that many psychologists can follow this reasoning. Due to these associations, Stapel and Semin reason, priming verbs may elicit a focus on local details (i.e., the trees), while priming adjectives may elicit a focus on the global whole (i.e., the forest). This is a bit of a leap for me but let’s follow along.

Stapel and Semin reported four experiments in which they found evidence supporting their hypothesis. Priming with verbs led to more concrete processing than priming with adjectives. But of course these experiments were never actually performed and the findings were fabrications.

Let's look at some real data. Here is IJzerman et al.'s forest plot of the standardized mean difference scores between verb and adjective primes on global vs. local focus in twelve replications of the Stapel and Semin study.

Of the 12 studies, only one showed a significant effect (and it was not in the predicted direction). Overall, the standardized mean difference between the conditions was practically zero. No shred of support for the linguistic category priming hypothesis, in other words.
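For readers who want to see what an "overall standardized mean difference" amounts to, here is a minimal sketch of one common way to compute it: a fixed-effect, inverse-variance weighted average of per-study Cohen's d values. This is an illustration only. I am not claiming it is the model IJzerman et al. used, the function names are my own, and the numbers in the usage lines are hypothetical placeholders, not their data.

```python
import math


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def d_variance(d, n_a, n_b):
    """Common large-sample approximation to the sampling variance of d."""
    return (n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b))


def pool_fixed_effect(effects):
    """Inverse-variance weighted mean of (d, variance) pairs, with a 95% CI."""
    weights = [1.0 / var for _, var in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical (d, n_verb, n_adjective) triples, for illustration only.
hypothetical_studies = [(0.20, 40, 40), (-0.15, 50, 50), (0.00, 60, 60)]
effects = [(d, d_variance(d, n_a, n_b)) for d, n_a, n_b in hypothetical_studies]
pooled_d, ci = pool_fixed_effect(effects)
print(f"pooled d = {pooled_d:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Larger studies get more weight because their effect estimates have smaller variance. A pooled estimate near zero whose confidence interval straddles zero is what "practically zero" looks like in a forest plot.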

Are these findings the death blow (to use the authors’ term) to the notion of linguistic category priming? IJzerman and his colleagues don’t think so. In what is perhaps a surprising twist, they conclude:

[I]t remains to be seen whether the effect we have investigated does not exist, or whether it depends on identifying the right contexts and measurements for the linguistic category priming effects among Western samples.

My own conclusions are the following.

  1. Replications of findings proven to be fraudulent are important. Without replications, the status of the hypotheses remains unclear. After all, the findings were previously deemed publishable by peer reviewers, presumably based in part on theoretical considerations. Without relevant empirical data, the area of research will remain tainted and researchers will steer clear of it. While this may not be bad in some cases, it might be bad in others.
  2. The Pottery Barn rule should hold in scientific publishing: you break it, you buy it. If you published fraudulent findings, you should also publish their nonreplications. Many journals do not adhere to this rule. Sander Koole informed me that the Journal of Personality and Social Psychology (JPSP) congratulated IJzerman and colleagues on their replication attempts but rejected their manuscript nonetheless, even though the journal had previously published the Stapel and Semin paper. It is a good thing the editors at Acta Psychologica have taken a more progressive stance on publishing failed replications.***
  3. It is a good sign that the climate for the publication of failed replications is improving somewhat. Dylan’s right, the times are a-changin'. I am glad that the authors persevered and that their work is seeing the light of day.

*     I thank Hans IJzerman and Sander Koole for feedback on a previous version of this post. 
**   Semin was the doctoral advisor of both IJzerman and Regenberg and was initially involved in the replication attempts but let his former students use the data.
*** Until January 2014 I was Editor-in-Chief at Acta Psychologica. I was not involved in the handling of the IJzerman et al. paper and am therefore not patting myself on the back.


  1. In his book, Stapel describes receiving emails from other researchers who had successfully "replicated" studies he had faked. I wonder if they had trouble getting published because "we don't do replications"? (Maybe they could resubmit now, perhaps without mentioning their original inspiration?)

    Of course, it makes sense that Stapel's studies might replicate, because as he said himself, he never proposed anything especially counterintuitive. His theoretical introductions generally sound plausible, reasonable, the kind of stuff you expect to find in a social psychology paper. (That said, I also find the hypotheses in the study here to require a bit of a leap. As a total layperson, it just seems unlikely that priming with verbs versus adjectives would have much of an effect; I would expect both to evoke the root meaning. But perhaps the linguistic psychologist here has some literature to hand!)

    1. It was beyond the metapsychological scope of this post, but as a cognitive psychologist interested in language, I wouldn't be comfortable comparing verbs and adjectives in the first place, due to the many potential confounds (e.g., meaning, syntactic category, frequency, length, age of acquisition, and so on).

  2. Non-replication is likely also important for reducing the level of "Pathological Science," a term coined by Langmuir, which I posted about in relation to reproducibility in chemistry.