Beware of Voodoo Experimentation

In my previous post I described our replication attempt of Experiment 1 from Vohs and Schooler (2008). They found large effects of a manipulation of belief in free will (via the reading of passages) on people’s reported belief in free will and on subsequent cheating behavior. We tried to replicate these findings using Mechanical Turk but obtained null results.

What might account for the stark differences between our findings and those of V&S? And, in the spirit of the educational roots of this project, what lessons can we learn from this attempt at replication?

One obvious difference between our findings and those of V&S is in subject populations. Our subjects had an average age of 33 (range 18-69) and were native speakers of English residing in the US (75 males and 77 females). The distribution of education levels was as follows: high school (13%), college no-degree (33%), associate’s degree (13%), bachelor (33%), and master’s/PhD (8%).

How about the subjects in the original study? V&S used… 30 undergraduates (13 females, 17 males); that’s all the paper says. Kathleen Vohs informed us via email that the subjects were undergraduates at the University of Utah. Specifically, they were smart, devoted adults, about half of whom were active in the Mormon Church. One would think this is not too trivial a detail to mention in the paper. After all, free will is not unimportant to Mormons, as is shown here and here. It is true that Psychological Science imposes rather stringent word limits, but still…

Lesson 1: Critical information should not be omitted from method sections. (This sounds like flogging a dead horse, but try to replicate a study and you’ll see how much information is often missing.)

So there clearly is a difference between our subject population and that of the original experiment. We did not ask about religious affiliation (we did not know this was important, as it was not mentioned in the original paper), but I doubt that we would find 30 Mormons, 15 of them active in the Mormon Church, in our sample.

What we can do, however, is match our sample in terms of age (this is also not specified in the original article, but let’s assume late teens to mid-twenties) and level of education. In an analysis of 30 subjects meeting these criteria, we found no significant effects on either the manipulation check or cheating behavior.

So differences in age and level of education from the original sample do not seem to account for our null findings. We cannot be sure, however, whether membership in the Mormon Church plays a role.

Another big difference between our experiment and the original is that ours was conducted online and the original in the lab. It has been demonstrated that many classical findings in the psychological literature can be replicated in online experiments (e.g., here), but this doesn’t mean online experiments are suitable for every task.

An obvious issue is that an online study cannot control the environment. To get some idea of the subjects' environment, we always ask them to indicate on a 9-point scale the amount of noise in their environment, with 1 being no noise and no distractions and 9 being much noise and many distractions. The average score on this scale was 1.6. The majority of subjects (73%) indicated that they were in a quiet environment with no distractions. An additional 11% indicated they were in a quiet environment with some distractions. Very few people indicated being in a noisy environment with distractions. Of course, these are self-report measures, but they do suggest that environmental distractions were not a factor.

Perhaps subjects did not read the manipulation-inducing passages. There is no information on this in the original study, but we measured reading times. The average reading time for the passages was 380 ms/word, which is quite normal for texts of this type. There were a few subjects with unusually short reading times; eliminating their data did not change the results. So from what we can tell, the subjects read the texts and did not merely click through them. In fact, it would have been even better (for both the original study and the replication attempt) to also have comprehension questions about the passages at the end of the experiment.
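A reading-time screen of this kind is easy to sketch. The threshold and subject data below are hypothetical (the post reports only the 380 ms/word average, not the exact cutoff we used):

```python
# Sketch of a reading-time sanity check for passage reading.
# Threshold and data are hypothetical, for illustration only.

def ms_per_word(total_reading_ms, n_words):
    """Average reading time per word for one passage."""
    return total_reading_ms / n_words

def flag_click_through(subjects, passage_words, min_ms_per_word=150):
    """Return IDs of subjects whose per-word reading time is implausibly
    short, suggesting they clicked through the passage without reading."""
    return [
        sid for sid, total_ms in subjects.items()
        if ms_per_word(total_ms, passage_words) < min_ms_per_word
    ]

# Hypothetical example: a 500-word passage, three subjects (total ms).
subjects = {"s1": 190_000, "s2": 40_000, "s3": 210_000}
print(flag_click_through(subjects, passage_words=500))  # ['s2'] (80 ms/word)
```

Rerunning the analysis with and without the flagged subjects then shows whether click-throughs drive the result.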

Lesson 2: Gather as much information about your manipulation-inducing stimuli as possible.

Another potential problem, which was pointed out by a commenter on the previous post, is that some subjects on Mechanical Turk, “Turkers,” may already have participated in similar experiments and thus not be naïve to the manipulation (see here for a highly informative paper on this topic).

We always ask subjects about their perceptions of the manipulation, and this experiment was no exception. We coded a perception as “aware of the manipulation” if it mentioned “honesty,” “integrity,” “pressing the space bar,” “looking at the answer,” “following instructions,” or something similar. We coded someone as “unaware” if they explicitly stated that they had no idea or if they mentioned a different purpose of the experiment. Some examples are: (1) the study was about judgments and quickness, (2) deterioration of short-term memory, and (3) how quickly people can solve math problems.
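A coding scheme like this amounts to simple keyword matching. The keyword list below comes from the post itself, but the helper function and example responses are illustrative, not our actual coding script:

```python
# Illustrative keyword-based coding of free-text perception responses.
# Keywords are taken from the post; the rest is a hypothetical sketch.

AWARE_KEYWORDS = [
    "honesty", "integrity", "pressing the space bar",
    "looking at the answer", "following instructions",
]

def code_awareness(response):
    """Code a response as 'aware' if it mentions any manipulation-related
    keyword, otherwise as 'unaware'."""
    text = response.lower()
    return "aware" if any(kw in text for kw in AWARE_KEYWORDS) else "unaware"

# Hypothetical responses:
print(code_awareness("I think it tested honesty when answers appeared"))  # aware
print(code_awareness("Deterioration of short term memory"))               # unaware
```

In practice such automatic coding would be checked by hand, since responses that paraphrase the purpose without using any keyword would be miscoded.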

According to these criteria, about half the subjects were “aware” of the manipulation. We performed a separate follow-up analysis on the “unaware” subjects. There still was no effect of the manipulation on the amount of cheating. We did find a slightly higher incidence of cheating among the “aware” subjects than among the “unaware” subjects. All in all, though, the level of cheating was much lower than in the original study.

So does awareness of the manipulation explain our null findings? I don’t think so. Some commenters on the previous post decried our study for having so many “aware” subjects. They should realize that we don't even know if all 30 subjects in the original study believed the cover story; there is no information on this in the article.

Lesson 3: Always ask subjects about their perceptions of the purpose of the experiment.

I find it hard to believe that the subjects in the original experiment all bought the cover story. Unlike our experiment, the original study provides no information on how many people disbelieved it. Some commenters have suggested that it is easier to convince people of a cover story when there is an actual experimenter present. This seems plausible, although it still doesn't seem likely to me that everyone would have believed the story. And of course it would be an awful case of circular reasoning to say that the subjects must have believed the manipulation simply because there was a large effect.

But there is a bigger point. If the large effect reported in the original study hinges on the acting skills of the experimenter, then there should be information on this in the paper. The article merely states that the subjects were told of the glitch. We incorporated what the subjects were told into our instructions. But if it is not the content of what they were told that is responsible for the effect, but rather the manner in which it was told, then there should be information on this. Did the experimenter act distraught, confused, embarrassed, or neutral? And was this performance believable and delivered with any consistency? If the effect hinges on the acting skills of an experimenter, experimentation becomes an art and not a science. In addition to voodoo statistics, we would have voodoo experimentation. (A reader of this post pointed me to this highly relevant article on the ubiquity of voodoo practices in psychological research.)

It should be obvious, but I’d like to state it explicitly anyway: I’m not saying that V&S performed voodoo experimentation. I am just saying that if the claim is that the effect relies on factors that are not (or cannot be) articulated and documented—and I’ve heard people (not V&S) make this claim—then we have voodoo experimentation.

Lesson 4: Beware of Voodoo Experimentation

It is striking that we were not even able to replicate the manipulation check that V&S used. I was told by another researcher (who is also performing a replication of the V&S experiment) that the reliability of the original manipulation check is low (we had not thought to examine this, but we did use the updated version of the scale, the FAD-Plus). I do not want to steal this researcher’s thunder, so I will not say anything more about this issue at this point (I will provide an update as soon as the evidence from that researcher's experiment is available). But the fact that we did not replicate the large effect on the manipulation check reported in the original study might not count as a strike against our replication attempt.

So where does this leave us? The fact that the large (!) effect of the original study completely evaporated in our experiment cannot be due to (1) the age or education levels of the subjects, (2) subjects not reading the manipulation-inducing passages (if reading times are any indicator), or (3) subjects’ awareness of the manipulation. The original paper provides no evidence regarding these issues.

The evaporation of the effect could, however, be due to (1) the special nature of the original sample, (2) the undocumented acting skills of a real-life experimenter (voodoo experimentation), or of course (3) the large effect being a false positive. I am leaning towards the third option, although I would not find a small effect implausible (in fact, that is what I was initially expecting to find).


  1. Nice discussion of an important issue. Methods are the parts of experiments that we can control, and we need to pay a lot more attention to the details. I wanted to emphasize one other point. Due to the nature of random sampling, we cannot easily rule out statistical errors. V&S might have made a Type I error. There is no shame in that, and it must happen sometimes. Likewise, the replication study might have been a Type II error, and there is no shame in that either.

    I think it is not possible to separate statistical errors and methodological differences without a theory. If your theory says that the methodological differences should not matter, then you should pool the experimental results together to get your best estimate of the strength of the effect.

    If your theory says that the methodological differences do matter, then a lot more experimental work is required to demonstrate those differences.

  2. It will be great when researchers begin publishing more information about the methods. For example, reaction times in the Qualtrics program are less reliable than RTs in programs designed for this purpose. Using one program or another is thus quite important, but most people don't report such information (I don't think this matters much for the RTs you report above; it's just a general example). Hopefully norms will change.

    What does the failed manipulation check mean for making inferences about the rest of the study? But perhaps you don't want to answer because of the thunder?

    I think the failed manipulation check is the hardest thing to "explain away" due to differences in samples and procedures. But it also makes it hard to know how best to think about the primary effect.

    As with the original voodoo paper, I think the title is over the top. Common sense suggests that some aspects of any given experiment would be more or less believable and taken more or less seriously in the lab vs. online. As I said before, this specific issue is especially about the DV and I don't think that it can easily account for the failed manipulation check.

    1. This comment has been removed by the author.

    2. Blogs can have snappy titles; articles need informative ones.

  3. It could be another case of the dwindling effect-size phenomenon.

  4. Yes, and of course Jonathan Schooler, the second author of the free-will study, has done interesting and important work in this area.

    1. This comment has been removed by the author.

    2. We ran it, submitted it to Psych Science. It was rejected (both original authors were among the reviewers). We will probably submit somewhere else but maybe I should also write a blog post about it. That second replication (run in the lab) also didn't show an effect (big surprise). One of the other reviewers wanted us to also replicate Vohs & Schooler's other experiment, which I found silly, especially after learning that there was an error in that experiment that was never corrected (the effect size is much smaller than reported in the paper).

    3. That's interesting, thank you for your reply. I'm particularly intrigued by the drop in cheating that you observed in both conditions, compared to the original results. I must admit, when reading your first post and before learning about the replication's outcome, I had a feeling the manipulation would go wrong once I read the experiment had been run online.
      I'd be curious to know if the lab results approximated those from the online study at all?

    4. We observed the same drop in the lab replication (Experiment 2) and also no effect. From the manuscript: "There is no difference in cheating between the AFW condition (M = 3.77, SD = 3.27) and the control condition (M = 2.93, SD = 3.99), t (58) = 0.88, p = .380, Cohen’s d = .23)."


  5. This comment has been removed by the author.
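As a side note, the t and Cohen's d quoted in the comment thread above can be recomputed from the reported means and SDs. This sketch assumes two equal groups of n = 30 (implied by df = 58), which the manuscript does not state explicitly:

```python
from math import sqrt

def t_and_d(m1, sd1, m2, sd2, n1, n2):
    """Independent-samples t and Cohen's d from summary statistics."""
    # Pooled standard deviation across the two groups.
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, d

# AFW condition vs. control, as reported in the comment above.
t, d = t_and_d(3.77, 3.27, 2.93, 3.99, 30, 30)
print(round(t, 2), round(d, 2))  # 0.89 0.23
```

The recomputed values are close to the reported t = 0.88 and d = .23; the small discrepancy in t is what one would expect from rounding in the reported means and SDs (or slightly unequal group sizes).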

