In my previous post I described our replication attempt of
Experiment 1 from Vohs and Schooler (2008). They found large effects of a
manipulation of belief in free will (via the reading of passages) on people’s
reported belief in free will and on subsequent cheating behavior. We tried to
replicate these findings using Mechanical Turk but obtained null results.
What might account for the stark differences between our
findings and those of V&S? And, in the spirit of the educational roots of
this project, what lessons can we learn from this attempt at replication?
One obvious difference between our findings and those of
V&S is in subject populations. Our subjects had an average age of 33 (range 18-69) and were
native speakers of English residing in the US (75 males and 77 females). The
distribution of education levels was as follows: high school (13%), some college but no degree (33%), associate's degree (13%), bachelor's degree (33%), and master's/PhD
(8%).
How about the subjects in the original study? V&S used… 30 undergraduates (13 females, 17 males);
that’s all it says in the paper. Kathleen Vohs informed us via email that the subjects were undergraduates at the University of Utah. Specifically,
they were smart, devoted adults, about half of whom were active in the Mormon Church. One would think this is not too trivial a detail to mention in the paper. After all, free will is not
unimportant to Mormons, as is shown here and here. It is quite true that
Psychological Science imposes rather
stringent word limits but still…
Lesson 1: Critical
information should not be omitted from method sections. (This sounds like
flogging a dead horse, but try to replicate a study and you’ll see how much
information is often missing.)
So there clearly is a difference between our subject
population and that of the original experiment. We did not ask about religious
affiliation (we did not know this was important, as it was not mentioned in the
original paper), but I doubt that we are going to find 30 Mormons, 15 of them active in the Mormon Church, in our sample.
What we can do, however, is match our sample in terms of age
(this is also not specified in the original article, but let’s assume late
teens to mid-twenties) and level of education. In an analysis of 30 subjects meeting these criteria, we found no significant effect of the manipulation on either the manipulation check or cheating behavior.
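For concreteness, here is a minimal sketch (in Python) of what such a matched-subsample analysis might look like. The file name and the column names (age, education, condition, fw_belief, cheating) are made up for illustration; this is not our actual analysis script.

```python
import pandas as pd
from scipy import stats

# Hypothetical data file and column names, for illustration only.
df = pd.read_csv("freewill_replication.csv")

# Match the original sample on age (late teens to mid-twenties) and education.
matched = df[df["age"].between(18, 25) &
             df["education"].isin(["high school", "college no-degree"])]

# Compare the anti-free-will and control conditions on both dependent measures.
for dv in ["fw_belief", "cheating"]:
    afw = matched.loc[matched["condition"] == "anti-free-will", dv]
    ctrl = matched.loc[matched["condition"] == "control", dv]
    t, p = stats.ttest_ind(afw, ctrl, equal_var=False)
    print(f"{dv}: t = {t:.2f}, p = {p:.3f}")
```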
So differences in age and level of education from the
original sample do not seem to account for our null findings. We cannot be sure,
however, whether membership in the Mormon Church plays a role.
Another big difference between our experiment and the
original is that our experiment was conducted online and the original in the
lab. It has been demonstrated that many classical findings in the psychological
literature can be replicated using online experiments (e.g., here)
but this doesn’t mean online experiments are suitable for any task.
An obvious issue is that an online study cannot control the environment. To get some idea of the subjects' environment, we always ask them to indicate on a 9-point scale the amount of noise in their environment, with 1 being no noise and no distractions and 9 being much noise and many distractions. The average score on this scale was 1.6. The
majority of subjects (73%) indicated that they were in a quiet environment with
no distractions. An additional 11% indicated they were in a quiet environment
with some distractions. Very few people indicated being in a noisy environment
with distractions. Of course, these are self-report measures, but they do suggest that environmental distractions were not a major factor.
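As a sketch of how this kind of check can be summarized (the file and column name are assumptions, not our actual code):

```python
import pandas as pd

# Hypothetical file and column name for the 9-point noise/distraction rating.
df = pd.read_csv("freewill_replication.csv")

# 1 = no noise, no distractions; 9 = much noise, many distractions.
print("Mean rating:", round(df["noise_rating"].mean(), 1))
print(df["noise_rating"].value_counts(normalize=True).sort_index().round(2))
```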
Perhaps subjects did not read the manipulation-inducing
passages. There is no information on this in the original study but we measured
reading times. The average reading time for the passages was 380 ms/word, which
is quite normal for texts of this type. There were a few subjects with
unusually short reading times. Eliminating their data did not change the
results. So from what we can tell, the subjects read the texts and did not click through them. In fact, it would have been even better (for both the original
study and the replication attempt) to also have comprehension questions about
the passages at the end of the experiment.
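For illustration, a check like ours could be scripted roughly as follows; the file name, column names, and the 150 ms/word cutoff are assumptions for this sketch, not the values we actually used.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("freewill_replication.csv")

# Reading rate for the manipulation passage in milliseconds per word.
df["ms_per_word"] = df["passage_reading_time_ms"] / df["passage_word_count"]
print("Mean reading rate:", round(df["ms_per_word"].mean()), "ms/word")

# Flag implausibly fast readers and rerun the analyses without them;
# the cutoff here is an arbitrary choice for the sketch.
careful = df[df["ms_per_word"] >= 150]
print("Excluded:", len(df) - len(careful), "subjects with very short reading times")
```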
Lesson 2: Gather as much information about your manipulation-inducing stimuli as possible.
Another potential problem, which was pointed out by a
commenter on the previous post, is that some subjects on Mechanical Turk,
“Turkers,” may already have participated in similar experiments and thus not be
naïve to the manipulation (see here
for a highly informative paper on this topic).
We always ask subjects about their perceptions of the manipulation, and this experiment was no exception. We coded a perception as “aware of the manipulation” if it mentioned “honesty”, “integrity”, “pressing the space bar,” “looking at the answer”, “following instructions,” or something similar. We coded someone as “unaware” if they explicitly stated that they had no idea or if they mentioned a different purpose of the experiment. Some examples are: (1) “The study was about judgments and quickness,” (2) “Deterioration of short-term memory,” and (3) “How quickly people can solve math problems.”
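Coding like this is ultimately a judgment call made by hand, but a first pass can be scripted. The sketch below illustrates the idea; the file name, column name, and keyword list are simplified assumptions.

```python
import pandas as pd

# A few of the phrases that would count as "aware"; the real coding used
# human judgment, so this keyword pass is only a rough first cut.
AWARE_KEYWORDS = ["honesty", "integrity", "space bar",
                  "looking at the answer", "following instructions"]

def code_awareness(response: str) -> str:
    text = response.lower()
    return "aware" if any(k in text for k in AWARE_KEYWORDS) else "unaware"

# Hypothetical file and column name, for illustration only.
df = pd.read_csv("freewill_replication.csv")
df["awareness"] = df["purpose_response"].fillna("").apply(code_awareness)
print(df["awareness"].value_counts(normalize=True))
```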
According to these criteria, about half the subjects were
“aware” of the manipulation. We performed a separate follow-up analysis on the
“unaware” subjects. There still was no effect of the manipulation on the amount
of cheating. We did find slightly more instances of cheating among the “aware” subjects than among the “unaware” subjects. All in all, though,
the level of cheating was much lower than in the original study.
So does awareness of the manipulation explain our null
findings? I don’t think so. Some commenters on the previous post decried our
study for having so many “aware” subjects. They should realize that we don't even know if all 30 subjects in the original study believed the cover story; there is no information on this in the article.
Lesson 3: Always ask subjects about their perceptions of the purpose of the experiment.
I find it hard to believe that the subjects in the original experiment all bought the cover story, but unlike our experiment, the original study provides no information on how many did not. Some commenters have suggested that it is easier
to convince people of the cover story if you have an actual experimenter. This
seems plausible although it still doesn't seem likely to me that everyone
would have believed the story. And of course it would be an awful case of circular
reasoning to say that the subjects must have believed the manipulation simply because there was a large effect.
But there is a bigger point. If the large effect reported in
the original study hinges on the acting skills of the experimenter, then there
should be information on this in the paper. The article merely states that the subjects were told of the glitch. We incorporated what the students were told into our instructions. But if it is not the content of what they were told that is responsible for the effect but rather the manner in which it was told, then there should be information on this. Did the experimenter act distraught, confused, embarrassed, or neutral?
And was this performance believable and delivered with any consistency? If the
effect hinges on the acting skills of an experimenter, experimentation becomes an
art and not a science. In addition to voodoo
statistics, we would have voodoo experimentation. (A reader of this post pointed me to this highly relevant article on the ubiquity of voodoo practices in psychological research.)
It should be obvious, but I’d like to state it explicitly anyway: I’m not saying that V&S
performed voodoo experimentation. I am just saying that if the claim is that
the effect relies on factors that are not (or cannot be) articulated and documented—and
I’ve heard people (not V&S) make this claim—then we have voodoo
experimentation.
Lesson 4: Beware of voodoo experimentation.
It is striking that we were not even able to replicate the
manipulation check that V&S used. I was told by another researcher (who is
also performing a replication of the V&S experiment) that the reliability
of the original manipulation check is low (we had not thought to examine this, but we did use the updated version of this scale, the FAD-Plus). I do not want to steal this researcher’s thunder, and so will not say anything more about this issue at this point (I will provide an update as soon as the evidence from that researcher’s experiment is
available). But the fact that we did not replicate the large effect on the
manipulation check that was reported in the original study might not count as a
strike against our replication attempt.
So where does this leave us? The fact that the large (!) effect of the original study completely evaporated in our experiment cannot be due to (1) the age or education levels of the subjects, (2) subjects not reading the manipulation-inducing passages (if reading times are any indicator), or (3) subjects’ awareness of the manipulation. The original paper provides no information regarding these issues.
The evaporation of the effect could, however, be due to (1) the special nature of the original sample, (2) the undocumented acting skills of a real-life experimenter (voodoo experimentation), or of course (3) the large effect being a false positive. I am leaning towards the third
option, although I would not find a small effect implausible (in fact, that is what I was initially expecting to find).
Nice discussion of an important issue. Methods are the parts of experiments that we can control, and we need to pay a lot more attention to the details. I wanted to emphasize one other point. Due to the nature of random sampling, we cannot easily rule out statistical errors. V&S might have made a Type I error. There is no shame in that, and it must happen sometimes. Likewise, the replication study might have been a Type II error, and there is no shame in that either.
I think it is not possible to separate statistical errors and methodological differences without a theory. If your theory says that the methodological differences should not matter, then you should pool the experimental results together to get your best estimate of the strength of the effect.
If your theory says that the methodological differences do matter, then a lot more experimental work is required to demonstrate those differences.
I agree on all counts.
It will be great when researchers begin publishing more information about the methods. For example, reaction times in the Qualtrics program are less reliable than RTs in programs designed for this purpose. Using one program or another is thus quite important, but most people don't report such information (I don't think this matters so much for the RTs you report above, just a general example). Hopefully norms will change.
What does the failed manipulation check mean for making inferences about the rest of the study? But perhaps you don't want to answer because of the thunder?
I think the failed manipulation check is the hardest thing to "explain away" due to differences in samples and procedures. But it also makes it hard to know how best to think about the primary effect.
As with the original voodoo paper, I think the title is over the top. Common sense suggests that some aspects of any given experiment would be more or less believable and taken more or less seriously in the lab vs. online. As I said before, this specific issue is especially about the DV and I don't think that it can easily account for the failed manipulation check.
Blogs can have snappy titles; articles need informative ones: http://rolfzwaan.blogspot.nl/2013/01/overly-amusing-article-titles.html.
It could be another case of the dwindling effect-size phenomenon: http://ignoranceanduncertainty.wordpress.com/2011/01/10/disappearing-truths-or-vanishing-illusions/
Yes, and of course Jonathan Schooler, the second author of the free-will study, has done interesting and important work in this area.
We ran it, submitted it to Psych Science. It was rejected (both original authors were among the reviewers). We will probably submit somewhere else but maybe I should also write a blog post about it. That second replication (run in the lab) also didn't show an effect (big surprise). One of the other reviewers wanted us to also replicate Vohs & Schooler's other experiment, which I found silly, especially after learning that there was an error in that experiment that was never corrected (the effect size is much smaller than reported in the paper).
That's interesting, thank you for your reply. I'm particularly intrigued by the drop in cheating that you observed in both conditions, compared to the original results. I must admit, when reading your first post and before learning about the replication's outcome, I had a feeling the manipulation would go wrong once I read the experiment had been run online.
VerwijderenI'd be curious to know if the lab results approximated those from the online study at all?
We observed the same drop in the lab replication (Experiment 2) and also no effect. From the manuscript: "There is no difference in cheating between the AFW condition (M = 3.77, SD = 3.27) and the control condition (M = 2.93, SD = 3.99), t(58) = 0.88, p = .380, Cohen’s d = .23."