More and more people are involved in replication research.
This is a good thing.
Why conduct replication experiments? A major motivation for recent replication attempts appears to be that there are serious doubts about certain findings. On that view, unsuccessful replications serve to reduce the initially observed effect size into oblivion. I call this replicating down. Meta-analytically speaking, the aggregate effect size becomes smaller with each unsuccessful replication attempt, and confidence in the original finding will dwindle accordingly (or so we would like to think). But the original finding will not disappear from the literature.
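To make the meta-analytic point concrete, here is a minimal sketch in Python of how an inverse-variance-weighted (fixed-effect) pooled estimate shrinks as null replications accumulate. All effect sizes and sample sizes are numbers I made up for illustration:

```python
# Fixed-effect meta-analysis sketch. All effect sizes (Cohen's d) and
# sample sizes below are invented for illustration.

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def pooled_d(studies):
    """Inverse-variance-weighted (fixed-effect) pooled estimate of d."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    return sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)

studies = [(1.0, 20, 20)]            # original study: large d, small sample
print(round(pooled_d(studies), 2))   # 1.0
studies.append((0.05, 80, 80))       # null replication with a larger sample
print(round(pooled_d(studies), 2))   # ~0.22: the aggregate drops sharply
studies.append((0.02, 80, 80))       # another null replication
print(round(pooled_d(studies), 2))   # ~0.13: and it keeps shrinking
```

Because the weights are inverse variances, larger studies count for more, which is exactly why well-powered null replications drag the aggregate down so quickly.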
[Photo caption: No, I'm not Noam Chomsky]
Replicating down is definitely a useful endeavor, but it can be quite discouraging. You're conducting an experiment that you are convinced doesn't make any sense at all. Suppose someone conducted a priming study inspired by a famous quote from Woody Allen's Husbands and Wives: "I can't listen to that much Wagner. I start getting the urge to conquer Poland." Subjects were primed with Wagner or a control composer (Debussy?) and then completed an Urge-to-Conquer-Poland scale. The researchers found that the urge to conquer Poland was much greater in the Wagner than in the Debussy condition (in the Debussy condition, however, subjects scored remarkably higher on the Desire-to-Walk-Around-with-Baguettes scale). The effect size was large, d = 1.
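For readers who want the arithmetic behind that number: Cohen's d is simply the difference between the two condition means divided by their pooled standard deviation. A toy computation, with all numbers invented for the hypothetical Wagner study:

```python
import math

# Toy Cohen's d computation; all numbers are invented for the Wagner example.
mean_wagner, sd_wagner, n_wagner = 5.2, 1.1, 30     # Urge-to-Conquer-Poland scores
mean_debussy, sd_debussy, n_debussy = 4.1, 1.1, 30

# Pooled standard deviation across the two groups.
pooled_sd = math.sqrt(((n_wagner - 1) * sd_wagner**2 +
                       (n_debussy - 1) * sd_debussy**2) /
                      (n_wagner + n_debussy - 2))

d = (mean_wagner - mean_debussy) / pooled_sd
print(round(d, 2))  # 1.0, a "large" effect by Cohen's conventions
```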
If you think the result is bogus and set out to replicate it anyway, you're spending valuable time and resources that could have gone toward novel experiments. Plus, you might feel silly performing the experiment. The whole enterprise might feel all the more discouraging because you are running the experiment with twice or more the number of subjects used in the original study: an exercise in futility, but with double the effort.
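Why twice or more? Because published effect sizes tend to be inflated, replicators typically power their studies to detect a smaller true effect, which requires many more subjects. A quick sketch using statsmodels; the effect sizes here are my own assumptions, not taken from any actual study:

```python
from statsmodels.stats.power import TTestIndPower

# Subjects per group needed for a two-sided independent-groups t-test
# at alpha = .05 with 90% power. The effect sizes are assumptions.
power_analysis = TTestIndPower()
for d in (1.0, 0.5):  # the original estimate vs. a more conservative guess
    n = power_analysis.solve_power(effect_size=d, power=0.90, alpha=0.05)
    print(f"d = {d}: about {round(n)} subjects per group")

# Halving the assumed effect size roughly quadruples the required sample,
# which is why serious replication attempts tend to be much larger
# (and feel like much more work) than the original studies.
```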
Other replication attempts are conducted because the replicators have at least some confidence in the original finding (and in the method that produced it) but want to establish how robust it is. We might call this replicating up. A successful replication attempt shores up the original finding by yielding similar results and providing a more robust estimate of the effect size. But how is this replicating up? Glad you asked. Up doesn't mean enlarging the effect size; it means raising the confidence we can have in the effect.
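To put "raising the confidence" in quantitative terms: pooling a successful replication with the original study leaves the point estimate roughly where it was, but it shrinks the confidence interval around it. A minimal sketch, again with made-up numbers:

```python
import math

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def ci95(d, var):
    """95% confidence interval around an effect size estimate."""
    half_width = 1.96 * math.sqrt(var)
    return (round(d - half_width, 2), round(d + half_width, 2))

# Original study alone (made-up numbers).
d_orig, var_orig = 0.6, d_variance(0.6, 30, 30)
print(ci95(d_orig, var_orig))        # wide interval, roughly (0.08, 1.12)

# Pool with a similar successful replication: the fixed-effect pooled
# variance is 1 / (sum of inverse variances), so the interval narrows.
d_rep, var_rep = 0.55, d_variance(0.55, 60, 60)
var_pooled = 1 / (1 / var_orig + 1 / var_rep)
d_pooled = (d_orig / var_orig + d_rep / var_rep) * var_pooled
print(ci95(d_pooled, var_pooled))    # similar estimate, tighter interval
```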
So while replicating down is certainly a noble and useful enterprise, a case can be made for replicating up as well. A nice recent example appears in a special topics section of Frontiers in Cognition that I'm co-editing. My colleagues Peter Verkoeijen and Samantha Bouwmeester performed a replication of an experiment by Kornell and Bjork (2008) that was published in Psychological Science. This experiment compared spaced (or, more precisely, "interleaved") and massed practice in learning painting styles. In the massed practice condition, subjects saw blocks of six paintings by the same artist. In the spaced condition, each block contained six paintings by six different artists. Afterwards, subjects took a recognition test. Intuitively, you would think that massed practice would be more effective. Kornell and Bjork thought so initially, as did the subjects in the experiments. They were therefore surprised to find that interleaved practice was actually more effective.
Verkoeijen and Bouwmeester replicated one of Kornell and Bjork's experiments. One difference was that the original experiment had been run in the lab, whereas the replication was run on Mechanical Turk. However, given that several other replication projects had shown no major differences between MTurk experiments and lab experiments, there was no reason to think the effect could not be found in an online experiment. As Verkoeijen and Bouwmeester note:
For one, nowhere in their original paper do Kornell and Bjork (2008)
indicate that specific sample characteristics are required to obtain a spacing
effect in inductive learning. Secondly, replicating the effect with a sample
from a more heterogeneous population than the relatively homogeneous
undergraduate population would constitute evidence for the robustness and
generality of the spacing effect in inductive learning and, therefore, would
rule out that the effect is restricted to a rather specific and narrow
population.
To cut to the chase, the replication attempt was successful (read the paper for a thoughtful discussion of this point). Just as in the original study, the replication found a significant benefit of interleaved over massed practice, and the effect sizes in the two experiments were quite similar. As the authors put it:
Our results clearly buttress those of Kornell and Bjork (2008) and
taken together they suggest that spacing is indeed beneficial in inductive
learning.
This is a nice example of replicating up. Moreover, the experiment
has now been brought to a platform (MTurk) where any researcher can easily and
quickly run replication attempts.
So far I may seem to have simply sung the virtues of successful replication. After all, isn't any successful replication an upward replication? Of course it is. But I'm not talking about the outcome of a replication project; I'm talking about the motivation for initiating it. Replicating down and replicating up are both useful, but in the long run upward replication is going to prove more useful (and less frustrating).
Perhaps a top tier of journals should be created for solid findings in psychology (see Lakens & Koole, 2012, for a similar proposal). This type of journal would publish only findings that have been thoroughly replicated. The fairest way to go about this would be to have the original authors as first authors and the replicators as co-authors. Rather than trying to remove nonreplicable findings from the literature via downward replication, upward replication would essentially create a new level in the literature, entrance to which can be gained only via upward replication.
(I thank Peter Verkoeijen for pointing me toward the Woody Allen quote.)
[update April 22, 2014: in my next post I discuss a study that would be a good candidate for replicating up.]
This semester I am teaching a graduate course on research methods for social psychology. The group project was to select recently published research that we could directly replicate in a short period of time (i.e., using MTurk). Out of many possibilities brought to the table, we selected a set of findings from a study that we felt would be very likely to replicate (what you call replicating up). The authors of the original study discussed their methods/procedures/results very clearly and had conducted a 2x2 experimental design via MTurk. When we contacted them, they very graciously shared their Qualtrics file with us so we could implement the exact materials they used. We collected twice as much data and replicated the significant two-way interactions on each of the outcome variables, but the pattern of the interaction was in the opposite direction! We double-checked our coding of conditions :) To be fair to the original research, we will share our results (and data files) with the original authors as well as run the study with another large sample. Anyway, I wanted to share our experience of attempting to "replicate up". It was a bit of an eye-opener for the students and a very positive experience overall, particularly given the openness of the original authors. It is also possible, of course, for those attempting to "replicate down" to obtain empirical evidence consistent with the original findings.
It is a very nice idea to integrate replication studies into our teaching. I do the same and have also found that it is a very rewarding enterprise for students. I am glad that the original authors were so helpful in your case.
You're right that it is possible for down-replicators to unexpectedly replicate the original findings, which would be ironic.
A side note:
"Intuitively, you would think that massed practice would be more effective. Kornell and Bjork thought so initially, as did the subjects in the experiments. They were therefore surprised to find that interleaved practice was actually more effective."
They should read the literature. It has been known for decades that variable practice hurts performance in the short term but benefits it in the long run, and variable practice always outperforms massed practice. The basic idea is that the former exposes you to a wider sample of the task space which increases the demands on you initially but provides you with a more robust experience of the task space in the longer run.
It'd be interesting to hear their response to this...