Thursday, April 3, 2014

Replicating Down vs. Replicating Up

More and more people are involved in replication research. This is a good thing.

Why conduct replication experiments? A major motivation for recent replication attempts appears to be serious doubts about certain findings. On that view, unsuccessful replications serve to reduce the initially observed effect size to oblivion. I call this replicating down. Meta-analytically speaking, the aggregate effect size becomes smaller with each replication attempt, and confidence in the original finding will dwindle accordingly (or so we would like to think). But the original finding will not disappear from the literature.
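To make the shrinkage concrete, here is a toy sketch of how a failed replication pulls down the aggregate effect under a fixed-effect (inverse-variance) meta-analysis. The numbers are entirely hypothetical, not from any real study:

```python
def d_variance(d, n1, n2):
    # Approximate sampling variance of Cohen's d for a two-group design
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

# Hypothetical numbers: original study finds d = 1.0 with 20 subjects per group;
# a replication with double the sample (40 per group) finds d = 0.0.
studies = [(1.0, 20, 20), (0.0, 40, 40)]

# Inverse-variance weights: larger, more precise studies count for more
weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
pooled_d = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)

print(round(pooled_d, 2))  # aggregate effect is well below the original d = 1
```

Because the null replication is larger and therefore more precisely estimated, it receives more weight, and the pooled estimate lands much closer to zero than to the original d = 1, even though the original result remains in the literature.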

Replicating down is definitely a useful endeavor, but it can be quite discouraging. You’re conducting an experiment that you are convinced doesn’t make any sense at all. Suppose someone conducted a priming study inspired by a famous quote from Woody Allen’s Husbands and Wives: I can't listen to that much Wagner. I start getting the urge to conquer Poland. Subjects were primed with Wagner or a control composer (Debussy?) and then completed an Urge-to-Conquer-Poland scale. The researchers found that the urge to conquer Poland was much greater in the Wagner than in the Debussy condition (in the Debussy condition, however, people scored remarkably higher on the Desire-to-Walk-Around-with-Baguettes scale). The effect size was large, d = 1. If you set out to replicate this while thinking the result is bogus, then you’re using valuable time and resources that could have been spent on novel experiments. Plus you might feel silly performing the experiment. The whole enterprise might feel all the more discouraging because you are running the experiment with twice (or more) the number of subjects used in the original study: an exercise in futility, but with double the effort.

Other replication attempts are conducted because replicators have at least some confidence in the original finding (and in the method that produced it) but want to establish how robust it is. We might call this replicating up. A successful replication attempt shores up the original finding by yielding similar results and providing a more robust estimate of the effect size. But how is this replicating up? Glad you asked. Up doesn’t mean enlarging the effect size; it means raising the confidence we can have in the effect.

So while replicating down is certainly a noble and useful enterprise, a case could be made for replicating up as well. A nice recent example appears in a special topics section of Frontiers in Cognition that I’m co-editing. My colleagues Peter Verkoeijen and Samantha Bouwmeester performed a replication of an experiment by Kornell and Bjork (2008) that was published in Psychological Science. This experiment compared spaced (or actually “interleaved”) and massed practice in learning painting styles. In the massed practice condition, subjects saw blocks of six paintings by the same artist. In the spaced condition, each block contained six paintings by six different artists. Afterwards, subjects took a recognition test. Intuitively you would think that massed practice would be more effective. Kornell and Bjork thought this initially, as did the subjects in the experiments. Kornell and Bjork were therefore surprised to find that interleaved practice was actually more effective.

Verkoeijen and Bouwmeester replicated one of Kornell and Bjork’s experiments. One difference from the original experiment, which was run in the lab, was that the replication was run on Mechanical Turk. However, given that several other replication projects had shown no major differences between MTurk experiments and lab experiments, there was no reason to think the effect could not be found in an online experiment. As Verkoeijen and Bouwmeester note:

For one, nowhere in their original paper do Kornell and Bjork (2008) indicate that specific sample characteristics are required to obtain a spacing effect in inductive learning. Secondly, replicating the effect with a sample from a more heterogeneous population than the relatively homogeneous undergraduate population would constitute evidence for the robustness and generality of the spacing effect in inductive learning and, therefore, would rule out that the effect is restricted to a rather specific and narrow population.

To cut to the chase, the replication attempt was successful (read the paper for a thoughtful discussion on this). Just as in the original study, the replication found a significant benefit for interleaved over massed practice. The effect sizes for the two experiments were quite similar. As the authors put it:

Our results clearly buttress those of Kornell and Bjork (2008) and taken together they suggest that spacing is indeed beneficial in inductive learning.

This is a nice example of replicating up. Moreover, the experiment has now been brought to a platform (MTurk) where any researcher can easily and quickly run replication attempts.

It seems that I’ve basically extolled the virtues of successful replication. After all, isn’t any successful replication an upward replication? Of course it is. But I’m not talking about the outcome of the replication project; I’m talking about the motivation for initiating it. Replicating down and replicating up are both useful, but in the long run upward replication is going to prove more useful (and less frustrating).

Perhaps a top tier of journals should be created for solid findings in psychology (see Lakens & Koole, 2012, for a similar proposal). This type of journal would publish only findings that have been thoroughly replicated. The fairest way to go about this would be to have the original authors as first authors and the replicators as co-authors. Rather than trying to remove nonreplicable findings from the literature via downward replication, upward replication basically creates a new level in the literature, entrance to which can only be gained via upward replication.

(I thank Peter Verkoeijen for pointing me toward the Woody Allen quote)

[update April 22, 2014: in my next post I discuss a study that would be a good candidate for replicating up.]


  1. This semester I am teaching a graduate course on research methods for social psychology. The group project was to select recently published research that we could directly replicate in a short period of time (i.e., via MTurk). We also selected, out of many possibilities brought to the table, a set of findings from a study that we felt would be very likely to replicate (what you call replicating up). The authors of the original study described their methods/procedures/results very clearly, and had conducted a 2x2 experimental design via MTurk. When we contacted the authors they very graciously shared their Qualtrics file with us so we could implement the exact materials they used. We collected twice as much data, and replicated the significant 2-way interactions on each of the outcome variables, but the pattern of the interaction was in the opposite direction! We double-checked our coding of conditions :) To be fair to the original research we will share our results (and data files) with the original authors as well as run the study with another large sample. Anyway, I wanted to share our experience of attempting to "replicate up". It was a bit of an eye-opener for the students and a very positive experience overall, particularly given the openness of the original authors. It is also possible, of course, for those attempting to "replicate down" to obtain empirical evidence consistent with the original findings.

  2. It is a very nice idea to integrate replication studies into our teaching. I do the same and have also found that it is a very rewarding enterprise for students. I am glad that the original authors were so helpful in your case.

    You're right that it is possible for down-replicators to unexpectedly replicate the original findings, which would be ironic.

  3. A side note:

    Intuitively you would think that massed practice would be more effective. Kornell and Bjork thought this initially, as did the subjects in the experiments. Kornell and Bjork were therefore surprised to find that interleaved practice was actually more effective.
    They should read the literature. It has been known for decades that variable practice hurts performance in the short term but benefits it in the long run, and variable practice always outperforms massed practice. The basic idea is that the former exposes you to a wider sample of the task space which increases the demands on you initially but provides you with a more robust experience of the task space in the longer run.

    1. It'd be interesting to hear their response to this...