Direct replications are very useful, especially given the current state of our field. However, direct replications do have their limitations.
The other day, I was talking with my colleagues Samantha Bouwmeester and Peter Verkoeijen about the logistics of a direct replication that we have signed on to do for Perspectives on Psychological Science. We ended up agreeing that a direct replication informs us about the original finding but not so much about the theory that predicted it. We're obviously not the only ones who are aware of this limitation of direct replications, but here is the gist of our discussion infused with some of my afterthoughts.
We are scheduled to perform a direct replication of Jonathan Schooler’s verbal overshadowing effect. In the original study, subjects were shown a 30-second video clip of a bank robbery. Subsequently, they either wrote down a description of the robber’s face (experimental condition), or they listed the names of the capitals of American states (control condition). Then the subjects solved a crossword puzzle. Finally, they had to pick the bank robber’s face out of a line-up. The subjects in the experimental condition performed significantly worse than those in the control condition, an effect that was attributed to verbal overshadowing.
In this replication project we, along with several other groups, are following a protocol that is modeled on the original study. This makes perfect sense given that we are trying to replicate the original finding. The protocol requires researchers to test subjects between the ages of 18 and 25. They will be shown the same 30-second video clip as was shown in the original study. They will also be shown the same line-up pictures as in the original study. The experiment will be administered in person rather than online.
My colleagues and I wondered how many of these requirements are intrinsic to the theory. For example, the theory does not postulate that verbal overshadowing only occurs in 18- to 25-year-olds. In fact, it would be bordering on the absurd to predict that a 25-year-old will fall prey to verbal overshadowing whereas a 26-year-old will not. Verbal overshadowing is a theory about different types of cognitive representations (verbal and visual) and the conditions under which they interfere with one another. So what do we buy by limiting the sample to a specific age group? It is clear that we are not testing the theory of verbal overshadowing. Rather, we are testing the reproducibility of the original finding, not whether the finding itself says something useful about the theory.
Let’s look at the protocol again. As I just said, the control condition (which, incidentally, was not described in the original study but is described in the protocol) is one in which subjects generate the capitals of American states. The idea behind the control condition evidently is to give the subjects something to do that involves retrieval from memory and language production, which is what they are assumed to do in the experimental condition as well.
But a nitpicker like me might argue that even if you find a difference between the experimental condition and the control condition, which the original study did and which the replication attempts might do as well, this does not provide evidence of overshadowing. Perhaps it is merely the task of describing something, whatever it is, that is responsible for the effect and not the more specific task of describing the robber’s face. I’m not saying this is true but we won’t be able to rule it out empirically.
A better control condition might be one in which subjects are required to describe some target that was not in the video they just saw. For example, they could describe a famous landmark or the face of a celebrity. After all, the theory is not that describing per se is responsible for the effect. The theory is that describing the face that you’re supposed to recognize later from a line-up is going to interfere with your visual memory for that particular face.
So even if all of our replication attempts nicely converge on the finding that the control condition outperforms the experimental condition (and effect sizes are similar), this does not necessarily mean that we’ve strengthened the support for verbal overshadowing. It is still possible that a third condition in which people describe something other than the bank robber would also perform more poorly than the state capital condition. This would lead to the conclusion that simply describing something, anything really, causes verbal overshadowing.
So the question is what we want to achieve with replications. Replications as they are being (or are about to be) performed right now, in Perspectives on Psychological Science (PoPS), the Open Science Framework, or elsewhere, inform us about the reproducibility of specific empirical findings. In other words, they tell us about the reliability of those findings. They don’t tell us much about their validity. Direct replications largely have a meta-function by providing insight into the way we do experiments. It is extremely useful to conduct direct replications and I think the editors of PoPS have done an excellent job in laying out the rules of the direct replication game.
But let’s take a look at this picture that I stole from Wikipedia. Even if all replication attempts reproduce the original finding, we might be in a situation represented by the lower left panel. Sure, all of our experiments show similar effects but none have hit the bull’s eye. The findings are reliable but not valid. Where we want to be is in the bottom right panel where high reliability is coupled with high validity.
How do we get there? Here is an idea: by extending the protocol-based paradigm. For example, a protocol could be extracted from the work on verbal overshadowing that is consensually viewed as the optimal or most incisive way to test this theory. This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then need to specify the criteria for how each slot should be filled.
We’d want the slots to be filled slightly differently across studies; this would prevent the effect from being attributable to quirks of the original stimuli and thus enhance the validity of our findings. To stick with verbal overshadowing, across studies we’d want to use different videos. We’d also want to use different line-ups. By specifying the constraints that stimuli and subjects need to meet we would end up with a better understanding of what the theory does and does not claim.
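To make the slot idea a bit more concrete, here is a minimal Python sketch of a protocol as a set of slots, each with a criterion that a filler has to meet. Everything in it is hypothetical: the slot names, the criteria, and the example studies are my own illustrations and are not taken from the actual replication protocol or from the original study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Slot:
    """One open position in the protocol, with a criterion for acceptable fillers."""
    name: str
    criterion: Callable[[dict], bool]
    description: str


@dataclass
class Protocol:
    slots: List[Slot]

    def violations(self, study: Dict[str, dict]) -> List[str]:
        """Return the names of slots that are unfilled or whose filler fails the criterion."""
        problems = []
        for slot in self.slots:
            filler = study.get(slot.name)
            if filler is None or not slot.criterion(filler):
                problems.append(slot.name)
        return problems


# Hypothetical criteria, chosen only for illustration.
protocol = Protocol(slots=[
    Slot("subjects",
         lambda s: s.get("min_age", 0) >= 18,
         "adult subjects, no upper age limit"),
    Slot("video",
         lambda v: v.get("shows_target_face", False),
         "any clip in which the target's face is visible"),
    Slot("lineup",
         lambda lu: lu.get("includes_target", False) and lu.get("size", 0) >= 2,
         "a line-up of at least two faces that contains the target"),
])

# Two hypothetical studies that fill the slots differently.
study_a = {
    "subjects": {"min_age": 18},
    "video": {"shows_target_face": True, "title": "bank robbery"},
    "lineup": {"includes_target": True, "size": 8},
}
study_b = {
    "subjects": {"min_age": 18},
    "video": {"shows_target_face": True, "title": "street theft"},
    "lineup": {"includes_target": False, "size": 6},
}

print(protocol.violations(study_a))
print(protocol.violations(study_b))
```

The point of writing the criteria down as explicit checks is that different videos and line-ups can fill the same slot, while a study that violates a constraint (here, a line-up without the target) is flagged as falling outside the protocol.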
So while I am fully supportive of (and engaged in) direct replication efforts, I think it is also time to start thinking a bit more about validity in addition to reliability. In the end, we’re primarily interested in having strong theories.