Direct replications are very useful, especially given the current state of our field. However, they do have their limitations.
The other day, I was talking with my colleagues Samantha Bouwmeester and Peter Verkoeijen about the
logistics of a direct replication that we have signed on to do for Perspectives
on Psychological Science. We ended up agreeing that a direct replication informs us about the original finding but not so much about the theory that predicted it. We're obviously not the only ones who are aware of this limitation of direct replications, but here is the gist of our discussion infused with some of my afterthoughts.
We are scheduled to perform a direct replication of Jonathan Schooler’s verbal overshadowing effect. In the original study,
subjects were shown a 30-second video clip of a bank robbery. Subsequently,
they either wrote down a description of the robber’s face (experimental
condition), or they listed the names of the capitals of American states
(control condition). Then the subjects solved a crossword puzzle. Finally, they
had to pick the bank robber’s face out of a line-up. The subjects in the
experimental condition performed significantly worse than those in the control
condition, an effect that was attributed to verbal overshadowing.
In this replication project we—along with several other
groups—are following a protocol
that is tailored after the original study. This makes perfect sense given that
we are trying to replicate the original finding. The protocol requires
researchers to test subjects between the ages of 18 and 25. They will be shown
the same 30-second video clip as was shown in the original study. They will
also be shown the same line-up pictures as in the original study. The
experiment will be administered in person rather than online.
My colleagues and I wondered how many of these requirements are intrinsic to the
theory. For example, the theory does not postulate that verbal overshadowing
only occurs in 18-25 year olds. In fact, it would be bordering on the absurd to
predict that a 25-year old will fall prey to verbal overshadowing whereas a 26-year
old will not. Verbal overshadowing is a theory about different types of
cognitive representations (verbal and visual) and the conditions under which
they interfere with one another. So what do we buy by limiting the sample to a
specific age group? It is clear that we are not testing the theory of verbal overshadowing; rather, we are testing the reproducibility of the original finding, not whether that finding says something useful about the theory.
Let’s look at the protocol again. As I just said, the
control condition (which, incidentally, was not described in the original study but is described in the protocol) is one in
which subjects generate the capitals of American states. The idea behind the
control condition evidently is to give the subjects something to do that
involves retrieval from memory and language production, which is what they are
assumed to do in the experimental condition as well.
But a nitpicker like me might argue that even if you find a difference between the experimental condition and the control condition, which the original study did and which the replication attempts may well do too, this does not provide evidence of overshadowing. Perhaps it is merely the task of describing something, whatever it is, that is responsible for the effect, and not the more specific task of describing the robber’s face. I’m not saying this is true, but we won’t be able to rule it out empirically.
A better control condition might be one in which subjects
are required to describe some target that was not in the video they just saw. For example, they
could describe a famous landmark or the face of a celebrity. After all, the
theory is not that describing per se is responsible for the effect. The theory
is that describing the face that you’re supposed to recognize later from a
line-up is going to interfere with your visual memory for that particular face.
So even if all of our replication attempts nicely converge
on the finding that the control condition outperforms the experimental
condition (and effect sizes are similar), this does not necessarily mean that
we’ve strengthened the support for verbal overshadowing. It is still possible that subjects in a third condition, in which people describe something other than the bank robber, would also perform more poorly than those in the state-capital condition. This would lead to the conclusion that simply describing something, anything really, causes verbal overshadowing.
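To make this inferential logic concrete, here is a minimal simulation sketch in Python. Every number in it (accuracy rates, sample sizes) is invented for illustration and not taken from the original study. Under a face-specific overshadowing account, the hypothetical third condition should pattern with the control condition; under a "describing anything hurts" account, it should pattern with the experimental condition.

```python
# Minimal sketch (hypothetical numbers throughout) of why a third,
# "describe something else" condition is diagnostic: two rival accounts
# agree on the original two-condition contrast but disagree on this one.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n = 50  # subjects per condition (assumed)

# Hypothetical true proportions of correct line-up identifications:
accounts = {
    "face-specific overshadowing": {"control": 0.60, "describe_face": 0.40,
                                    "describe_other": 0.60},
    "describing-anything-hurts":   {"control": 0.60, "describe_face": 0.40,
                                    "describe_other": 0.40},
}

for account, probs in accounts.items():
    # Simulate the number of correct identifications in each condition.
    correct = {cond: rng.binomial(n, p) for cond, p in probs.items()}
    # The diagnostic contrast: describe_other vs. control.
    table = [[correct["describe_other"], n - correct["describe_other"]],
             [correct["control"], n - correct["control"]]]
    _, pval = fisher_exact(table)
    print(f"{account}: correct IDs {correct}; "
          f"describe_other vs. control p = {pval:.3f}")
```

Both accounts predict the difference the original study found, which is precisely why the original two-condition design cannot tell them apart; only the third condition is diagnostic.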
So the question is what we want to achieve with
replications. Replications as they are being (or are about to be) performed right now, in Perspectives on Psychological Science (PoPS), the Open Science Framework, or elsewhere, inform
us about the reproducibility of specific empirical findings. In other words,
they tell us about the reliability of those findings. They don’t tell us much about
their validity. Direct replications largely have a meta-function by providing insight
into the way we do experiments. It is extremely useful to conduct direct
replications and I think the editors of PoPS
have done an excellent job in laying out the rules of the direct replication
game.
But let’s take a look at the classic target diagram of reliability and validity that I stole from Wikipedia. Even if all replication attempts reproduce the original finding, we
might be in a situation represented by the lower left panel. Sure, all of our
experiments show similar effects but none have hit the bull’s eye. The findings
are reliable but not valid. Where we want to be is in the bottom right panel where
high reliability is coupled with high validity.
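For readers who prefer code to pictures, here is a toy sketch of that lower left panel; every number in it is invented. Ten hypothetical replications agree closely with one another (high reliability) while all of them estimate a confounded quantity, the sum of the face-specific effect and a generic effect of describing anything (low validity).

```python
# Toy illustration (all numbers invented) of reliability without validity:
# the replications cluster tightly, but away from the bull's eye.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.05   # assumed face-specific overshadowing effect
confound = 0.15      # assumed generic effect of describing anything
n_replications = 10

# Each direct replication precisely estimates (true effect + confound),
# because the confound is baked into the shared protocol.
estimates = true_effect + confound + rng.normal(0.0, 0.01, n_replications)

print(f"replication estimates: mean {estimates.mean():.3f} "
      f"(sd {estimates.std():.3f})")
print(f"true face-specific effect: {true_effect:.3f}")
```

The agreement across studies tells us the protocol produces the same number every time; it says nothing about whether that number is the one the theory cares about.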
How do we get there? Here is an idea: by extending the protocol-based
paradigm. For example, a protocol could be extracted from the work on verbal overshadowing, one that is consensually viewed as the optimal or most incisive way to test this theory. This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then
need to specify the criteria for how each slot should be filled.
We’d want the slots to be filled slightly differently across
studies; this would prevent the effect from being attributable to quirks of the
original stimuli and thus enhance the validity of our findings. To stick with verbal
overshadowing, across studies we’d want to use different videos. We’d also want
to use different line-ups. By specifying the constraints that stimuli and
subjects need to meet we would end up with a better understanding of what the
theory does and does not claim.
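Purely as a sketch of what such a script might look like, here is one possible rendering in Python; the slot names and constraints are invented for illustration and not drawn from any existing protocol.

```python
# A sketch of the protocol-as-script idea: each slot lists the criteria
# a filler must meet, and each study fills the slots differently within
# those constraints. All names and constraints here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Slot:
    name: str
    constraints: list          # criteria any filler must satisfy
    filler: str = ""           # what one particular study actually uses

@dataclass
class Protocol:
    effect: str
    slots: dict = field(default_factory=dict)

    def fill(self, slot_name, filler):
        """Fill a slot for one study; checking the filler against the
        constraints is left to the researchers applying the protocol."""
        self.slots[slot_name].filler = filler

verbal_overshadowing = Protocol(
    effect="verbal overshadowing",
    slots={
        "video":    Slot("video",    ["shows a to-be-identified face",
                                      "roughly 30 seconds long"]),
        "lineup":   Slot("lineup",   ["contains the target face",
                                      "foils matched on gross features"]),
        "subjects": Slot("subjects", ["no prior exposure to the target face"]),
    },
)

# Two labs fill the same slot differently, within the same constraints:
verbal_overshadowing.fill("video", "staged bank robbery, actor A")
# ...another lab might instead use "staged theft, actor B".
```

An effect that survives many different fillings of the slots supports the theory; one that holds only for the original fillers tells us mainly about those particular stimuli.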
So while I am fully supportive of (and engaged in) direct
replication efforts, I think it is also time to start thinking a bit more about
validity in addition to reliability. In the end, we’re primarily interested in
having strong theories.
Great subject, stimulating post. I'm really worried about blind replication for fMRI experiments, and for similar reasons as you've stated. You have already nicely introduced the control condition and whether it's the most appropriate way to do it. Then there are the manifold ways an fMRI experiment can be conducted sub-optimally. When an experiment runs north of $500/hr it doesn't seem entirely reasonable not to make the replication attempt the best possible experiment. Would we really want to introduce a sub-optimal parameter, say, specifically to perform a facsimile of the original experiment? Perhaps, but only with a lot of careful thought, e.g. if we were under the impression that a systematic acquisition error was the cause of the prior (questionable) result.
I think your focus on validity is excellent, in particular when expensive or time-consuming methods are involved.
Thanks. You make a great point. The costs associated with fMRI experiments may be prohibitive for doing direct replications. Interesting how you call them "blind replications," by the way. I hadn't thought of it this way, but I can see how you might want to call them that in the context of expensive fMRI experiments. Of course, it makes you wonder whether such suboptimal and hugely expensive experiments should have been run in the first place.
"So while I am fully supportive of (and engaged in) direct replication efforts, I think it is also time to start thinking a bit more about validity in addition to reliability. In the end, we’re primarily interested in having strong theories."

Oh God, yes please.
Actually this is a nice example of psychology's fascination with phenomena. We are a science of effects, things that happened, and we will make no real progress until we mature into a science of mechanism, with theories about what should happen. The fact that, say, social priming is so easily disrupted (leading to a mixed bag of successful and unsuccessful replications) is just an enormous hint that we as yet know nothing about why something like social priming might possibly happen.
I went on about this sort of problem here; it's been bugging me for years.
I agree that it looks like we are a science of effects (the Stroop effect, the Simon effect, the Deese effect, the bystander effect, the spacing effect). Perhaps the lack of interest in mechanisms is the most obvious in social priming; see my earlier posts on this topic.
It looks like things are beginning to change, though.
Interesting post. I agree that testing the validity of prior findings is theoretically more important than testing the reliability of prior findings. However, without reliability, there's no replicable substance to quibble about!
In your PoPS replication effort, why not just add the better control condition you describe as a completely separate third condition (n = 50), in addition to the original two conditions described in the protocol (n = 50 each, for a total N = 150)? That way, you could simultaneously contribute to both the reliability and validity questions!
I'd thought of this as well. It's a great way to kill two birds with one stone. For this particular project, though, it might not be feasible for us given the size of our subject pool. We were going for n=120; I don't think we can run 180 subjects in the time given.