Thursday, September 18, 2014

Verbal Overshadowing: What Can We Learn from the First APS Registered Replication Report?

Suppose you witnessed a heinous crime being committed right before your eyes. Suppose further that a few hours later, you’re being interrogated by hard-nosed detectives Olivia Benson and Odafin Tutuola. They ask you to describe the perpetrator. The next day, they call you in to the police station and present you with a lineup. Suppose the suspect is in the lineup. Will you be able to pick him out? A classic study in psychology suggests that Benson and Tutuola have made a mistake by first having you describe the perpetrator, because the very act of describing the perpetrator will make it more difficult for you to pick him out of the lineup.

This finding is known as the verbal overshadowing effect and was discovered by Jonathan Schooler. In the experiment that is of interest here, he and his co-author, Tonya Engstler-Schooler, found that verbally describing the perpetrator led to a 25% accuracy decrease in identifying him. This is a sizeable difference with practical implications. Based on these findings, we’d be right to tell Benson and Tutuola to lay off interviewing you until after the lineup identification.

Here is how the experiment worked.


Subjects first watched a 44-second video clip of a (staged) bank robbery. Then they performed a filler task for 20 minutes, after which they either wrote down a description of the robber (experimental condition) or listed names of US states and their capitals (control condition). After 5 minutes, they performed the lineup identification task.

How reliable is the verbal-overshadowing effect? That is the question that concerns us here. A 25% drop in accuracy seems considerable. Schooler himself observed subsequent research yielded progressively smaller effects, something he referred to as “the decline effect.” This clever move created a win-win situation for him. If the original finding replicates, the verbal overshadowing hypothesis is supported. If it doesn’t, then the decline effect hypothesis is supported.

The verbal overshadowing effect is the target of the first massive Registered Replication Report, which was just published under the direction of Dan Simons (Alex Holcombe is leading the charge on the second project). Thirty-one labs were involved in direct replications of the verbal overshadowing experiment I just described. Our lab was one of the 31. Due to the large number of participating labs and the laws of the alphabet, my curriculum vitae now boasts an article on which I am 92nd author.

Due to an error in the protocol, the initial replication attempt had the description task and the filler task in the wrong order before the lineup task, which made the first set of replications, RRR1, a fairly direct replication of Schooler’s Experiment 4 rather than, as was the plan, his Experiment 1. A second set of experiments, RRR2, was performed to replicate Schooler’s Experiment 1. You see the alternative ordering here.

In Experiment 4, Schooler found that subjects in the verbal description condition were 22% less accurate than those in the control condition. A meta-analysis of the RRR1 experiments yielded a considerably smaller, but still significant, 4% deficit. Of note is that all the replication studies found a smaller effect than the original study, but the original study was also less precise because of its smaller sample size.

Before I tell you about the results of the replication experiments I have a confession to make. I have always considered the concept of verbal overshadowing plausible, even though I might have a somewhat different explanation for it than Schooler (more about this maybe in a later post), but I thought the experiment we were going to replicate was rather weak. I had no confidence that we would find the effect. And indeed, in our lab, we did not obtain the effect. You might argue that this null effect was caused by the contagious skepticism I must have been oozing. But I did not run the experiment. In fact, I did not even interact about the experiment with the research assistant who ran it (no wonder I’m 92nd author on the paper!). So the experiment was well-insulated from my skepticism.

Let's get back on track. In Experiment 1, Schooler found a 25% deficit. The meta-analysis of RRR2 yielded a 16% deficit, somewhat smaller but still in the same ballpark. Verbal overshadowing appears to be a robust effect. Also interesting is the finding that the position of the filler task in the sequence mattered. The verbal overshadowing effect is larger when the lineup identification immediately follows the description and when there is more time between the video and the description. In fact, either of those factors, or a combination of them, could be responsible for the difference in effect sizes between RRR1 and RRR2.

Here are the main points I take away from this massive replication effort.

1. Our intuitions about effects may not be as good as we think. My intuitions were wrong because a meta-analysis of all the experiments finds strong support for the effect. Maybe I’m just a particularly ill-calibrated individual or an overly pessimistic worrywart but I doubt it. For one, I was right about our own experiment, which didn’t find the effect. At the same time, I was clearly wrong about the overall effect. This brings me to the second point.

2. One experiment does not an effect make (or break). This goes both for the original experiment, which did find a big effect, and for our replication attempt (and the 30 others). One experiment that shows an effect doesn’t mean much, and neither does one unsuccessful replication. We already knew this, of course, but the RRR drives this point home nicely.

3. RRRs are very useful for estimating effect sizes without having to worry about publication bias. But it should be noted that they are very costly. Using 31 labs was probably overkill, although it was nice to see all the enthusiasm for a replication project.

4. More power is better. As the article notes about the smaller effect in RRR1: “In fact, all of the confidence intervals for the individual replications in RRR1 included 0. Had we simply tallied the number of studies providing clear evidence for an effect […], we would have concluded in favor of a robust failure to replicate—a misleading conclusion. Moreover, our understanding of the size of the effect would not have improved.”

5. Replicating an effect against your expectations is a joyous experience.  This sounds kind of sappy but it’s an accurate description of my feelings when I was told by Dan Simons about the outcome of the meta-analyses. Maybe I was biased because I liked the notion of verbal overshadowing but it is rewarding to see an effect materialize in a meta-analysis. It's a nice example of “replicating up.”
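To make point 4 concrete, here is a minimal sketch of why tallying individual studies can mislead while pooling them does not. The numbers are entirely hypothetical and are not the RRR data: I assume 30 identical labs, each with 50 subjects per condition, and a 4-point accuracy deficit in the description condition. The pooling is a simple fixed-effect, inverse-variance average of accuracy differences, which is a simplification chosen for illustration and not necessarily the model used in the published report. The function names are my own.

```python
import math


def risk_difference(hits_desc, n_desc, hits_ctrl, n_ctrl):
    """Accuracy difference (description minus control) and its standard error."""
    p_desc = hits_desc / n_desc
    p_ctrl = hits_ctrl / n_ctrl
    se = math.sqrt(p_desc * (1 - p_desc) / n_desc + p_ctrl * (1 - p_ctrl) / n_ctrl)
    return p_desc - p_ctrl, se


def ci95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    half_width = 1.96 * se
    return estimate - half_width, estimate + half_width


def pool_fixed_effect(effects):
    """Inverse-variance weighted mean of (estimate, se) pairs and its standard error."""
    weights = [1 / se ** 2 for _, se in effects]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, effects)) / total
    return pooled, math.sqrt(1 / total)


# Hypothetical labs (NOT the RRR data): 25/50 correct after describing
# the robber versus 27/50 correct in the control condition, in every lab.
labs = [risk_difference(25, 50, 27, 50) for _ in range(30)]

single_lo, single_hi = ci95(*labs[0])
pooled_est, pooled_se = pool_fixed_effect(labs)
pooled_lo, pooled_hi = ci95(pooled_est, pooled_se)

print(f"one lab:  {labs[0][0]:+.3f}  95% CI [{single_lo:+.3f}, {single_hi:+.3f}]")
print(f"pooled:   {pooled_est:+.3f}  95% CI [{pooled_lo:+.3f}, {pooled_hi:+.3f}]")
```

With these made-up numbers, every single lab's interval runs from roughly -0.24 to +0.16 and so includes 0, yet the pooled interval runs from roughly -0.08 to just below 0. Counting "significant" labs would give 0 out of 30, while the pooled estimate recovers the small deficit.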

Where do we go from here? Now that we have a handle on the effect, it would be useful to perform coordinated and preregistered conceptual replications (using different stimuli, different situations, different tasks). I'd be happy to think along with anyone interested in such a project.

Update September 24, 2014. The post is the topic of a discussion on Reddit.


  1. Interestingly, your lab's result fell right in line with the meta-analytic mean for RRR1! .05 vs .04!

  2. Hi, Rolf,

    Congrats on this inspiring blog.

    I've got teased to find out about this effect, and even more where you hint at a different explanation for it. Below is how it strikes me at first sight; then, would you please tell me if that was also Schooler's account (though roughly fit), or if it fits your own better?

    Participants describing the criminal prior to the lineup seem to be betrayed by their own limited lexicon, to the extent that their original perception of the criminal becomes cropped up, simplified, on the basis of their verbal description. Indeed, we all have limited lexicons, which won't come close to what our senses can take in; yet the latter are annoyingly 'retro-plastic' at times.

    Thank you in advance!