Friday, October 24, 2014

ROCing the Boat: When Replication Hurts

“Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen.” This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé to be published in Psychonomic Bulletin & Review.

Replication hurts in such cases because it reinforces artifactual results. Rotello and colleagues marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning, social psychology, and studies of child welfare. In each of these domains researchers make the same mistake by using the same wrong dependent measure.

Common across these domains is that subjects have to make detection judgments: was something present or was it not present? For example, subjects in eyewitness memory experiments decide whether or not the suspect is in a lineup. There are four possibilities.
             Hit: The subject responds “yes” and the suspect is in the lineup.                     
             False alarm: The subject responds “yes” but the suspect is not in the lineup.  
             Miss: The subject responds “no” but the suspect is in the lineup.                      
             Correct rejection: The subject responds “no” and the suspect is not in the lineup.

It is sufficient to only take the positive responses, hits and false alarms, into account if we want to determine decision accuracy (the negative responses are complementary to the positive ones). But the question is how we compute accuracy from hits and false alarms. And this is where Rotello and colleagues say that the literature has gone astray.

To see why, let’s continue with the lineup example. Lineups can be presented simultaneously (all faces at the same time) or sequentially (one face at a time). A meta-analysis of data from 23 labs and 13,143 participants concludes that sequential lineups are superior to simultaneous ones. Sequential lineups yield a diagnosticity ratio of 7.72 and simultaneous ones only 5.78; in other words, by this measure sequential lineups are 1.34 (7.72/5.78) times as accurate as simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research, but this is what they imply.

The diagnosticity ratio is computed by dividing the number of hits by the number of false alarms. Therefore, the higher the ratio, the better the detection rate. So the notion of sequential superiority rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But as Rotello and colleagues demonstrate, this may be all that it has going for it.
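To make the arithmetic concrete, here is a minimal Python sketch of the ratio as described above. The function name and the example counts are made up for illustration; only the two ratios, 7.72 and 5.78, come from the meta-analysis mentioned in this post.

```python
def diagnosticity_ratio(hits, false_alarms):
    """Hits divided by false alarms, as the ratio is described in the post."""
    if false_alarms == 0:
        raise ValueError("ratio is undefined when there are no false alarms")
    return hits / false_alarms


# Hypothetical counts, for illustration only: 60 hits and 20 false alarms.
example = diagnosticity_ratio(60, 20)

# Ratios reported in the meta-analysis discussed above.
sequential = 7.72
simultaneous = 5.78
advantage = sequential / simultaneous

print(f"example ratio: {example:.2f}")
print(f"sequential / simultaneous: {advantage:.2f}")
```

Note that nothing in this calculation says how willing the witnesses were to respond “yes” in the first place.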

If you compute the ratio of hits and false alarms (or the difference between them, as is often done), you’re assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. So the lowest line here connects the subjects who are at chance performance, and thus have a diagnosticity ratio of 1 (# hits = # false alarms). The important point to note is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.
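The linearity assumption can be sketched in a few lines of Python. This is only an illustration: the helper name and the list of false-alarm rates are hypothetical, chosen to include the 5% and 75% responders mentioned above.

```python
def hit_rate_on_ratio_line(false_alarm_rate, ratio):
    """Hit rate implied by a fixed diagnosticity ratio: H = ratio * F."""
    return ratio * false_alarm_rate


# Hypothetical responders who differ only in how often they say "yes".
false_alarm_rates = [0.05, 0.25, 0.50, 0.75]

# Chance performance: a ratio of 1, so hits equal false alarms everywhere.
chance_line = [(f, hit_rate_on_ratio_line(f, 1.0)) for f in false_alarm_rates]

for f, h in chance_line:
    print(f"false alarms {f:.0%}, hits {h:.0%}, ratio {h / f:.1f}")
```

Every point printed here lies on the same straight line through the origin, which is what treating the ratio as a measure of accuracy takes for granted.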

The lines in the figure are called Receiver Operating Characteristics (ROCs). (So now you know what that ROC is doing in the title of this post.) The ROC is a concept developed by engineers in World War II who were trying to improve the detection of enemy objects on battlefields; it was later introduced to the field of psychophysics.

Now let’s look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. Every point on these lines reflects the same accuracy but a different tendency to respond “yes.” The lines that you can fit through these data points will be curved. Rotello and colleagues note that curved ROCs are consistent with the empirical reality and that the straight lines assumed by the diagnosticity ratio are not.

Several large-scale studies have used ROCs rather than diagnosticity and found no evidence whatsoever for a sequential superiority effect in lineups. In fact, all of these studies found the opposite pattern: simultaneous was superior to sequential. So what is the problem with the diagnosticity ratio? As you might have guessed by now, it is that it does not control for response bias. Witnesses presented with a sequential lineup are just less likely to respond “yes I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data unconfound accuracy with response bias and show a simultaneous superiority effect.
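A small simulation can show how response bias alone moves the diagnosticity ratio. The sketch below assumes a textbook equal-variance Gaussian signal detection model; that model, the parameter values, and all function names are assumptions made for illustration, not the analysis reported by Rotello and colleagues.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def rates(d_prime, criterion):
    """Hit and false-alarm rates under an equal-variance Gaussian model.

    Assumption: innocent suspects produce evidence ~ N(0, 1), guilty
    suspects ~ N(d_prime, 1); the witness says "yes" above the criterion.
    """
    hit_rate = 1 - STANDARD_NORMAL.cdf(criterion - d_prime)
    false_alarm_rate = 1 - STANDARD_NORMAL.cdf(criterion)
    return hit_rate, false_alarm_rate


def diagnosticity_ratio(hit_rate, false_alarm_rate):
    return hit_rate / false_alarm_rate


def d_prime_from_rates(hit_rate, false_alarm_rate):
    """Recover accuracy from the rates: z(hit rate) - z(false-alarm rate)."""
    z = STANDARD_NORMAL.inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


D_PRIME = 1.0  # identical underlying accuracy for both hypothetical witnesses

liberal = rates(D_PRIME, criterion=0.5)       # says "yes" readily
conservative = rates(D_PRIME, criterion=1.5)  # says "yes" reluctantly

liberal_ratio = diagnosticity_ratio(*liberal)
conservative_ratio = diagnosticity_ratio(*conservative)

for label, pair, ratio in [
    ("liberal", liberal, liberal_ratio),
    ("conservative", conservative, conservative_ratio),
]:
    print(f"{label}: hits {pair[0]:.3f}, false alarms {pair[1]:.3f}, "
          f"ratio {ratio:.2f}, d' {d_prime_from_rates(*pair):.2f}")
```

Under these assumptions the two witnesses sit on the same curved ROC, yet the more reluctant one earns the higher ratio. That is the sense in which the ratio confounds accuracy with the tendency to respond “yes.”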

Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post, but the broader point is clear. As they put it: “This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time.” Unless we are using the proper dependent measure, replications are even going to aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases massive replications will produce what in polite company is called a Charlie Foxtrot.

Rotello and colleagues conclude by considering the consequences of their analysis for ongoing replication efforts such as the Reproducibility Project and the first Registered Replication Report on verbal overshadowing that we are all so proud of. They refer to a submitted paper that argues the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the perpetrator is not in the lineup. I haven’t read this study yet and so can’t say anything about it, but it sure will make for a great topic for a future post (although I’m already wondering whether I should start hiding under a ROC).

Rotello and colleagues have produced an illuminating analysis that invites us once more to consider how valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons; it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.

I thank Evan Heit for alerting me to the article and for feedback on a previous draft of this post.


  1. In case you're interested, the National Academy of Sciences in the USA just released a comprehensive report on eyewitness identification. The report discusses this issue extensively (both simultaneous/sequential lineup procedures and the use of ROC). You can access it at

    The Perspectives RRR on verbal overshadowing did not include a target-absent condition, and the report discusses the consequences of that for interpreting the results. Specifically, the reduced accuracy in the verbal description condition could be due to impaired memory, or it could just reflect a reduced likelihood to select someone (criterion shift). The pattern of errors in the RRR (but not in the original report) is consistent with such a criterion shift.

    It's great to raise this issue - it's particularly important when the goal is to determine which of two procedures yields better discriminability.

  2. To clarify my comments on twitter: the broader point here is of course quite reasonable. But I have a negative response to sentences like this one: "This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time." (This is from the original paper, not from Rolf). To me this statement translates into something like "bad science worse than flaky science" and I'm not sure I can get behind that. Both are bad in their own way.

    As Rolf wrote on twitter: "we should be careful about what we replicate" - I think that's right, but we should also be careful about the methods we use more generally. The issue doesn't really seem to be primarily about replication, except in the sense that any bad method promotes bad conclusions.

    1. I agree - it's slightly disappointing people feel the need to position themselves in relation to the current replication debate. This issue is much more relevant in the majority of psychological studies that lack a standardized measurement method to begin with, which is an even bigger problem. To examine possible confounds, you first need to know the effect replicates to begin with, so it's important to keep our priorities straight.

    2. Is "bad science worse than flaky science"? We probably all have different answers to that, depending on how bad or flaky the work. In the examples we chose for our paper, I think the bad was pretty awful -- for two of the cases, the eyewitness ID situation Rolf blogged about and the belief bias effect, the bad science went on for at least 3 decades. That's a LOT of extremely consistent but misinterpreted data, which brought with it misdirected theorizing and grant dollars. Even worse, in the ID situation, psychologists made recommendations to law enforcement that led many precincts to switch to sequential lineups, a procedure that the ROCs suggest is inferior.

      Flaky science gets talked about at conferences, with hallway chats about failures to replicate and general skepticism. The misinterpreted science we wrote about appears in the main talks at conferences, because it replicates time and again (thus generating no skepticism), and too little attention is paid to the measures in use. On that final point, we definitely agree: we should be far more careful about our measures. We spent a lot of time worrying about experimental design, and not nearly enough about measurement.

    3. How much replication is needed before we should worry about confounds, bad measures, or poor designs? Is 30 years' worth enough? That's what happened in two of the examples we discuss. Shouldn't we, as a field, be smarter than that?

      Note that we are not suggesting replication is a bad idea - not at all. Our point is that there are effects that are highly replicable but still misunderstood, and that's a serious problem.

    4. This comment has been removed by the author.

    5. Daniel, as I see it the issue in this case is not a lack of standardized methods but the use of methods that are demonstrably wrong. Meta-analyses and replications of these studies are only going to aggravate the problem. Of course, if everyone had posted their raw data, these could have been reanalyzed without too much effort. A good example of the need for open science.

  3. This reminds me of the psychiatry researcher who made up his own metric, "Raw Specificity", when he realized that his test had poor Specificity...!

    1. Indeed a "shocking piece of statistics" as you say at the beginning of your post.