ROCing the Boat: When Replication Hurts

Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen. This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé to be published in Psychonomic Bulletin & Review.

Replication hurts in such cases because it reinforces artifactual results. Rotello and colleagues marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning, social psychology, and studies of child welfare. In each of these domains researchers make the same mistake by using the same wrong dependent measure.

Common across these domains is that subjects have to make detection judgments: was something present or was it not present? For example, subjects in eyewitness memory experiments decide whether or not the suspect is in a lineup. There are four possibilities.

Hit: The subject responds “yes” and the suspect is in the lineup.

False alarm: The subject responds “yes” but the suspect is not in the lineup.

Miss: The subject responds “no” but the suspect is in the lineup.

Correct rejection: Responds “no” and the suspect is not in the lineup.

It is sufficient to only take the positive responses, hits and false alarms, into account if we want to determine decision accuracy (the negative responses are complementary to the positive ones). But the question is how we compute accuracy from hits and false alarms. And this is where Rotello and colleagues say that the literature has gone astray.

To see why, let’s continue with the lineup example. Lineups can be presented simultaneously (all faces at the same time) or sequentially (one face at a time). A meta-analysis involving data from 23 labs involving 13,143 participants concludes that sequential lineups are superior to simultaneous ones. Sequential lineups yield a 7.72 diagnosticity ratio and simultaneous ones only 5.78; in other words, sequential lineups are 1.34 (7.72/5.78) times more accurate than simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research but this is what they imply.

The diagnosticity ratio is computed by dividing the number of hits by the number of false alarms. Therefore, the higher the ratio, the better the detection rate. So the notion of sequential superiority rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But Rotello and colleagues demonstrate, this may be all that it has going for it.

If you compute the ratio of hits and false alarms (or the difference between them, as is often done), you’re assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. So the lowest line here connects the subjects who are at chance performance, and thus have a diagnosticity ratio of 1 (# hits = # false alarms). The important point to note is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.

The lines in the figure are called Receiver Operating Characteristics (ROC). (So now you know what that ROC is doing in the title of this post.) ROC is a concept that was developed by engineers in World War II who were trying to improve ways to detect enemy objects in battlefields and then was introduced to the field of psychophysics.

Now let’s look at some real data.The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. Every point on these lines reflects the same accuracy but a different tendency to respond “yes.” The lines that you can fit through these data points will be curved. Rotello and colleagues note that curved ROCs are consistent with the empirical reality and straight lines assumed by the diagnosticity ratio are not.

Several large-scale studies have used ROCs rather than diagnosticity and found no evidence whatsoever for a sequential superiority effect in lineups. In fact, all of these studies found the opposite pattern: simultaneous was superior to sequential. So what is the problem with the diagnosticity ratio? As you might have guessed by now, it is that it does not control for response bias. Witnesses presented with a sequential lineup are just less likely to respond “yes I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data unconfound accuracy with response bias and show a simultaneous superiority effect.

Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post but the broader point is clear. As they put it: This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time. Unless we are using the proper dependent measure, replications are even going to aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases massive replications will produce what in polite company is called a Charlie Foxtrot.

Rotello and colleagues conclude by considering the consequences of their analysis for ongoing replication efforts such as the Reproducibility Project and the first Registered Replication Report on verbal overshadowing that we are all so proud of. They refer to a submitted paper that argues the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the perpetrator is not in the lineup. I haven’t read this study yet and so can’t say anything about it, but it sure will make for a great topic for a future post (although I’m already wondering whether I should start hiding under a ROC).

Rotello and colleagues have produced an illuminating analysis that invites us once more to consider how valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons, it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.

I thank Evan Heit for alerting me to the article and for feedback on a previous draft of this post.

Reacties

Daniel J. Simons24 oktober 2014 om 20:33
In case you're interested, the National Academy of Sciences in the USA just released a comprehensive report on eyewitness identification. The report discusses this issue extensively (both simultaneous/sequential lineup procedures and the use of ROC). You can access it at http://www.nap.edu/catalog.php?record_id=18891&utm_expid=4418042-5.krRTDpXJQISoXLpdo-1Ynw.0&utm_referrer=http%3A%2F%2Fwww.nap.edu%2F

The Perspectives RRR on verbal overshadowing did not include a target absent condition, and the report discusses the consequences of that for interpreting the results. Specifically, the reduced accuracy in the verbal description condition could be due to impaired memory, or it could just reflect a reduced likelihood to select someone (criterion shift). The pattern of errors in the RRR (but not in the original report) are consistent with such a criterion shift.

It's great to raise this issue - it's particularly important when the goal is to determine which of two procedures yields better discriminability.
BeantwoordenVerwijderen
Reacties
Michael Frank24 oktober 2014 om 22:37
To clarify my comments on twitter: the broader point here is of course quite reasonable. But I have a negative response to sentences like this one: "This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time." (This is from the original paper, not from Rolf). To me this statement translates into something like "bad science worse than flaky science" and I'm not sure I can get behind that. Both are bad in their own way.

As Rolf wrote on twitter: "we should be careful about what we replicate" - I think that's right, but we should also be careful about the methods we use more generally. The issue doesn't really seem to be primarily about replication, except in the sense that any bad method promotes bad conclusions.
BeantwoordenVerwijderen
Reacties
Neuroskeptic25 oktober 2014 om 14:56
This reminds me of the psychiatry researcher who made up his own metric, "Raw Specificity", when he realized that his test had poor Specificity...!
BeantwoordenVerwijderen
Reacties

Reactie toevoegen

Drang naar Samenhang

Zoeken in deze blog

ROCing the Boat: When Replication Hurts

Reacties

Een reactie posten