“Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen.” This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé, to be published in Psychonomic Bulletin & Review.
Replication hurts in such cases because it reinforces artifactual
results. Rotello and colleagues
marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning,
social psychology, and studies of child welfare. In each of these domains, researchers have made the same mistake: they rely on the wrong dependent measure.
What these domains have in common is that subjects have to make detection judgments: was something present or not? For example, subjects in eyewitness
memory experiments decide whether or not the suspect is in a
lineup. There are four possibilities.
Hit: The subject responds “yes” and the suspect is in the lineup.
False alarm: The subject responds “yes” but the suspect is not in the lineup.
Miss: The subject responds “no” but the suspect is in the lineup.
Correct rejection: The subject responds “no” and the suspect is not in the lineup.
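To make the bookkeeping concrete, here is a minimal sketch in Python (my own illustration, not code from the paper) of how the four outcomes, and the hit and false-alarm rates derived from them, could be tallied:

```python
# Minimal sketch: tallying the four detection outcomes from hypothetical data.
# 'responded_yes' and 'suspect_present' are made-up example inputs.

def tally_outcomes(responded_yes, suspect_present):
    """Count hits, false alarms, misses, and correct rejections."""
    hits = false_alarms = misses = correct_rejections = 0
    for yes, present in zip(responded_yes, suspect_present):
        if yes and present:
            hits += 1
        elif yes and not present:
            false_alarms += 1
        elif not yes and present:
            misses += 1
        else:
            correct_rejections += 1
    return hits, false_alarms, misses, correct_rejections

def rates(hits, false_alarms, misses, correct_rejections):
    """Hit rate = hits / (hits + misses); false-alarm rate = FAs / (FAs + correct rejections)."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, fa_rate
```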
If we want to determine decision accuracy, it is sufficient to take only the positive responses, hits and false alarms, into account (the negative responses are complementary to the positive ones). But the question is how we compute accuracy from hits and false alarms, and this is where Rotello and colleagues say the literature has gone astray.
To see why, let’s continue with the lineup example. Lineups can
be presented simultaneously (all faces at the same time) or sequentially (one
face at a time). A meta-analysis of data from 23 labs and 13,143 participants concludes that sequential lineups are superior to simultaneous ones.
Sequential lineups yield a 7.72 diagnosticity
ratio and simultaneous ones only 5.78; in other words, sequential lineups
are 1.34 (7.72/5.78) times more accurate than simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don't state explicitly that this is because of the research, but that is what they imply.
The diagnosticity ratio is computed by dividing the hit rate by the false-alarm rate; the higher the ratio, the better the performance is taken to be. The notion of sequential superiority therefore rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that's at least a start. But as Rotello and colleagues demonstrate, this may be all it has going for it.
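For concreteness, here is a small sketch of the arithmetic (the hit and false-alarm rates below are hypothetical numbers I chose only so that they roughly reproduce the ratios reported above):

```python
# Sketch: the diagnosticity ratio is simply hit rate / false-alarm rate.
# The rates below are hypothetical, chosen only to roughly reproduce the
# ratios reported in the meta-analysis discussed above.

def diagnosticity_ratio(hit_rate, false_alarm_rate):
    return hit_rate / false_alarm_rate

sequential = diagnosticity_ratio(0.270, 0.035)     # ~7.7
simultaneous = diagnosticity_ratio(0.520, 0.090)   # ~5.78
print(sequential / simultaneous)                   # ~1.34, the claimed sequential advantage
```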
If you compute the ratio of hits to false alarms (or the difference between them, as is often done), you're assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. The lowest line, for instance, connects the subjects who are at chance performance and thus have a diagnosticity ratio of 1 (hit rate = false-alarm rate). The important point is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.
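To see how the ratio confounds accuracy with response bias, consider a sketch that assumes an equal-variance Gaussian signal detection model (my assumption for illustration, not an analysis from the paper): a single observer with fixed sensitivity d' produces very different diagnosticity ratios depending only on where the response criterion sits.

```python
# Sketch: under an equal-variance Gaussian signal detection model, sensitivity (d')
# is fixed, yet the diagnosticity ratio changes as the response criterion moves.
from statistics import NormalDist

def rates_for_criterion(d_prime, criterion):
    """Hit and false-alarm rates for a given sensitivity and criterion."""
    hit_rate = 1 - NormalDist(mu=d_prime, sigma=1).cdf(criterion)
    fa_rate = 1 - NormalDist(mu=0, sigma=1).cdf(criterion)
    return hit_rate, fa_rate

d_prime = 1.5  # the same underlying accuracy throughout
for criterion in (0.5, 1.0, 1.5, 2.0):  # increasingly conservative responding
    hr, far = rates_for_criterion(d_prime, criterion)
    print(f"criterion={criterion:.1f}  hit={hr:.2f}  fa={far:.2f}  ratio={hr/far:.1f}")
# The ratio climbs steeply as responding becomes more conservative,
# even though d' never changes.
```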
The lines in the figure are called Receiver Operating Characteristics (ROCs). (So now you know what that ROC is doing in the title of this post.) The concept was developed by engineers during World War II who were trying to improve the detection of enemy objects on the battlefield; it was later introduced into psychophysics.
Now let's look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. The lines that you can fit through these data points are curved; every point on such a line reflects the same accuracy but a different tendency to respond “yes.” Rotello and colleagues note that curved ROCs are consistent with the empirical reality, whereas the straight lines assumed by the diagnosticity ratio are not.
Several large-scale studies have used ROCs rather than the diagnosticity ratio and found no evidence whatsoever for a sequential superiority effect
in lineups. In fact, all of these
studies found the opposite pattern: simultaneous was superior to sequential. So
what is the problem with the diagnosticity ratio? As you might have guessed by
now, it is that it does not control for response bias. Witnesses presented with
a sequential lineup are just less likely to respond “yes I recognize the
suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data disentangle accuracy from response bias and show a simultaneous superiority effect.
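For readers who want to see what using the ROC amounts to in practice, here is a minimal sketch (my own illustration with invented confidence-rating counts, not data or code from any of the studies mentioned) that builds an empirical ROC and compares conditions by the area under the curve, a summary that does not depend on where witnesses set their criterion:

```python
# Sketch: build an empirical ROC from confidence ratings and compare conditions
# by the area under the curve (AUC), which is unaffected by response bias.
# The counts below are invented purely for illustration.

def roc_points(target_present_counts, target_absent_counts):
    """Cumulative hit/false-alarm rates, sweeping the criterion from strict to lenient.
    Counts are per confidence level, ordered from most confident 'suspect present'
    to most confident 'suspect absent'."""
    n_present, n_absent = sum(target_present_counts), sum(target_absent_counts)
    points, hits, fas = [(0.0, 0.0)], 0, 0
    for p, a in zip(target_present_counts, target_absent_counts):
        hits += p
        fas += a
        points.append((fas / n_absent, hits / n_present))
    return points

def auc(points):
    """Area under the empirical ROC via the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# invented confidence distributions, from most to least confident "yes"
simultaneous = roc_points([40, 25, 15, 20], [5, 10, 20, 65])
sequential = roc_points([25, 20, 15, 40], [3, 7, 15, 75])
print(auc(simultaneous), auc(sequential))  # higher AUC = better discriminability
```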
Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post, but the broader point is clear. As
they put it: “This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time.” Unless we are using the proper dependent measure, replications will only aggravate the problem by enshrining artifactual
findings in the literature (all the examples discussed in the article are
“textbook effects”). To use another military reference: in such cases massive
replications will produce what in polite company is called a Charlie
Foxtrot.
Rotello and colleagues conclude by considering the consequences of
their analysis for ongoing replication efforts such as the Reproducibility Project and the
first Registered
Replication Report on verbal overshadowing that we are all so proud of. They
refer to a submitted paper that argues the basic task in the verbal
overshadowing experiment is flawed because it lacks a condition in which the
perpetrator is not in the lineup. I haven’t read this study yet and so can’t
say anything about it, but it sure will make for a great topic for a future
post (although I’m already wondering whether I should start hiding under a ROC).
Rotello and colleagues have produced an illuminating
analysis that invites us once more to consider how
valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons; it even used the verbal overshadowing project as an example. Here is a page with links to this diablog.
I thank Evan Heit for
alerting me to the article and for feedback on a previous draft of this post.
In case you're interested, the National Academy of Sciences in the USA just released a comprehensive report on eyewitness identification. The report discusses this issue extensively (both simultaneous/sequential lineup procedures and the use of ROCs). You can access it at http://www.nap.edu/catalog.php?record_id=18891
The Perspectives RRR on verbal overshadowing did not include a target-absent condition, and the report discusses the consequences of that for interpreting the results. Specifically, the reduced accuracy in the verbal description condition could be due to impaired memory, or it could just reflect a reduced likelihood to select someone (criterion shift). The pattern of errors in the RRR (but not in the original report) is consistent with such a criterion shift.
It's great to raise this issue - it's particularly important when the goal is to determine which of two procedures yields better discriminability.
To clarify my comments on twitter: the broader point here is of course quite reasonable. But I have a negative response to sentences like this one: "This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time." (This is from the original paper, not from Rolf). To me this statement translates into something like "bad science worse than flaky science" and I'm not sure I can get behind that. Both are bad in their own way.
As Rolf wrote on twitter: "we should be careful about what we replicate" - I think that's right, but we should also be careful about the methods we use more generally. The issue doesn't really seem to be primarily about replication, except in the sense that any bad method promotes bad conclusions.
I agree - it's slightly disappointing that people feel the need to position themselves in relation to the current replication debate. This issue is much more relevant to the majority of psychological studies that lack a standardized measurement method, which is an even bigger problem. To examine possible confounds, you first need to know that the effect replicates to begin with, so it's important to keep our priorities straight.
Is "bad science worse than flaky science"? We probably all have different answers to that, depending on how bad or flaky the work. In the examples we chose for our paper, I think the bad was pretty awful -- for two of the cases, the eyewitness ID situation Rolf blogged about and the belief bias effect, the bad science went on for at least 3 decades. That's a LOT of extremely consistent but misinterpreted data, which brought with it misdirected theorizing and grant dollars. Even worse, in the ID situation, psychologists made recommendations to law enforcement that led many precincts to switch to sequential lineups, a procedure that the ROCs suggest is inferior.
Flaky science gets talked about at conferences, with hallway chats about failures to replicate and general skepticism. The misinterpreted science we wrote about appears in the main talks at conferences, because it replicates time and again (thus generating no skepticism), and too little attention is paid to the measures in use. On that final point, we definitely agree: we should be far more careful about our measures. We spent a lot of time worrying about experimental design, and not nearly enough about measurement.
How much replication is needed before we should worry about confounds, bad measures, or poor designs? Is 30 years' worth enough? That's what happened in two of the examples we discuss. Shouldn't we, as a field, be smarter than that?
Note that we are not suggesting replication is a bad idea - not at all. Our point is that there are effects that are highly replicable but still misunderstood, and that's a serious problem.
Daniel, as I see it the issue in this case is not a lack of standardized methods but the use of methods that are demonstrably wrong. Meta-analysis and replications of these studies are only going to aggravate the problem. Of course, if everyone had posted their raw data, these could have been reanalyzed without too much effort. A good example of the need for open science.
This reminds me of the psychiatry researcher who made up his own metric, "Raw Specificity", when he realized that his test had poor Specificity...!
Indeed a "shocking piece of statistics" as you say at the beginning of your post.