Monday, March 21, 2016

Truth in Advertising

As I indicated in my previous post, it is not easy to estimate beforehand to which situations your conclusions generalize. But it is important to at least make an effort. Often, conclusions are wildly oversold, creating a paradoxical situation when the results are not replicated. Usually, a hidden moderator is invoked and all of a sudden the scope of conclusions previously advertised as far-reaching is drastically narrowed. An example of this arrived in my mailbox the other day.

A while ago, our registered replication report of Hart & Albarracin (2011) came out. I already blogged about it. The first author, William Hart, has now written a response to our report. It will appear in the next issue of Perspectives on Psychological Science; I have an advance copy of the response.

Hart doesn’t raise substantive concerns about our report but he does suggest that maybe we didn’t replicate the original findings because the original study was run in Florida; he doesn't specify where, but I'm assuming Gainesville. Most of the studies in our replication were conducted with what he views as more liberal samples.

A number of issues are relevant here. For example, how strong were the original findings in the first place? Another question is how predictive the conservativeness of a county is of the conservativeness of the student body at a university situated in that county. Large universities attract students from all over the country and the way I understand it--political scientists might want to correct me here--students typically vote in their home state. 

I'm going to ignore these questions here because this post is about truth in advertising. Did the original study warn us that the conclusions would only hold for conservative samples? The answer is simple. No, it didn’t. All we get is this.

This doesn't even tell us the university that these 48 students were from, let alone how conservative they are. At least we now have Hart's response, which has narrowed our location down to the Sunshine State. Clearly, at the time the authors didn't think the geographical location of the student sample (let alone its conservativeness) was worth mentioning.

But the discrepancy between truth and advertising is even larger. Here are the article's conclusions.

Rather than alerting the reader that the effect is limited to conservative student samples, this statement suggests that the findings might generalize from the lab to the courtroom!

Am I being fair here? After all, isn't it normal scientific progress when later research finds the limitations of earlier findings? Yep, that's true. But I'm not so sure this is the case here. For one, in his response Hart doesn't provide any evidence that the finding replicates in a conservative sample--he merely offers the suggestion. 

And there is also this question. Does it make sense to generalize from findings with p-values of .01, .03, .02, obtained with a small sample (N=48 in a between-subjects design) and a single vignette, to courtroom behavior?

Rather than using the discussion section for overgeneralizations, it makes more sense to use it for specifying the situations under which the conclusions can be expected to hold. Not only does this provide more truth in advertising but it's also an important theoretical exercise. It's not easy, though.

I plan to pursue the topic of calibrating our conclusions in future posts. Now please excuse me while I try to assemble a Billy, Hemnes, Klippan, Poäng, or Bestå.

Thursday, March 10, 2016

Specifying the Generalizability of Our Conclusions

The other day, the economist Andreas Ortmann kindly put in a plug for my latest blog post on Facebook. This elicited a response from another researcher, who said:
One of my friends tried to replicate one of Rolf Zwaan's findings on verb aspect and failed ages ago, except they did it in Cantonese and not Dutch. This isn't a perfect replication, but if you read the conclusions, then it should be. So part of the problem isn't even the strength of the data, it's also the fact that being over-certain and over-generalizing conclusions is the standard way psychology papers are written.
The commenter then helpfully provided this link to the article (with which I was not familiar). I found the comment interesting, even though it incorrectly states that our study (see also this post from January) was conducted in Dutch. (You can’t even perform the study in Dutch due to grammatical differences with English.) More importantly, though, the claim that our finding was not replicated in Cantonese is also incorrect in a significant way. More about this in a minute.

The most important aspect of the comment is its relevance to the current discussion about replicability, (hidden) moderators, and specifying the generalizability of effects. So let's consider this more closely.

The comment states that psychology studies routinely overgeneralize conclusions. I wholeheartedly agree with this sentiment. In fact, I'd already been thinking of devoting a blog post to this and will do so in the near future. However, I happen to think our article was not a particularly good example of this hyperbolic tendency. If anything, our conclusion was rather modest (if not tepid).
The topic of verb aspect must be empirically addressed in future research so that it can yield a better understanding of how the imperfective and perfective aspects affect situation models during comprehension. The present study furthers our understanding of how subtle grammatical cues such as verb aspect influence the representations formed when we read or hear language. However, this is only the beginning of our pursuit to understand how situation models are constructed from our complicated linguistic code.
Everyone will agree that we didn’t exactly go out on a limb here. We didn’t state explicitly that we thought our findings would extend to other languages, such as Cantonese (a language that we have no knowledge of). On the other hand, we also didn’t stipulate that the conclusions would be restricted to English.

The question that concerns us here is what the proper way to state the limitations of the conclusions would have been. One extreme would be to provide a list of languages for which one would expect the effect to replicate. However, this is humanly impossible. Nobody knows all nearly 7000 languages in the world. The other extreme would be to state that the effect would only be expected to replicate in English, but this also presupposes intimate knowledge of all the world’s other languages. A more realistic option would be to state that one expects the effect to replicate in English and to possibly extend to other languages with similar aspectual systems. It might have been good to add this specification but we didn’t think to do so. The final option would be to say nothing, which is what we did, relying on the reader to infer that we are agnostic with regard to whether the results will replicate in other languages.

Over to the replicability of the finding. Our target finding, called the “perfective-advantage effect” (read the papers if you’re interested in the details) was indeed replicated:
“The results from both experiments 1a and 1b show that there is a perfective advantage with accomplishment verbs. This advantage is robust across two different types of perfective aspect markers in Cantonese, zo2 and jyun4” (p. 2413).
Yee-haw! We've made inroads into Cantonese. Our first step toward world domination. India, you’re next!

Hold your horses, pardner! The authors observed that studies thus far (apparently our finding was replicated in other Asian studies as well) had only looked at one type of verb, called “accomplishments” in Zeno Vendler’s taxonomy. Accomplishments are actions that have an endpoint and that are incremental or gradual. An example of an accomplishment is painting a picture. The action is finished when the picture is completed.

Would the conclusions generalize to other verb types? And here the researchers showed that the perfective advantage reverses to an imperfective advantage (in Cantonese at least), with another verb type namely “activities,” which do not have an endpoint, e.g., run.

So did our effects replicate? Yes, there is a conceptual replication in a different language, Cantonese, with the same verb type and the same task. And no, the effect flips (in Cantonese at least) with a different verb type and the same task. In other words, the effect replicates and does not replicate, but there is method to the madness and this is what matters theoretically. Grammatical aspect interacts with lexical aspect (accomplishments vs. activities). As a result, we have a better understanding of how grammatical aspect affects processing.

Should we (in 2003) have mentioned that our prediction only applied to accomplishments? Yes, I think so. However, this limitation had simply not occurred to us.

There are several lessons here.

(1) It is important to specify the limitations of our findings. However, (a) it is not always possible to do so and (b) sometimes we lack the perspective to do so. We definitely shouldn't oversell our results, though.

(2) As is obvious, there are degrees of directness in replications. Replicating a finding in a different language is powerful support for a prediction. It is, however, not a direct replication of an effect.

(3) Science proceeds by performing replications and extensions of predictions and by detecting their limitations (i.e., failed extensions).

(4) It’s easy to misremember studies. I do it all the time.

I plan to return in a future post to explore the question of how best to determine the generalizability of our predictions.

Monday, March 7, 2016

Why Continue to Elicit False Confessions from the Data?

The other night I was watching a Dateline episode on a false confession that landed someone in jail for a crime he didn’t commit. The story is quite similar to that of Brendan Dassey in Making a Murderer: a learning-disabled boy being coerced by detectives into falsely confessing to having committed heinous crimes. These are very upsetting stories. Even more upsetting is that false confessions are quite common in the US and have led to a great number of wrongful convictions.

It occurred to me that the false confession debate provides an intriguing analogy with the replication debate, which was recently reignited after the publication of a critique in Science of the Reproducibility Project. Many people have written great blog posts about this latest controversy already (e.g., Uri Simonsohn, Simine Vazire, Sanjay Srivastava, Michael Inzlicht, Dorothy Bishop, Daniël Lakens, and David Funder). This post approaches the debate from a different angle. I explore whether the false confession analogy holds a lesson for the reproducibility debate.

In a false confession, a suspect is pushed, cajoled, and bullied by one or more police detectives to confess to having committed a crime. In the Dateline and Making a Murderer cases, it is plainly visible (the interrogations were recorded) that the heinous acts the suspect is made to confess to were all suggested by the detectives themselves. The suspect just randomly guesses until he produces the desired answer.

Maybe the detectives honestly believe that the suspect has committed the crime and that they are just forcing him to own up to the facts, as a result of confirmation bias. Or maybe the detectives don’t really believe the suspect is guilty at all but they need (or are pushed by those higher on the ladder) to make a quick arrest to meet some quota.

Dateline points out that in the UK (and presumably in many other countries) confessions elicited under duress are no longer admissible in court. The interview process has undergone a complete overhaul. This overhaul was initially resisted by veteran detectives. As an English police detective puts it: “Senior people thought that this was a draconian piece of legislation that was gonna prevent us from detecting anything ever again […], that it was going to tie our hands behind our back.” But they were wrong, as Dateline presenter Keith Morrison intones. Detection rates in homicide cases in the UK are over 90%. This makes sense, of course. The police are no longer wasting time on innocent suspects and can now devote their resources to actually solving the crime.

The replication crisis is the product of a whole tradition of extracting false confessions from the data. Researchers push and cajole the data for as long as it takes for the data to “give up” the effect (by not reporting nonsignificant effects, optional stopping, selectively removing outliers, and so on). Whether the researchers really believe the data harbor the predicted effect is a question that no one may be able to answer, but my guess is that the vast majority of researchers sincerely did and do.
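To make one of these interrogation techniques concrete, here is a minimal simulation sketch of optional stopping. It is an illustration under stated assumptions, not an analysis of any study mentioned here: the data are pure noise (the true effect is zero), the test is a simple one-sample z-test with a known standard deviation of 1, and the function names, the looks at N = 10, 20, 30, 40, 50, and the number of simulated studies are all made up for the example.

```python
import math
import random


def z_significant(sample, z_crit=1.96):
    """Two-sided one-sample z-test against mean 0, assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return abs(z) > z_crit


def simulate(n_sims=2000, looks=(10, 20, 30, 40, 50), seed=1):
    """Return (fixed_rate, peeking_rate) of false positives when the null is true.

    All parameter values are hypothetical choices for this illustration.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        # The true effect is zero, so every "significant" result is a false positive.
        data = [rng.gauss(0.0, 1.0) for _ in range(looks[-1])]
        # Planned analysis: test once, at the final sample size.
        if z_significant(data):
            fixed_hits += 1
        # Optional stopping: test at every look and count the first "hit" as a finding.
        if any(z_significant(data[:n]) for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    fixed_rate, peeking_rate = simulate()
    print(f"single test at N=50:       {fixed_rate:.3f}")
    print(f"peeking at N=10,20,...,50: {peeking_rate:.3f}")
```

The single planned test should produce false positives at roughly the nominal 5% rate. The peeking rate can never be lower, because the final look is that same test, and in practice it comes out well above 5%. The data confess to an effect that isn't there simply because they were asked often enough.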

One side of the replication debate wants us to progress toward the UK situation. Eliciting false confessions from the data is no longer admissible in the court of science and new policies are proposed or in place to curb their use (preregistration, open data, open code, open reviews). This is morally the right thing to do, of course, but it also makes great practical sense. Why commit further resources to pursuing “effects” that are likely false confessions? Better to direct our gaze elsewhere.

The other side of the replication debate seems to want to continue the tradition of extracting false confessions from the data. Like the senior police detectives in the UK, they bemoan policies that journals have put in place to curb reliance on false confessions. I suspect members of this side of the debate will turn out to be like those greybeards on the UK police force: on the wrong side of history.