Thursday, March 21, 2013

Beware of Voodoo Experimentation

In my previous post I described our replication attempt of Experiment 1 from Vohs and Schooler (2008). They found large effects of a manipulation of belief in free will (via the reading of passages) on people’s reported belief in free will and on subsequent cheating behavior. We tried to replicate these findings using Mechanical Turk but obtained null results.

What might account for the stark differences between our findings and those of V&S? And, in the spirit of the educational roots of this project, what lessons can we learn from this attempt at replication?

One obvious difference between our findings and those of V&S is in subject populations. Our subjects had an average age of 33 (range 18-69) and were native speakers of English residing in the US (75 males and 77 females). The distribution of education levels was as follows: high school (13%), college no-degree (33%), associate’s degree (13%), bachelor (33%), and master’s/PhD (8%).

How about the subjects in the original study? V&S used… 30 undergraduates (13 females, 17 males); that’s all the paper says. Kathleen Vohs informed us via email that the subjects were undergraduates at the University of Utah. Specifically, they were smart, devoted adults, about half of whom were active in the Mormon Church. One would think this is worth mentioning in the paper; after all, free will is not unimportant to Mormons, as is shown here and here. It is quite true that Psychological Science imposes rather stringent word limits, but still…

Lesson 1: Critical information should not be omitted from method sections. (This sounds like flogging a dead horse, but try to replicate a study and you’ll see how much information is often missing.)

So there clearly is a difference between our subject population and that of the original experiment. We did not ask about religious affiliation (we did not know this was important, as it was not mentioned in the original paper), but I doubt that we are going to find 30 Mormons, 15 of them active in the Mormon Church, in our sample.

What we can do, however, is match our sample to the original in terms of age (also not specified in the original article, but let’s assume late teens to mid-twenties) and level of education. In an analysis of 30 subjects meeting these criteria, we found no significant effect of the manipulation on either the manipulation check or cheating behavior.

So differences in age and level of education from the original sample do not seem to account for our null findings. We cannot be sure, however, whether membership in the Mormon Church plays a role.

Another big difference between our experiment and the original is that our experiment was conducted online and the original in the lab. It has been demonstrated that many classical findings in the psychological literature can be replicated in online experiments (e.g., here), but this doesn’t mean online experiments are suitable for every task.

An obvious issue is that an online study cannot control the environment. To get some idea about the subjects' environment, we always ask them to indicate on a 9-point scale the amount of noise in their environment, with 1 being no noise and no distractions and 9 being much noise and many distractions. The average score was 1.6 on this scale. The majority of subjects (73%) indicated that they were in a quiet environment with no distractions. An additional 11% indicated they were in a quiet environment with some distractions. Very few people indicated being in a noisy environment with distractions. Of course, these are self-report measures, but they do suggest that environmental distractions were not a factor.

Perhaps subjects did not read the manipulation-inducing passages. The original study provides no information on this, but we measured reading times. The average reading time for the passages was 380 ms/word, which is quite normal for texts of this type. There were a few subjects with unusually short reading times; eliminating their data did not change the results. So from what we can tell, the subjects read the texts rather than clicking through them. It would have been even better (for both the original study and the replication attempt) to also include comprehension questions about the passages at the end of the experiment.
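To illustrate the kind of screen we applied, here is a hypothetical sketch in Python. The 100 ms/word cutoff and the function name are illustrative choices for this post, not the exact criterion used in our analysis:

```python
def flag_click_throughs(reading_times_ms, word_counts, min_ms_per_word=100):
    """Flag subjects whose passage reading speed suggests they clicked
    through rather than read. The threshold is an illustrative value."""
    flagged = []
    for subj, (rt, words) in enumerate(zip(reading_times_ms, word_counts)):
        if rt / words < min_ms_per_word:
            flagged.append(subj)
    return flagged

# A subject averaging 380 ms/word passes; one at 50 ms/word is flagged.
suspects = flag_click_throughs([38000, 5000], [100, 100])
```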

Lesson 2: Gather as much information about your manipulation-inducing stimuli as possible.

Another potential problem, which was pointed out by a commenter on the previous post, is that some subjects on Mechanical Turk, “Turkers,” may already have participated in similar experiments and thus not be naïve to the manipulation (see here for a highly informative paper on this topic).

We always ask subjects about their perceptions of the manipulation, and this experiment was no exception. We coded a perception as “aware of the manipulation” if it mentioned “honesty,” “integrity,” “pressing the space bar,” “looking at the answer,” “following instructions,” or something similar. We coded someone as “unaware” if they explicitly stated that they had no idea or if they mentioned a different purpose of the experiment. Some examples are: (1) The study was about judgments and quickness, (2) Deterioration of short term memory, and (3) How quickly people can solve math problems.

According to these criteria, about half the subjects were “aware” of the manipulation. We performed a separate follow-up analysis on the “unaware” subjects. There still was no effect of the manipulation on the amount of cheating. We did find a slightly higher number of incidents of cheating among the “aware” subjects than among the “unaware” subjects. All in all, though, the level of cheating was much lower than in the original study.

So does awareness of the manipulation explain our null findings? I don’t think so. Some commenters on the previous post decried our study for having so many “aware” subjects. They should realize that we don't even know if all 30 subjects in the original study believed the cover story; there is no information on this in the article.

Lesson 3: Always ask subjects about their perceptions of the purpose of the experiment.

I find it hard to believe that the subjects in the original experiment all bought the cover story. Unlike our experiment, the original study provides no information on how many people disbelieved the cover story. Some commenters have suggested that it is easier to convince people of the cover story if you have an actual experimenter. This seems plausible, although it still doesn't seem likely to me that everyone would have believed the story. And of course it would be an awful case of circular reasoning to say that the subjects must have believed the manipulation simply because there was a large effect.

But there is a bigger point. If the large effect reported in the original study hinges on the acting skills of the experimenter, then there should be information on this in the paper. The article merely states that the subjects were told of the glitch. We incorporated what the students were told in our instructions. But if it is not the content of what they were told that is responsible for the effect but rather the manner in which it was told, then there should be information on this. Did the experimenter act distraught, confused, embarrassed, or neutral? And was this performance believable and delivered with any consistency? If the effect hinges on the acting skills of an experimenter, experimentation becomes an art and not a science. In addition to voodoo statistics, we would have voodoo experimentation. (A reader of this post pointed me to this highly relevant article on the ubiquity of voodoo practices in psychological research.)

It should be obvious, but I’d like to state it explicitly anyway: I’m not saying that V&S performed voodoo experimentation. I am just saying that if the claim is that the effect relies on factors that are not (or cannot be) articulated and documented—and I’ve heard people (not V&S) make this claim—then we have voodoo experimentation.

Lesson 4: Beware of Voodoo Experimentation

It is striking that we were not even able to replicate the manipulation check that V&S used. I was told by another researcher (who is also performing a replication of the V&S experiment) that the reliability of the original manipulation check is low (we had not thought to examine this, but we did use the updated version of this scale, the FAD-plus). I do not want to steal this researcher’s thunder, so I will not say anything more about this issue at this point (I will provide an update as soon as the evidence from that researcher's experiment is available). But the fact that we did not replicate the large effect on the manipulation check that was reported in the original study might not count as a strike against our replication attempt.

So where does this leave us? The fact that the large (!) effect of the original study completely evaporated in our experiment cannot be attributed to (1) the age or education levels of the subjects, (2) subjects not reading the manipulation-inducing passages (if reading times are any indicator), or (3) subjects’ awareness of the manipulation. The original paper provides no evidence regarding any of these issues.

The evaporation of the effect could, however, be due to (1) the special nature of the original sample, (2) the undocumented acting skills of a real-life experimenter (voodoo experimentation), or of course (3) the large effect being a false positive. I am leaning towards the third option, although I would not find a small effect implausible (in fact, that is what I was initially expecting to find).

Monday, March 18, 2013

The Value of Believing in Free Will: A Replication Attempt

Update, February 26, 2014: In early March we'll be submitting a manuscript that includes both the experiment described here and another replication attempt run in the lab.

Earlier this year I taught a new course titled Foundations of Cognition. The course is partly devoted to theoretical topics and partly to methodological issues. One of the theoretical topics is free will and one of the methodological topics is replication. There is a lab associated with the course and I thought we’d be killing two birds with one stone if we’d try to replicate a study that was discussed in the first, theoretical, part of the course. The students would then have hands-on experience with replication of a study that they were familiar with. Moreover, we could discuss the results in the context of the methodological literature that we read in the second part of the course.

The experiment I had selected for our replication attempt was Experiment 1 from Vohs & Schooler (2008) on whether a lowered belief in free will would lead people to cheat more. I thought that this was a relatively simple experiment—in terms of programming—that could be run on Mechanical Turk (we needed to be able to collect the data fast, given that it was a five-week course). My first impression after a cursory reading of the article was that we might replicate the result.

In the experiment, subjects read one of two texts, both passages from Francis Crick's 1994 book The Astonishing Hypothesis. One passage argues that free will is an illusion; the other discusses consciousness but does not mention free will. These texts were cleverly chosen, as they are similar in terms of difficulty and writing style. After reading the passages, the subjects complete the Free Will and Determinism scale and the PANAS.

Next comes the meat of the experiment. Subjects solve 20 mental-arithmetic problems (e.g., 1 + 8 + 18 - 12 + 19 - 7 + 17 - 2 + 8 - 4 = ?) but are told that due to a programming glitch, the correct answer will appear on the screen and that they can make it disappear by pressing the spacebar. So if the subject does not press the spacebar, we know they are cheating. Vohs and Schooler (V&S) found that the subjects who had read the anti-free-will text cheated more often than those who had read the neutral text. More about the results later.

My graduate student, Lysanne Post, who is collaborating with me on this, contacted the first author of the paper, informing her about our replication attempt. She was helpful in providing information that could not be gleaned from the paper. It turns out the experiment was run in 2003 and the first author did not remember all of the details of that study. But with the information that was provided and some additional sleuthing we were able to reconstruct the experiment.

We ran the experiment on Mechanical Turk, using 150 subjects. This should give us awesome power because the original experiment used 30 subjects and the effect size was large (.82).
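For readers who want to check this, the power of a two-sided independent-samples t-test can be computed from the noncentral t distribution. A minimal sketch, assuming the reported .82 is Cohen's d, equal group sizes, and an alpha of .05:

```python
from math import sqrt

from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided independent-samples t-test."""
    df = 2 * n_per_group - 2
    ncp = d * sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # Probability of landing beyond the critical value under the alternative
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

power_original = ttest_power(0.82, 15)     # ~30 subjects total: modest power
power_replication = ttest_power(0.82, 75)  # ~150 subjects total: near-certain detection
```

With 75 subjects per condition, the power to detect an effect of that size is essentially at ceiling, which is why a true effect should not have slipped through.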

In V&S's study, subjects in the AFW condition reported weaker free will beliefs (M = 13.6, SD = 2.66) than subjects in the control condition (M = 16.8, SD = 2.67).  In contrast, we found no difference between the AFW condition (M = 25.90, SD = 5.35) and the control condition (M = 25.11, SD = 5.37), p = .37. Also, our averages are noticeably higher than V&S’s.

How about the effect on cheating?

V&S found that subjects in the AFW condition cheated more often (M = 14.00, SD = 4.17) than subjects in the control condition (M = 9.67, SD = 5.58), p < .01, an effect of almost one standard deviation! In contrast, we found no difference in cheating behavior between the AFW condition (M = 4.53, SD = 5.66) and the control condition (M = 5.97, SD = 6.83), p = .158. Clearly, we did not replicate the main effect. It is also important to note that the average level of cheating we observed was much lower than that in the original study.
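As a sanity check, the standardized effect sizes can be recomputed from the reported means and standard deviations. A minimal sketch, assuming equal group sizes and a pooled standard deviation:

```python
from math import sqrt

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d using a pooled SD; assumes equal group sizes."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# V&S cheating scores, AFW vs. control: close to one standard deviation
d_original = cohens_d(14.00, 4.17, 9.67, 5.58)

# Our replication: a small difference in the opposite direction
d_replication = cohens_d(4.53, 5.66, 5.97, 6.83)
```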

V&S reported a .53 correlation between scores on the Free Will subscale and cheating behavior. We, on the other hand, observed a nonsignificant .03 correlation.

There was a further issue. About half our subjects indicated they did not believe the story about the programming glitch (we kind of feared that this might happen). We analyzed the data separately for “believers” and “nonbelievers” but found no effect of condition in either group.

What might account for this series of stark differences between our findings and those of V&S? I will discuss some ideas in my next post. Then I will also talk about some lessons we learned from this replication attempt. Meanwhile, it might be good to reference my first post, which talks about the why of doing replication studies. 

Wednesday, March 13, 2013

Assessing the Armada: Language Comprehension and the Motor System

In the wake of the discovery of mirror neurons, an armada of studies on the role of the brain’s motor system in language processing has appeared over the horizon in the past decade. We review some of this work here. Behavioral studies have shown interactions between reading and motor tasks, and brain-imaging studies have shown that (pre)motor areas of the brain are active during the processing of action words and action sentences.

Some researchers have taken mirror neuron theory and these results to mean that the motor system plays a central role in language comprehension, whereas others are downright skeptical about the role of the motor system.

In our own behavioral studies, we have found interactions between language comprehension and motor actions. Although one can draw only limited conclusions from such experiments, they do suggest that motor resonance is modulated by sentence context. Interactions between reading and action are observed only when the focus of the sentence is on the action, and they occur even when the target word is not an action verb.

In a recent fMRI study (well, recent... I actually had the idea for this study in 2007), we again found evidence that motor resonance is modulated by sentential context. We presented Dutch subjects with Dutch sentences (somehow this made more sense to us than presenting them with Mongolian ones) that contained a subordinate clause. In main clauses, Dutch is, like English, a subject-verb-object (SVO) language. However, in subordinate clauses Dutch uses an SOV order, which means that the verb comes at the end of the sentence.

We made use of this feature of the Dutch language because we wanted to examine the effect of sentence context on the motor activation elicited by action verbs. To this end, we contrasted literal and nonliteral sentences. Here are two examples.

1. Iedereen was blij toen oma de taart aansneed. (Everyone was happy when grandma the cake cut.)
2. Iedereen was blij toen oma een ander onderwerp aansneed. (Everyone was happy when grandma a different topic broached.)

So here we have the same target verb (aansneed) at the end of the sentence. In one case it refers to a manual action and in the other to a mental/verbal one (thank God grandma stopped grandpa from telling that boring fishing story for the zillionth time). If motor activation is verb-driven, then the target verb should elicit similar amounts of (pre)-motor activation for literal and nonliteral sentences. However, if motor activation is modulated by sentence context, then there should be more motor activation elicited by literal sentences than by nonliteral ones.

There were more components to this study (for example concerning somatotopy) but I just want to focus on the literal/nonliteral comparison. We found more motor activation in structurally defined motor areas BA4 (the primary motor cortex) and BA6 (the premotor cortex) for literal than for nonliteral sentences. In other words, we found that sentence context modulates motor activation. Other studies have found similar patterns and they are discussed in our paper.

So motor resonance seems to be driven by conceptual combination rather than by action verbs themselves. I am working on a theoretical account for this and for related findings, which I will describe in a later post and/or paper, but I want to look at a different question here.

Virtually all of the research on language and motor resonance has focused on individual words or sentences. Using these “textoids” might yield a very skewed view of language comprehension. Specifically, in this case it might lead one to overestimate the role of the motor system in language processing.

However you want to slice the cake, even if the (pre)motor cortex reliably responds to every occurrence of the word kick in a story and even if you can establish a causal role for motor activation, what does this tell us about the role of motor activation in discourse comprehension?

If you look even at simple stories like The Ugly Duckling, you’ll find that verbs denoting simple actions are just not very common. Stories—let alone expository texts—tend to be about bigger things than kicking a ball, handing over a pizza, or screwing in a light bulb.

Some years ago, my then-student Larry Taylor and I wrote a paper on language comprehension as fault-tolerant processing. We argued that language can be understood at different levels. A schematic understanding can be had by combining cues from grammar and the closed-class elements of a sentence (function words, suffixes such as -ed). A deeper understanding requires the establishment of causal links between the events described in a sentence, a situation model.

A yet deeper understanding presumably involves a first-person mental simulation of the described events, such that not only the causal connection between the events is established but also the manner in which this connection is formed. We give a concrete example of this in the paper.

According to this reasoning, the role of motor activation is limited to the deep understanding of simple events. Given the small role that simple actions play in narrative and nonnarrative discourse, my current view is that the motor system plays a supportive role in discourse comprehension. It helps “flesh out” representations of simple actions.