Friday, September 27, 2013

30 Questions about Priming with Science and the Department of Corrections

We know about claims that priming with “professor” makes you perform better on a general knowledge test but apparently the benefits of science don’t stop there. A study published earlier this year reports findings that priming with science-related words (logical, theory, laboratory, hypothesis, experiment) makes you more moral. Aren’t we scientists great or what? But before popping the cork on a bottle of champagne, we might want to ask some questions, not just about the research itself but also about the review and publishing process involving this paper. So here goes.

(1) The authors note (without boring the reader with details) that philosophers and historians have argued that science plays a key role in the moral vision of a society of “mutual benefit.” From this they derive the prediction that this notion of science facilitates moral and prosocial judgments. Isn’t this a little fast?
(2) Images of the “evil scientist” (in movies usually portrayed by an actor with a vaguely European accent) pervade modern culture. So if it takes only a cursory discussion of some literature to form a prediction, couldn’t one just as easily predict that priming with science makes you less moral? I’m not saying it does of course; I’m merely questioning the theoretical basis for the prediction.
(3) In Study 1, subjects read a date rape vignette (a little story about a date rape). The vignette is not included in the paper. Why not? There is a reference to a book chapter from 2001 in which that vignette was apparently used in some form (was it the only one by the way?) but most readers will not have direct access to it, which makes it difficult to evaluate the experiment. In other disciplines, such as cognitive psychology, it has been common for decades to include (examples of) stimuli with articles. Did the reviewers see the vignette? If not, how could they evaluate the experiments?
(4) The subjects (university students from a variety of fields) were to judge the morality of the male character’s actions (date rape) on a scale from 1 (completely right) to 100 (completely wrong). Afterwards, they received the question “How much do you believe in science?” For this a 7-point scale was used. Why a 100-point scale in one case and a 7-point scale in the other? The authors may have good reasons for this but they play it close to the vest on this one.
(5) In analyzing the results, the authors classify the students’ field of study as a science or a non-science. Psychology was ranked among the sciences (with physics, chemistry, and biology) but sociology was deemed a non-science. Why? I hope the authors have no friends in the sociology department. Communication was also classified as a non-science. Why? I know many communication researchers who would take issue with this. The point is, this division seems rather arbitrary and provides the researchers with several degrees of freedom.
(6) The authors report a correlation of r=.36, p=.011. What happens to the correlation if, for example, sociology is ranked among the sciences?
(7) Why were no averages per field reported, or at least a scatterplot? Without all this relevant information, the correlation seems meaningless at best. Weren't the reviewers interested in this information? And how about the editor?
(8) Isn’t it ironic that the historians and philosophers, who in the introduction were credited with having introduced the notion of science as a moral force in society, are now hypothesized to be less moral than others (after all, they were ranked among the non-scientists)? This may seem like a trivial point but it really is not when you think about it.
(9) Study 2 uses the vaunted “sentence-unscrambling task” to prime the concept of “science.” You could devote an entire blog post to this task but I will move on only to make a brief observation. The prime words were laboratory, scientists, hypothesis, theory, and logical. The control words were…. Well what were they? The paper isn’t clear about it but it looks like paper and shoes were two of them (there’s no way to tell for sure and apparently no one was interested in finding out). 
(10) Why were the control words not low-frequency long words (assuming shoe and paper are representative of this category) that are low in imageability like the primes? Now the primes stick out like a sore thumb among the other words from which a sentence has to be formed whereas the control words are a much closer fit.
(11) Doesn’t this make the task easier in the control condition? If so, there is another confound.
(12) Were the control words thematically related, like the primes obviously were?
(13) If so, what was the theme? If not, doesn’t it create a confound to have salient words in the prime condition that are thematically related and can never be used in the sentence and to have non-salient words in the control condition that are not thematically related?
(14) Did the researchers inquire after the subjects’ perceptions of the task? Weren't the reviewers and editor curious about this?
(15) Wouldn’t these subjects have picked up on the scientific theme of the primes?
(16) Wouldn’t this have affected their perceptions of the experiment in any way?
(17) What about the results? What about them indeed? Before we can proceed, we need to clear up a tiny issue. It turns out that there are a few booboos in the article. An astute commenter on the paper had noticed anomalies in the results of the study and some impossibly large effect sizes. The first author responded with a string of corrections. In fact, no fewer than 18 of the values reported in the paper were incorrect. Here, I’ve indicated them for you.

You will not find them in the article itself. The corrections can be found in the comment section.
(18) It is a good thing that PLoS ONE has a comment section of course. But the question is this. Shouldn’t such extensive corrections have been incorporated in the paper itself? People who download the pdf version of the article will not know that pretty much all the numbers that are reported in the paper are wrong. That these numbers are wrong is the author’s fault but at least she was forthcoming in providing the corrections. It would seem to be the editor's and publisher's responsibility to make sure the reader has easy access to the correct information. The authors would also be served well by this.
(19) In her correction (which is about 25% of the size of the original paper), the first author explains that the first three studies were rerun because the reviewer requested different, more straightforward dependent variables that directly assessed morality judgments, rather than the ones used in the original submission, which assessed related judgments of punitiveness or blame or were too closely tied to the domain of science. Apparently, many of the errors occurred because the manuscript was not properly updated with the new information. Why did the reviewers and editor miss all of these inconsistencies, though?
(20) And what happened to the discarded experiments? Surely they could have been included along with the new experiments? There are no word limitations at PLoS ONE.  Having authored a 14-experiment paper that was recently published in this journal, I'm pretty sure I'm right on this one.

Let’s return to the paper armed with the correct (or so we assume) results.

(21) The subjects in Study 2 were primed with “science” or read the neutral words (which were not provided to the reader) and then read the date rape vignette (which was not provided to the reader) and made moral judgments about the actions in the vignette (whatever they were). The corrected data show that the subjects in the experimental condition rated the actions as more immoral than did the control condition. However, as the correction also states, the standard deviation was much higher in the control condition (28.02) than in the experimental condition (7.96). These variances are highly unequal; doesn’t this compromise the t-test that was reported?
(22) The corrections mention that the high variance in the neutral condition is caused by two subjects, one giving the date rape a 10 on the 100-point scale (in other words, finding it highly acceptable) and the other a 40. The average for that condition is 81.57, so aren’t these outliers, at least the 10 score? (By the way, was this date-rape approving subject reported to the relevant authorities?)
(23) In Study 3 subjects received the same priming manipulation as in Study 2 and rated the likelihood that they would engage in each of several activities over the next month, some of which were prosocial, some of which were not. The prosocial actions listed were giving to charity, giving blood, and volunteering. Were these all the actions that were used in the experiment? It is not clear from the paper.
(24) Were the values that were used in the statistical test the averages of the responses to the categories of items (e.g., the average rating for the three prosocial actions)?
(25) And what happened to the non-prosocial activities? Shouldn't a proper analysis have included those in a 2 (prime) by 2 (type of activity) ANOVA? 
(26) If this analysis is performed, is the interaction significant?
(27) In the corrected data the effect size is .85. Doesn’t this seem huge? Readers of my previous post already know the answer: Yes, to the untrained eye perhaps but it is the industry standard (Step 7 in that post).
(28) The corrections state that Study 4 originally contained a third condition but that it was left out at the behest of a reviewer who felt that it muddles rather than clarifies the findings (yes, we wouldn’t want the findings to be muddled, would we?). I appreciate the honesty but was everyone, including the editor, on board with this serious amputation?
(29) The initial version of the corrections (yes, I forgot to mention that there were two versions of corrections) mentioned that there were 26 participants in the control condition and 17 in the experimental condition. Where does this huge discrepancy come from? And does it affect the analyses?
(30) In the discussion it is mentioned that Study 2 investigated academic dishonesty. This was one of the experiments that was dropped, right? Another (minor) addition for the corrections perhaps.
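
As a rough illustration of question (21), here is a minimal Python sketch of how one could gauge the problem from summary statistics alone. The two standard deviations are the ones reported in the correction; the group sizes in the example call are hypothetical placeholders (the per-condition n for Study 2 is not given here), and Welch's t-test is my suggestion for an alternative, not something the authors report.

```python
import math


def variance_ratio(sd_a, sd_b):
    """Ratio of the larger variance to the smaller one."""
    big, small = max(sd_a, sd_b), min(sd_a, sd_b)
    return (big / small) ** 2


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike the pooled-variance t-test, this does not assume that the
    two groups have equal variances.
    """
    va = sd_a ** 2 / n_a  # squared standard error of mean A
    vb = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Standard deviations as reported in the correction.
SD_CONTROL = 28.02
SD_SCIENCE = 7.96

print(round(variance_ratio(SD_CONTROL, SD_SCIENCE), 2))  # 12.39

# HYPOTHETICAL group sizes, for illustration only. The means are set to 0
# because only the degrees of freedom are of interest in this call.
N_HYPOTHETICAL = 20
_, df = welch_t(0.0, SD_CONTROL, N_HYPOTHETICAL, 0.0, SD_SCIENCE, N_HYPOTHETICAL)
pooled_df = 2 * N_HYPOTHETICAL - 2
print(f"pooled df = {pooled_df}, Welch df = {df:.1f}")
```

With the reported standard deviations the control variance is roughly 12 times the experimental variance, well beyond the range in which the pooled-variance t-test is usually considered safe, particularly if the group sizes differ as well. Welch's version drops the equal-variance assumption and pays for that with fewer degrees of freedom.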

I guess there are a great many more questions to ask but let me stop here. The article uses logical, hypothesis, theory, laboratory, and scientist as primes. I can make a sentence out of those: Absent a theory, it is logical that there is no basis for the hypothesis that was tested in the laboratory and (sloppily) reported by the scientist.

[Update, April 10, 2014. As I found out only recently (if you're forming a rapid response team, don't forget not to invite me), back in September of last year, the first author of the PLoS ONE article addressed (most of) these questions in the comments section of that article. The response provides more information and acknowledges some weaknesses of the study.]


  1. Anyone designing their own scale should read first.

    In this case, why did the scale for the male character's actions use 1 for saintly and 100 for awful? Why not 1 (or, better, 0) for awful and 100 for good? I can imagine quite a few people instinctively awarding him a "low number" as a "bad person".

  2. Basically there is one problem with this: you can't run an "effect of science-y words" experiment on university students alone. You would expect a ceiling effect in this population (seriously, which sample would be more pro-science than uni students?), making it all the more likely that it is the control condition that has the effect. In fact, there is not really a control condition.

  3. I was the author of the comment that prompted the corrections. I agree with you that PLoS ONE should make the corrections a more integral part of the main article. If you're interested, I blogged about my experiences with this article:

    Regarding your 30 questions, some of the questions are specific to this particular paper. But others -- such as the fact that they didn't provide all of the vignettes and stimuli, or report every imaginably interesting analysis -- are questions you could ask about nearly every study published in every venue. Why isn't it the norm to publish all stimuli and procedural details in a supplement? Or report dead ends and experiments that did not "work"? Or post the raw data so readers can run supporting analyses or see what happens if they rescore or reslice the data in different ways? They are very good questions, but it seems unnecessarily narrow -- and perhaps a touch unfair -- to be asking them about just this study.

    1. Thanks Sanjay, I know it was you. I resonate with your posts on this topic. You are right that it would be unfair to ask these questions just about this study. I have asked them about another study as well, as you will know
      I can only ask these questions one study at a time and I'm not done yet.;) Moreover, the questions are not just addressed to the authors but also to the reviewers, and editor, as I've tried to make clear in this post and in the 50 Questions one.

    2. Two further notes. First, I'm not asking about "any imaginable analysis"; I'm asking about analyses that should have been performed. Second, the vignette is critical to the experiments. How can anyone (especially the reviewers) judge the experiments without the vignette? As I point out in the post, it is very common in other areas to publish stimuli (or examples); and it has been like this for decades.

  4. Very interesting, thank you. As for me, I was surprised about the results of Study 3 about prosocial intentions. I have many friends who actually went to the Arts School and I know that volunteering and various fundraising activities have always been an integral part of their life. I myself (as an undergraduate student) used to take part in various theatrical performances and fundraising has always been a goal of most of them. Therefore, despite the empirical evidence provided by the authors, I suspect that participants from art and theatre (even when not primed with scientific words) would most likely report greater prosocial intentions, compared to students from other fields who are primed with scientific words. In brief, it would be interesting to find out whether the results would remain the same if students from art and theatre only (as a control group) were compared with the "primed" (target) group from other fields. I am less sure that "coding for field of study into that science vs. non-science was based on whether the field relied primarily on empirical methods of experimentation" (authors' response to your post-publication review) is the best division.

  5. Didn't know if you'd seen it, but thought you might find this article from The Economist interesting: