We know about claims that priming with “professor” makes you
perform better on a general knowledge test but apparently the benefits of science don’t
stop there. A study
published earlier this year reports findings that priming with
science-related words (logical, theory, laboratory, hypothesis, experiment)
makes you more moral. Aren’t we scientists great or what? But before popping the
cork on a bottle of champagne, we might want to ask some questions, not just about
the research itself but also about the review and publishing process involving this paper. So here
goes.
(1) The authors note (without boring the reader with details)
that philosophers and historians have argued that science plays a key role in
the moral vision of a society of “mutual benefit.” From this they derive the
prediction that this notion of science facilitates moral and prosocial
judgments. Isn’t this a little fast?
(2) Images of the “evil scientist” (in movies usually portrayed by an
actor with a vaguely European accent) pervade modern culture. So if
it takes only a cursory discussion of some literature to form a prediction,
couldn’t one just as easily predict that priming with science makes
you less moral? I’m not saying it does of course; I’m merely questioning the theoretical
basis for the prediction.
(3) In Study 1, subjects read a date rape vignette (a little
story about a date rape). The vignette is not included in the paper. Why not?
There is a reference to a book chapter from 2001 in which that vignette was
apparently used in some form (was it the only one by the way?) but most readers
will not have direct access to it, which makes it difficult to evaluate the
experiment. In other disciplines, such as cognitive psychology, it has been
common for decades to include (examples of) stimuli with articles. Did the reviewers see the vignette? If not, how could they evaluate the experiments?
(4) The subjects (university students from a variety of
fields) were to judge the morality of the male character’s actions (date rape)
on a scale from 1 (completely right) to 100 (completely wrong). Afterwards,
they received the question “How much do you believe in science?” For this a
7-point scale was used. Why a 100-point scale in one case and a 7-point scale
in the other? The authors may have good reasons for this but they play it close to the vest on this one.
(5) In analyzing the results, the authors classify the
students’ field of study as a science or a non-science. Psychology was ranked
among the sciences (with physics, chemistry, and biology) but sociology was
deemed a non-science. Why? I hope the authors have no friends in the sociology department. Communication was also classified as a non-science. Why? I know many communication researchers who would take issue with this. The point is, this division seems rather arbitrary and provides the researchers with several degrees of freedom.
(6) The authors report a correlation of r=.36, p=.011. What
happens to the correlation if, for example, sociology is ranked among the
sciences?
(7) Why were no averages per field reported, or at least a
scatterplot? Without all this relevant information, the correlation seems
meaningless at best. Weren't the reviewers interested in this information? And how about the editor?
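To be concrete about the kind of check I have in mind, here is a minimal sketch. The raw data are not available, so the numbers below are entirely invented; the point is only to show how easy it would be to recompute the correlation under an alternative field coding and to inspect the per-field means a scatterplot would reveal.

```python
# Purely hypothetical sketch -- the paper's raw data are not public, so these
# numbers are invented only to illustrate the robustness check.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "field": ["physics", "physics", "psychology", "psychology",
              "sociology", "sociology", "communication", "history"],
    "moral_wrongness": [95, 90, 92, 88, 85, 83, 80, 78],  # 1-100 scale
})

def field_correlation(science_fields):
    """Point-biserial r between a science/non-science coding and the ratings."""
    is_science = df["field"].isin(science_fields).astype(int)
    return pearsonr(is_science, df["moral_wrongness"])

# Authors' coding vs. an alternative that counts sociology as a science
print(field_correlation({"physics", "psychology"}))
print(field_correlation({"physics", "psychology", "sociology"}))

# The per-field means that a scatterplot or a simple table would show
print(df.groupby("field")["moral_wrongness"].mean())
```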
(8) Isn’t it ironic that the historians and philosophers, who in the introduction were credited with having introduced the notion of science as a moral force in society, are now hypothesized to be less moral than others (after all, they were ranked among the non-scientists)? This may seem like a trivial point, but it really is not when you think about it.
(9) Study 2 uses the vaunted “sentence-unscrambling task” to prime the concept of “science.” You could devote an entire blog post to this task, but I will move on after making only a brief observation. The prime words were laboratory, scientists, hypothesis, theory, and logical. The control words were….
Well, what were they? The paper isn’t clear about it, but it looks like paper and shoes were two of them (there’s no way to tell for sure, and apparently no one was interested in finding out).
(10) Why were the control words not low-frequency, long words that are low in imageability, like the primes (assuming shoes and paper are representative of the control set)? As it stands, the primes stick out like a sore thumb among the other words from which a sentence has to be formed, whereas the control words are a much closer fit.
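The frequency part of this worry is easy to check in principle. Here is a sketch using the third-party wordfreq package (my choice, not anything the authors report); imageability is left aside because it would require separate norms.

```python
# Sketch only: compares corpus frequencies of the prime words with the two
# control words one can guess at. The 'wordfreq' package is an assumption on
# my part; the authors report no such check.
from wordfreq import word_frequency

primes = ["laboratory", "scientists", "hypothesis", "theory", "logical"]
controls = ["paper", "shoes"]

for word in primes + controls:
    print(f"{word:12s} {word_frequency(word, 'en'):.6f}")
```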
(11) Doesn’t this make the task easier in the control condition?
If so, there is another confound.
(12) Were the control words thematically related, like the
primes obviously were?
(13) If so, what was the theme? If not, doesn’t it create a
confound to have salient words in the prime condition that are thematically
related and can never be used in the sentence and to have non-salient words in
the control condition that are not thematically related?
(14) Did the researchers ask about the subjects’ perceptions of the task? Weren't the reviewers and editor curious about this?
(15) Wouldn’t these subjects have picked up on the
scientific theme of the primes?
(16) Wouldn’t this have affected their perceptions of the
experiment in any way?
(17) What about the results? What about them indeed? Before
we can proceed, we need to clear up a tiny issue. It turns out that there are a
few booboos in the article. An astute commenter on the paper had noticed anomalies
in the results of the study and some impossibly large effect sizes. The
first author responded with a string of corrections.
In fact, no fewer than 18 of the values reported in the paper were incorrect.
Here, I’ve indicated them for you.
You will not find them in the article itself. The
corrections can be found in the comment section.
(18) It is a good thing that PLoS ONE has a comment section, of course. But the question is this.
Shouldn’t such extensive corrections have been incorporated in the paper
itself? People who download the pdf version of the article will not know that
pretty much all the numbers that are reported in the paper are wrong. That these numbers are
wrong is the author’s fault but at least she was forthcoming in providing the
corrections. It would seem to be the editor's and publisher's responsibility to make
sure the reader has easy access to the correct information. The authors would also be served well by this.
(19) In her correction (which runs to about a quarter of the length of the original paper), the first author explains that the first three studies were rerun because the reviewer requested different, more straightforward dependent variables that directly assessed morality judgments, rather than the measures used in the original submission, which tapped punitiveness or blame or were too closely tied to the domain of science. Apparently, many of the errors occurred
because the manuscript was not properly updated with the new information. Why
did the reviewers and editor miss all of these inconsistencies, though?
(20) And what happened to the discarded experiments? Surely they could have been included along with the new
experiments? There are no word limitations at PLoS ONE. Having authored a 14-experiment paper that was recently published in this journal, I'm pretty sure I'm right on this one.
Let’s return to the paper armed with the correct (or so we assume)
results.
(21) The subjects in Study 2 were primed with “science” or
read the neutral words (which were not provided to the reader) and then read
the date rape vignette (which was not provided to the reader) and made moral
judgments about the actions in the vignette (whatever they were). The corrected
data show that the subjects in the experimental condition rated the actions as
more immoral than did the control condition. However, as the correction also
states, the standard deviation was much higher in the control condition (28.02)
than in the experimental condition (7.96). These variances are highly unequal;
doesn’t this compromise the t-test that was reported?
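For what it is worth, the version of the t-test that does not assume equal variances (Welch’s) is a one-line option in most statistics packages. A minimal sketch with simulated stand-in data (the reported SDs were 28.02 and 7.96; the group sizes and the experimental mean below are invented):

```python
# Sketch: Student's t assumes equal variances, Welch's t does not. The data
# here are simulated stand-ins; only the SDs echo the reported values.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control = rng.normal(loc=81.57, scale=28.02, size=30)      # reported mean/SD
experimental = rng.normal(loc=90.0, scale=7.96, size=30)   # mean is invented

print(ttest_ind(experimental, control, equal_var=True))    # Student's t
print(ttest_ind(experimental, control, equal_var=False))   # Welch's t
```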
(22) The corrections mention that the high variance in the
neutral condition is caused by two subjects, one giving the date rape a 10 on
the 100-point scale (in other words, finding it highly acceptable) and the
other a 40. The average for that condition is 81.57, so aren’t these outliers,
at least the 10 score? (By the way, was this date-rape-approving subject reported to the relevant authorities?)
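To illustrate how much two such scores can do to a condition’s mean and standard deviation, here is a toy example. The numbers are invented (the raw data are not public); the point is only how strongly a couple of very low ratings on a 1–100 scale inflate the SD when everyone else scores near the ceiling.

```python
# Invented illustration: a condition whose ratings cluster near the top of the
# scale, with and without two very low scores.
import numpy as np

typical = np.array([95, 90, 88, 92, 85, 93, 87, 91, 89, 94], dtype=float)
with_low_scores = np.append(typical, [10, 40])

print(typical.mean(), typical.std(ddof=1))                  # tight, near ceiling
print(with_low_scores.mean(), with_low_scores.std(ddof=1))  # mean drops, SD balloons
```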
(23) In Study 3 subjects received the same priming manipulation as in Study 2 and rated the likelihood that they would engage in each of several activities in the next month, some of which were prosocial and some of which were not. The prosocial actions listed were giving to
charity, giving blood, and volunteering. Were these all the actions that were
used in the experiment? It is not clear from the paper.
(24) Were the values that were used in the statistical test
the averages of the responses to the categories of items (e.g., the average
rating for the three prosocial actions)?
(25) And what happened to the non-prosocial activities? Shouldn't a proper analysis have included those in a 2 (prime) by 2 (type of activity) ANOVA?
(26) If this analysis is performed, is the interaction significant?
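With only two levels of the within-subjects factor, the prime-by-activity interaction in such a mixed design reduces to comparing the prosocial-minus-non-prosocial difference scores across the two prime conditions. A minimal sketch with invented data:

```python
# Sketch of the interaction test in a 2 (prime, between) x 2 (activity type,
# within) design. With two within-subject levels, the interaction is equivalent
# to a between-groups test on the difference scores. All numbers are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 20
sci_prosocial, sci_other = rng.normal(6, 1, n), rng.normal(4, 1, n)
neu_prosocial, neu_other = rng.normal(5, 1, n), rng.normal(4, 1, n)

diff_science = sci_prosocial - sci_other
diff_neutral = neu_prosocial - neu_other

# A significant difference here corresponds to a significant prime x activity
# interaction in the 2 x 2 mixed ANOVA.
print(ttest_ind(diff_science, diff_neutral, equal_var=False))
```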
(27) In the corrected data the effect size is .85. Doesn’t
this seem huge? Readers of my previous post
already know the answer: Yes, to the untrained eye perhaps, but it is the industry
standard (Step 7 in that post).
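For readers who want to see what a value of that magnitude implies, here is a small sketch of how Cohen’s d is computed from group summaries. The means, SDs, and group sizes are invented; the point is that d ≈ .85 means the two group means sit almost a full (pooled) standard deviation apart.

```python
# Sketch: Cohen's d from group summaries (all numbers invented).
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

print(cohens_d(mean1=6.0, sd1=1.2, n1=20, mean2=5.0, sd2=1.2, n2=20))  # ~0.83
```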
(28) The corrections state that Study 4 originally contained a third condition but that it was left out at the behest of a reviewer who felt that it muddled rather than clarified the findings (yes, we wouldn’t want the findings to be muddled, would we?). I
appreciate the honesty but was everyone, including the editor, on board with this serious amputation?
(29) The initial version of the corrections (yes, I forgot
to mention that there were two versions of corrections) mentioned that there
were 26 participants in the control condition and 17 in the experimental
condition. Where does this huge discrepancy come from? And does it affect the analyses?
(30) In the discussion it is mentioned that Study 2 investigated
academic dishonesty. This was one of the experiments that was dropped, right? Another (minor) addition for the corrections perhaps.
I guess there are a great many more questions to ask but let me stop here. The article uses logical, hypothesis, theory, laboratory, and scientist as primes. I can make a sentence out of those: Absent a theory, it is logical that there is no basis for the hypothesis that was tested in the laboratory and (sloppily) reported by the scientist.
[Update, April 10, 2014. As I found out only recently (if you're forming a rapid response team, don't forget not to invite me), back in September of last year, the first author of the PLoS ONE article addressed (most of) these questions in the comments section of that article. The response provides more information and acknowledges some weaknesses of the study.]
Anyone designing their own scale should read http://academic.brooklyn.cuny.edu/economic/friedman/rateratingscales.htm first.
In this case, why did the scale for the male character's actions use 1 for saintly and 100 for awful? Why not 1 (or, better, 0) for awful and 100 for good? I can imagine quite a few people instinctively awarding him a "low number" as a "bad person".
Basically there is one problem with this: you can't run an "effect of science-y words" experiment on university students alone. You would expect a ceiling effect in this population (seriously, which sample would be more pro-science than uni students?), making it all the more likely that it is the control condition that has the effect. In fact, there is not really a control condition.
I was the author of the comment that prompted the corrections. I agree with you that PLoS ONE should make the corrections a more integral part of the main article. If you're interested, I blogged about my experiences with this article:
http://hardsci.wordpress.com/2013/03/25/reflections-on-a-foray-into-post-publication-peer-review/
http://hardsci.wordpress.com/2013/03/27/pre-publication-peer-review-can-fall-short-anywhere/
Regarding your 30 questions, some of the questions are specific to this particular paper. But others -- such as the fact that they didn't provide all of the vignettes and stimuli, or report every imaginably interesting analysis -- are questions you could ask about nearly every study published in every venue. Why isn't it the norm to publish all stimuli and procedural details in a supplement? Or report dead ends and experiments that did not "work"? Or post the raw data so readers can run supporting analyses or see what happens if they rescore or reslice the data in different ways? They are very good questions, but it seems unnecessarily narrow -- and perhaps a touch unfair -- to be asking them about just this study.
Thanks Sanjay, I know it was you. I resonate with your posts on this topic. You are right that it would be unfair to ask these questions just about this study. I have asked them about another study as well, as you will know: http://rolfzwaan.blogspot.nl/2013/08/50-questions-about-messy-rooms-and.html.
I can only ask these questions one study at a time and I'm not done yet. ;) Moreover, the questions are not just addressed to the authors but also to the reviewers and the editor, as I've tried to make clear in this post and in the 50 Questions one.
Two further notes. First, I'm not asking about "any imaginable analysis"; I'm asking about analyses that should have been performed. Second, the vignette is critical to the experiments. How can anyone (especially the reviewers) judge the experiments without the vignette? As I point out in the post, it is very common in other areas to publish stimuli (or examples); and it has been like this for decades.
Very interesting, thank you. As for me, I was surprised about the results of Study 3 on prosocial intentions. I have many friends who actually went to the Arts School and I know that volunteering and various fundraising activities have always been an integral part of their life. I myself (as an undergraduate student) used to take part in various theatrical performances and fundraising has always been a goal of most of them. Therefore, despite the empirical evidence provided by the authors, I suspect that participants from art and theatre (even when not primed with scientific words) would most likely report greater prosocial intentions, compared to students from other fields who are primed with scientific words. In brief, it would be interesting to find out whether the results would remain the same if one compared students from art and theatre only (as a control group) with the "primed" (target) group from other fields. I am less sure that "coding for field of study into that science vs. non-science was based on whether the field relied primarily on empirical methods of experimentation" (authors' response to your post-publication review) is the best division.
Didn't know if you'd seen it, but thought you might find this article from The Economist interesting: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble