Not only 20-year-old students hypothesize about the effects
of environments on thought; social psychologists do too. My daughter’s hypothesis
is straightforward: messy environments
are distracting. The social psychologists’ hypotheses take us a little
further afield. For example: Messy
environments promote stereotyping. The paper
describing research into this hypothesis was co-authored by Diederik Stapel and
has been retracted. Another hypothesis is that messy environments promote a longing for simplicity. The paper
describing research into this hypothesis was co-authored by Dirk Smeesters and
has been retracted.
Now there is a new study on messiness. It is about to be published
in Psychological Science and has
already received a lot of press coverage.
The main findings are claimed to be that neat environments promote giving to
charity and healthy eating behavior whereas messy
environments promote creativity.
While I was reading the article, many questions arose. Given
their obviousness, I’m surprised that these questions did not occur to the researchers who wrote the paper, the reviewers who commented on the manuscript,
the editor who accepted the manuscript for publication, and the journalists who
wrote breathless news stories about it. So in the rest of this post I’m just
going to list these questions. I will not focus on theoretical aspects of the
study (or the lack thereof), which would have made the list even longer.
My questions follow the structure of the paper.
Experiment 1
Thirty-four Dutch
students participated. They were randomly assigned to an orderly or a
disorderly condition.
(1) Isn’t 34 a small N for a between-subjects design with a
subtle manipulation?
(2) At what university were these students? The authors are
at the University of Minnesota. (I learned via Twitter that the subjects were
likely run at Radboud University in Nijmegen, a university that the authors
are not affiliated with.)
(3) How many male and female students were in the sample?
(4) Is there any additional information on the subjects that
might be relevant?
We manipulated
environmental orderliness by having participants complete the study in an
orderly or disorderly room (Fig. 1).
(5) Doesn’t the “orderly room” look like a testing room and the “disorderly room” like someone’s office? Is, in other words, orderliness the only thing that is varied between conditions or are there one or more confounds?
Participants wrote the
amount, if any, they chose to donate on a sheet of paper, which they placed
into a sealed envelope (so that self-presentation concerns would be dispelled).
(6) Did the subjects actually donate the money? If so, how was this accomplished?
Upon exiting, participants were allowed to
take an apple or chocolate bar, which constituted the measure of healthy food
choice.
(7) What was the motivation for using these particular snacks?
(8) Weren’t the authors worried that some people may never
eat chocolate whereas others never eat apples?
(9) Did the authors have independent information on the
subjects’ snack preferences?
Participants who
completed the study in the orderly room donated more than twice as much as
those who completed the study in the disorderly room (M = €3.19, SD = 3.01,
vs. M = €1.29, SD = 1.76), t(32) = 2.24, p = .03, d = 0.73.
Fully 82% of participants in the orderly room donated some
money, versus 47% in the disorderly room, χ²(1, N = 34) = 4.64, p < .04, ϕ = .37.
(10) Didn’t the authors/reviewers/editor find this a
surprisingly strong effect for such a small sample and such a subtle
manipulation?
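The size of these effects can be roughly recomputed from the summary numbers in the passage quoted above. The sketch below assumes 17 subjects per room (the quoted text only says that 34 students were randomly assigned; 14 of 17 and 8 of 17 do round to the reported 82% and 47%). The function names are made up for this illustration, and because the inputs are rounded means and SDs, small deviations from the published values are to be expected.

```python
import math

def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's t and Cohen's d (pooled SD) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    t = (m1 - m2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    d = (m1 - m2) / pooled_sd
    return t, d

def chi2_and_phi(a, b, c, d):
    """Pearson chi-square (no continuity correction) and phi for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.sqrt(chi2 / n)

# ASSUMPTION: 17 subjects per room (not stated in the quoted passage).
t, d = pooled_t_and_d(3.19, 3.01, 17, 1.29, 1.76, 17)

# ASSUMPTION: 14 of 17 donated in the orderly room, 8 of 17 in the disorderly room.
chi2, phi = chi2_and_phi(14, 3, 8, 9)

print(f"t(32) = {t:.2f}, d = {d:.2f}")
print(f"chi2(1, N = 34) = {chi2:.2f}, phi = {phi:.2f}")
```

Under these assumptions t and χ² land close to the reported 2.24 and 4.64, so the arithmetic itself looks coherent; the question is whether effects of this size are believable with 17 people per room.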
Also as predicted,
participants in the orderly room chose the apple (over the chocolate) more
often than those in the disorderly room
(M = 67% vs. M = 20%), χ²(1, N = 30) = 6.65, p < .05, ϕ = .44.
(11) See previous question.
(12) How can the authors be sure that out of 30 subjects
randomly assigned to two conditions the numbers who normally prefer apples over
chocolate were about equal before the manipulation? Weren’t they worried that
an over-representation of apple lovers in the disorderly room would destroy
their hypothesized effect? If not, why were they unconcerned about this?
(13) What did the subjects do with the snacks? Eat them?
Give them away? Dump them in the trash?
(14) Does it count as a healthy choice if someone selects an
apple but then doesn’t eat it?
(15) How were the snacks presented? Was the chocolate in a
wrapper? And how about the apple?
(16) What would have happened if more subjects in the
orderly room had selected the chocolate? Would the authors have post-hoc
hypothesized that some compensatory mechanism was at work? (Chocolate
counteracts the effects of being in sterile environments.)
(17) Was no one concerned that the donation task might
influence the snack-selection task?
(18) Was no one concerned about demand effects?
(19) Were these two tasks the only tasks that were
performed?
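Question 12 can be made concrete with a little combinatorics. The sketch below assumes, purely for illustration, that exactly 15 of the 30 subjects would pick the apple no matter which room they were in, and that 15 subjects ended up in each room; neither number comes from the paper. It computes the exact chance that random assignment alone puts at least 10 of those 15 apple lovers in the same room.

```python
from math import comb

def prob_k_lovers_in_orderly(k, total=30, lovers=15, per_room=15):
    """Hypergeometric probability that exactly k apple lovers land in the orderly room."""
    return comb(lovers, k) * comb(total - lovers, per_room - k) / comb(total, per_room)

def prob_lopsided(threshold=10, total=30, lovers=15, per_room=15):
    """Chance that either room ends up with at least `threshold` of the apple lovers."""
    p = 0.0
    for k in range(0, min(lovers, per_room) + 1):
        in_orderly, in_disorderly = k, lovers - k
        if max(in_orderly, in_disorderly) >= threshold:
            p += prob_k_lovers_in_orderly(k, total, lovers, per_room)
    return p

# ASSUMPTIONS: 15 apple lovers among 30 subjects, 15 subjects per room.
print(f"P(at least 10 of 15 apple lovers share a room) = {prob_lopsided():.3f}")
```

With these made-up numbers the answer is roughly one in seven, which is not negligible for a sample this small. Different assumptions about baseline snack preferences would of course give different numbers, which is exactly why independent information on those preferences would have been useful.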
Experiment 2
Given that orderliness
is paired with valuing convention, a disorderly state should encourage breaking
with convention, which is needed to be creative (Simonton, 1999). Therefore, we
predicted that being in a disorderly environment would have the desirable
effect of stimulating creativity.
(20) Did the reviewers/editor consider this a convincing
rationale for the prediction?
Forty-eight American
students participated in a two-condition (orderly vs. disorderly environment)
design.
(21) What was these students’ affiliation?
(22) How many males vs. females were in the experiment?
Participants completed
tasks in a room arranged to be either orderly or disorderly (Fig. 2).
(23) Are the authors/reviewers/editor/journalists serious? Is this really the same manipulation of orderliness as in Experiment 1? The room looks orderly alright in the picture on the left but on the right it looks like some errant groundskeeper had just wandered in with a leaf blower on at full blast.
(24) Do the authors/reviewers/editor seriously believe that
orderliness is the only dimension along which the two rooms differ? Are there
no confounds?
(25) What did the subjects say upon entering the disorderly room? Did they perchance say “Is this a practical joke?” In other words, did they take the experiment seriously? As seriously at least as those in the orderly room?
Two coders, blind to
condition, rated each idea on a 3-point scale (1 = not at all creative, 3 = very
creative; κ = .81, p < .01); disagreements were resolved
through discussion.
(26) What were the criteria that were used by the raters?
What is an example of a “very creative” idea?
Results (all
effects were significant and effect sizes were large)
(27) Was nobody surprised about this? Not the authors, not
the reviewers, and not the editor?
Discussion
It could be that our
disorderly laboratory violated participants’ expectations
(28) Was this sentence included for comical effect? If so,
it worked.
Our preferred
explanation, though, is that cues of disorder can produce creativity because
they inspire breaking free of convention
(29) Did the reviewers/editor consider this a satisfactory
explanation? I mean, when it comes to ice cream flavors, I prefer pistachio to
strawberry. Of course, the ice cream vendor doesn’t demand an explanation; he’s
just as happy to sell me the pistachio as he is to sell me the strawberry. We’re talking not about ice cream flavors here, though, but about science, so shouldn't people be held to a higher standard than merely stating their
preferences?
(30) Didn’t anyone find it ironic that the alternative
explanation is supported with a reference whereas the preferred one is not?
Experiment 3
We measured preference
for a new versus a classic option. Participants completed a task that
ostensibly would help local restaurateurs create new menus. One of the options
was labeled differently in the two conditions. That option was framed as either
classic, or new, an unexplored option (Eidelman et al., 2009). We predicted
that participants would choose the option framed as classic more when seated in
an orderly (vs. disorderly) room, and, conversely, that they would choose the
option framed as new more when seated in a disorderly (vs. orderly) room.
(31) Many questions could be asked at this point. I’ll just ask one: WTF? On the positive side, the sequence of experiments does bring back fond memories of Lazy Susan.
One hundred
eighty-eight American adults participated in a 2 (environmental orderliness:
orderly vs. disorderly) × 2 (label: classic vs. new) between-subjects
design.
(32) Who were these mysterious “American adults”? I assume they were not students, unless the authors got tired of typing “students.”
(33) How were they recruited?
(34) What was their age range?
(35) How many of them were male vs. female?
(36) Where were they tested? Were the rooms on a college
campus?
(37) How were they compensated?
We manipulated
environmental orderliness by randomly assigning participants to complete the
study in a room arranged to be orderly or disorderly (Fig. 3)
(38) What did the subjects say when they stepped into the
rooms on the right? Did they say “If you want me to
participate in your experiment, can you first please clean up the mess, or do you want me to hopscotch to my seat?”
(39) Do the authors/reviewers/editor/journalists really
believe that orderliness was the only dimension on which these rooms varied?
The “disorderly” rooms look very staged for example. And the “orderly” rooms look
like the “disorderly” room of Experiment 1.
(40) Was there an effect of room? For example, one orderly room has a boombox whereas the other does not. One disorderly room has a
book weirdly placed behind the monitor whereas the other one has pencils strewn
all over the floor.
Participants imagined
that they were getting a fruit smoothie with a “boost” (i.e., additional
ingredients). Three types of boosts
were available: health, wellness, or vitamin.
(41) Didn’t the subjects have a problem performing this
task? Did anyone care to ask? Maybe I’m a particularly unimaginative guy but I don’t think
I could do a good job imagining a "fruit smoothie with a health boost." And how
is that different from one with a “wellness” or “vitamin” boost anyway?
We varied the framing
of the health-boost option so that it cued the concept of convention or novelty
(Fig. 4). To cue novelty, we added a star with the word new superimposed. To
cue convention, we added a star with the word classic superimposed. The
dependent measure was choice of the health-boost option.
(42) Were the authors confident that this manipulation of
room and the labels “classic” vs. “new” would yield a crossover interaction? I
guess they were but I wonder if anyone else would be, besides the reviewers and
editor of course.
Planned contrasts
supported our predictions (Fig. 5).
(43) No kidding. Evidently, the authors’ ability to create messy rooms is matched only by their ability to obtain perfect crossover interactions. Did the reviewers/editor not think that this interaction is, indeed, very very pretty?
(44) Was the pretest conducted in an orderly room? If so, it shows that there is no preference for label in an orderly room. Doesn't this contradict the main experiment, where a 35% vs. 17% preference was found for the classic label?
General discussion
Orderly environments
promote convention and healthy choices…
(45) Is it a healthy choice if someone selects an apple and
then doesn’t eat it? I guess you could call it that but it would be meaningless unless you're interested in demand effects.
Have the authors/reviewers/editor considered demand effects in any of these
experiments?
Our systematic
investigations revealed that both kinds of settings can enable people to
harness the power of these environments to achieve their goals.
(46) Did the reviewers/editor not think the authors grossly
overstated their results here?
(47) Did no one chuckle when reading about the power of these environments in
connection with the messy rooms?
One such person was
Einstein, who is widely reported to have observed, “If a cluttered desk is a
sign of a cluttered mind, of what, then, is an empty desk a sign?” (e.g.,
www.goodreads.com)
(48) Was it too much trouble to locate the source of this
quote?
Author contributions
Data collection and
analyses were overseen by all authors.
(49) How did this work if the data were collected at a
university that none of the authors are affiliated with? I’m sure it can be
done, but it would be important to know. And what does “overseen” mean here?
(50) And finally, does reading this article prime any thoughts
of Stapel and Smeesters?
I’m sure that there are a lot more questions that could be
asked about this research. My point is that they should have been asked and
answered by all concerned before the research was published and before big
claims about it were made in the media.
I hope no one will mind if I keep my office reasonably
orderly. I’m sure the lady who cleans my office won’t appreciate me ransacking
the place just so I can be more creative. And I don't think this study has convinced
me that it would matter anyway. In fact, I find my daughter’s hypothesis far
more compelling—and she didn’t even need imaginary smoothies, .8 effect sizes,
and perfect crossover interactions to convince me: Too much clutter is distracting. Papers on messiness are a case in
point.
I can't imagine anything as small as changing one word in the marketing blurb producing the magnitude of that crossover effect. Do the experimenters live in some fantasy world where people really pay attention to that level of consumerist detail? (Maybe people actually do work like that, in which case, I guess we're all screwed.)
I can't imagine this either. I am beginning to think that some researchers and reviewers live in a fantasy world where such subtle manipulations and huge effects are the norm.
I must confess that I really liked the idea of this experiment. Unfortunately, however, after my own reading of the article I also felt disappointed. The major problem is that I do not understand how the authors concluded that a tidy or messy environment is a better predictor of healthy choices, generosity, and conventionality than individual differences, for instance (it seems the authors do not speak about individual differences/preferences at all). For example, if I were a participant in that experiment, I would not choose a chocolate just because I don’t eat it (as you mentioned in p.8). Similarly, someone might prefer a bar of chocolate to an apple just because it (chocolate) is usually more expensive, and thus is a better reward for the experiment. In a word, there are many other factors that could explain the choice of one product over the other.
However, the main reason I left this comment here was to thank you for the time you devote to writing your blogs. They are indeed very helpful for many young researchers like me (and not only the ones interested in embodied cognition) as they give really useful tips on how to conduct solid experimental research.
Oleksandr.
Thank you very much, Oleksandr. I am glad you find the blog useful. I very much enjoy writing it.
My feeling with a lot of these social priming experiments is that the idea does not seem implausible a priori. However, it is usually tested in a way that you think it is never going to work. So the results seem extremely implausible in light of the method and not necessarily because the hypothesis is weird.
Great post. I'd like to add that a Bayes factor analysis of those borderline p-values (.03<p<.05) is likely to reveal that the statistical evidence falls in Jeffreys' category "not worth more than a bare mention".
E.J.
Thanks EJ. It's a good point. My sense is that the study has bigger problems than the .03<p<.05, which are essentially meaningless anyway given all the confounds.
This is a very good post. My enjoyment of it was spoiled by one thing, however. In question 40, the thing you call a "ghetto blaster" is more usually described as a "boom box" or "stereo." The word "ghetto" in this context evokes, at least for American readers, harshly negative racial and economic stereotypes of inner-city Black youth. You can get some sense of the range of meanings of the word at Urban Dictionary, a collaboratively-written slang dictionary: http://www.urbandictionary.com/define.php?term=ghetto
I am going to hazard the guess that this is an inadvertent terminological misstep. To avoid giving offense, the word "ghetto" is best avoided in English except in a historical context discussing the forced segregation imposed on Jewish people in Europe before and during WWII.
Your point is well taken. I changed it to "boombox." I'm obviously familiar with that term but it is called a "ghetto blaster" in Dutch and this is probably why it made its way into the post.
As someone whose parents were children in a city that was partly destroyed by the Nazis, I'm obviously not in need of a history lesson about WWII and as someone who has lived for 15 years below the Mason-Dixon line I am also familiar with America's racist history and present.
I think the publication of this study might actually provide a data point in favor of the authors' hypothesis: "Orderly [methods in scientific papers] promote conventional wisdom and healthy choices [on the part of reviewers and editors]" ... whereas papers that are a methodological mess apparently induce reviewers and editors to be swayed by novelty and willing to make unhealthy publication-related choices!
Very nicely put!
Experiment 3:
35% + 17% + 36% + 18% = 106% ???
Perhaps it is possible in some way to get to 104% due to rounding but +6% participants would be N=199.28. For N=188 these percentages give fractional cell counts.
"We performed a logistic regression with choice of the health boost as the dependent measure, and environmental orderliness and label as between-subject factors. The main effects were not significant (χ2s < 0.5), whereas the expected interaction was, χ2(1, N = 188) = 7.59, p < .01, ϕ = .20"
I'm a bit confused about this... a logistic regression was conducted but χ2 are reported? This could refer to a LR χ2 test of subsequent models, hence df=1, but it reads as if individual effects are reported... what were the parameter estimates? It is entirely possible that I missed a class, but what do the χ2 test statistics represent in the context of a logistic regression/ planned contrast?
---
By the way, a rough calculation of the incredibility index (Schimmack, 2012) of this paper is (using Exp. 1 and 2):
Number of sig. results for core predictions = 100%
Average Power based on reported d: 0.745
Probability of making type 1 error for the 4 tests that reported d: alpha err. prob = 0.05^4 = .00000625
Total power: ability to detect an effect of 0.745 at err. prob. .00000625 with average N=60 is (1-beta) = 0.03
Incredibility index = 97%
97% of studies with these stats would have found at least 1 nonsignificant result.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17(4), 551–66. doi:10.1037/a0029487
I don't think the percentages you mention should be added up. The way I understand it (I agree it is not described very clearly in the paper), subjects could choose one of three options with only one of the options being labeled. So in the orderly room, 52% of the subjects selected the labeled option and 48% selected one of the two other options. In the disorderly room, 54% selected the labeled option and 46% selected one of the two other options. What's really puzzling about these percentages is what I describe in Question 44.
Of course! I was initially trying to reconstruct the 2x2 table that I thought the χ2 was based on, that's why I lumped them together.
So just about half of the 188 participants chose the labelled option even though it was made salient and appeared as the first choice.
Q.44 is indeed crucial
"what do the χ2 test statistics represent in the context of a logistic regression/ planned contrast?"
They are most likely either Wald chi-squares (i.e., the square of the Wald z-statistics), or likelihood-ratio chi-squares from nested model comparisons, as you noted. My guess is that they are the former, but it's not entirely clear. Anyway, neither of these would be unusual or surprising for logistic regression. So I guess I am not sure what you think is strange about the reporting of chi-squares here.
The percentages Rolf discusses above are the percentages for the pretest, where participants had to rate two differently labelled options. However, the way I understand it, the percentages shown in Figure 5 represent the percentage of participants in each group choosing the labelled option. That is, if I am not mistaken, it seems only the health boost option was labelled in the experiment and the other options were not. If this was the case, then the option that was made more salient with either a novelty or classic cue was selected by just 26.5% of participants averaged over groups. Which is also a novel and interesting finding.
Hi Rolf,
Very interesting stuff. Have you received any contact from the authors of this study? Do you know if they are aware of your critique/questions? I would love to hear their responses.
Hi Chris,
Thanks! I haven't heard from the authors. This would be interesting of course. Even more interesting might be to hear from the editor and/or reviewers; of course I don't know who they are. I suspect they are all aware of the questions, as the post has received thousands of views and has made the rounds on social media. We'll see what the future will bring. I did receive emails from various well-known social (and cognitive) psychologists who very much agreed with me.
My biggest question is how you could test 3 hypotheses with only 3 experiments and publish it in PS.
Piggybacking on your point about possible confounds in the rooms, it seems fairly clear that "room" is a random effect, but they are treating it as a fixed effect. The effective sample size in each of the "orderly" and "disorderly" groups is thus N=1; running more participants may tell us a lot about these two rooms, but next to nothing about the populations of "orderly" versus "disorderly" rooms. The problem of treating a random effect as fixed is a well-known methodological issue that a reviewer/editor should have picked up on.
I am a researcher in a completely different field. My desk sometimes tends to look very messy, but it is my own mess. I think my own mess frequently helps with the thought process because the apparent mess is full of cues that are meaningful to me, and related to what I am working on at the time. A mess staged by an outsider may or may not carry information meaningful to the subjects; a mess that one has built oneself can be a very rich source of information. So replication should, I think, not only include enough subjects, but also different kinds of messy and tidy arrangements/environments to which each subject would be exposed. Or is the comparison between messy vs. tidy meaningful at all?