Friday, August 16, 2013

50 Questions About Messy Rooms and Clean Data

About a month ago, I had a difficult conversation with my daughter. Her year in college had not gone particularly well and I asked her what she was going to do differently next year. One of the first things she was going to do, she said, was to clean up her student room. It was just too cluttered to concentrate.

Not only 20-year-old students hypothesize about the effects of environments on thought; social psychologists do too. My daughter’s hypothesis is straightforward: messy environments are distracting. The social psychologists’ hypotheses take us a little further afield. For example: Messy environments promote stereotyping. The paper describing research into this hypothesis was co-authored by Diederik Stapel and has been retracted. Another hypothesis is that messy environments promote a longing for simplicity. The paper describing research into this hypothesis was co-authored by Dirk Smeesters and has been retracted.

Now there is a new study on messiness. It is about to be published in Psychological Science and has already received a lot of press coverage. The main findings are claimed to be that neat environments promote giving to charity and healthy eating behavior whereas messy environments promote creativity.

While I was reading the article, many questions arose. Given their obviousness, I’m surprised that these questions did not occur to the researchers who wrote the paper, the reviewers who commented on the manuscript, the editor who accepted the manuscript for publication, and the journalists who wrote breathless news stories about it. So in the rest of this post I’m just going to list these questions. I will not focus on theoretical aspects of the study (or the lack thereof), which would have made the list even longer.

My questions follow the structure of the paper.

Experiment 1

Thirty-four Dutch students participated. They were randomly assigned to an orderly or a disorderly condition.

(1) Isn’t 34 a small N for a between-subjects design with a subtle manipulation?
(2) At what university were these students? The authors are at the University of Minnesota. (I learned via Twitter that the subjects were likely run at Radboud University in Nijmegen, a university that the authors are not affiliated with.)
(3) How many male and female students were in the sample?
(4) Is there any additional information on the subjects that might be relevant?

We manipulated environmental orderliness by having participants complete the study in an orderly or disorderly room (Fig. 1).

(5) Doesn’t the “orderly room” look like a testing room and the “disorderly room” like someone’s office? Is, in other words, orderliness the only thing that is varied between conditions or are there one or more confounds? 

Participants wrote the amount, if any, they chose to donate on a sheet of paper, which they placed into a sealed envelope (so that self-presentation concerns would be dispelled).

(6) Did the subjects actually donate the money? If so, how was this accomplished?

Upon exiting, participants were allowed to take an apple or chocolate bar, which constituted the measure of healthy food choice.

(7) What was the motivation for using these particular snacks?
(8) Weren’t the authors worried that some people may never eat chocolate whereas others never eat apples?
(9) Did the authors have independent information on the subjects’ snack preferences?

Participants who completed the study in the orderly room donated more than twice as much as those who completed the study in the disorderly room (M = €3.19, SD = 3.01, vs. M = €1.29, SD = 1.76), t(32) = 2.24, p = .03, d = 0.73. Fully 82% of participants in the orderly room donated some money, versus 47% in the disorderly room, χ2(1, N = 34) = 4.64, p < .04, ϕ = .37.

(10) Didn’t the authors/reviewers/editor find this a surprisingly strong effect for such a small sample and such a subtle manipulation?
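
As a quick consistency check, the reported test statistics can be recomputed from the summary numbers quoted above. This is only a sketch under my own assumptions, not the authors' analysis: it assumes 17 participants per room (34 split evenly) and that 82% and 47% correspond to 14 and 8 donors. The function names are made up for illustration.

```python
import math

def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's t and Cohen's d from summary statistics, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    d = (m1 - m2) / math.sqrt(pooled_var)
    return t, d

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and phi for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.sqrt(chi2 / n)

# Assumption: 17 participants per room; 14 of 17 (82%) and 8 of 17 (47%) donated.
t, d = pooled_t_and_d(3.19, 3.01, 17, 1.29, 1.76, 17)
chi2, phi = chi2_2x2(14, 3, 8, 9)
print(f"t(32) = {t:.2f}, d = {d:.2f}")
print(f"chi2(1) = {chi2:.2f}, phi = {phi:.2f}")
```

Under these assumptions the chi-square and ϕ match the reported values and t is close, while d comes out somewhat higher than the reported 0.73. I can't tell from the paper whether that reflects rounding, unequal group sizes, or a different formula for d.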

Also as predicted, participants in the orderly room chose the apple (over the chocolate) more often than those in the disorderly room (M = 67% vs. M = 20%), χ2(1, N = 30) = 6.65, p < .05, ϕ = .44.

(11) See previous question.
(12) How can the authors be sure that out of 30 subjects randomly assigned to two conditions the numbers who normally prefer apples over chocolate were about equal before the manipulation? Weren’t they worried that an over-representation of apple lovers in the disorderly room would destroy their hypothesized effect? If not, why were they unconcerned about this?
(13) What did the subjects do with the snacks? Eat them? Give them away? Dump them in the trash?
(14) Does it count as a healthy choice if someone selects an apple but then doesn’t eat it?
(15) How were the snacks presented? Was the chocolate in a wrapper? And how about the apple?
(16) What would have happened if more subjects in the orderly room had selected the chocolate? Would the authors have post-hoc hypothesized that some compensatory mechanism was at work? (Chocolate counteracts the effects of being in sterile environments.)
(17) Was no one concerned that the donation task might influence the snack-selection task?
(18) Was no one concerned about demand effects?
(19) Were these two tasks the only tasks that were performed?

Experiment 2

Given that orderliness is paired with valuing convention, a disorderly state should encourage breaking with convention, which is needed to be creative (Simonton, 1999). Therefore, we predicted that being in a disorderly environment would have the desirable effect of stimulating creativity.

(20) Did the reviewers/editor consider this a convincing rationale for the prediction?

Forty-eight American students participated in a two-condition (orderly vs. disorderly environment) design.

(21) What was these students’ affiliation?
(22) How many males vs. females were in the experiment?

Participants completed tasks in a room arranged to be either orderly or disorderly (Fig. 2).

(23) Are the authors/reviewers/editor/journalists serious? Is this really the same manipulation of orderliness as in Experiment 1? The room looks orderly alright in the picture on the left but on the right it looks like some errant groundskeeper had just wandered in with a leaf blower on at full blast.
(24) Do the authors/reviewers/editor seriously believe that orderliness is the only dimension along which the two rooms differ? Are there no confounds?
(25) What did the subjects say upon entering the disorderly room? Did they perchance say "Is this a practical joke?" In other words, did they take the experiment seriously? As seriously at least as those in the orderly room?

Two coders, blind to condition, rated each idea on a 3-point scale (1 = not at all creative, 3 = very creative; κ = .81, p < .01); disagreements were resolved through discussion.

(26) What were the criteria that were used by the raters? What is an example of a “very creative” idea?

Results (all effects were significant and effect sizes were large)

(27) Was nobody surprised about this? Not the authors, not the reviewers, and not the editor?


It could be that our disorderly laboratory violated participants’ expectations

(28) Was this sentence included for comical effect? If so, it worked.

Our preferred explanation, though, is that cues of disorder can produce creativity because they inspire breaking free of convention

(29) Did the reviewers/editor consider this a satisfactory explanation? I mean, when it comes to ice cream flavors, I prefer pistachio to strawberry. Of course, the ice cream vendor doesn’t demand an explanation; he’s just as happy to sell me the pistachio as he is to sell me the strawberry. We’re talking not about ice cream flavors here, though, but about science, so shouldn't people be held to a higher standard than merely stating their preferences?
(30) Didn’t anyone find it ironic that the alternative explanation is supported with a reference whereas the preferred one is not?

Experiment 3

We measured preference for a new versus a classic option. Participants completed a task that ostensibly would help local restaurateurs create new menus. One of the options was labeled differently in the two conditions. That option was framed as either classic, or new, an unexplored option (Eidelman et al., 2009). We predicted that participants would choose the option framed as classic more when seated in an orderly (vs. disorderly) room, and, conversely, that they would choose the option framed as new more when seated in a disorderly (vs. orderly) room.

(31) Many questions could be asked at this point. I’ll just ask one: WTF? On the positive side, the sequence of experiments does bring back fond memories of Lazy Susan.

One hundred eighty-eight American adults participated in a 2 (environmental orderliness: orderly vs. disorderly) × 2 (label: classic vs. new) between-subjects design.

(32) Who were these mysterious “American adults”? I assume they were not students, unless the authors got tired of typing “students.”
(33) How were they recruited?
(34) What was their age range?
(35) How many of them were male vs. female?
(36) Where were they tested? Were the rooms on a college campus?
(37) How were they compensated?

We manipulated environmental orderliness by randomly assigning participants to complete the study in a room arranged to be orderly or disorderly (Fig. 3)

(38) What did the subjects say when they stepped into the rooms on the right? Did they say "If you want me to participate in your experiment, can you first please clean up the mess or do you want me to hopscotch to my seat?"
(39) Do the authors/reviewers/editor/journalists really believe that orderliness was the only dimension on which these rooms varied? The “disorderly” rooms look very staged for example. And the “orderly” rooms look like the “disorderly” room of Experiment 1.
(40) Was there an effect of room? For example, one orderly room has a boombox whereas the other does not. One disorderly room has a book weirdly placed behind the monitor whereas the other one has pencils strewn all over the floor. 

Participants imagined that they were getting a fruit smoothie with a “boost” (i.e., additional ingredients). Three types of boosts were available: health, wellness, or vitamin.

(41) Didn’t the subjects have a problem performing this task? Did anyone care to ask? Maybe I’m a particularly unimaginative guy but I don’t think I could do a good job imagining a "fruit smoothie with a health boost." And how is that different from one with a “wellness” or “vitamin” boost anyway?

We varied the framing of the health-boost option so that it cued the concept of convention or novelty (Fig. 4). To cue novelty, we added a star with the word new superimposed. To cue convention, we added a star with the word classic superimposed. The dependent measure was choice of the health-boost option.

(42) Were the authors confident that this manipulation of room and the labels “classic” vs. “new” would yield a crossover interaction? I guess they were but I wonder if anyone else would be, besides the reviewers and editor of course.

Planned contrasts supported our predictions (Fig. 5).

(43) No kidding. Evidently, the authors’ ability to create messy rooms is matched only by their ability to obtain perfect crossover interactions. Did the reviewers/editor not think that this interaction is, indeed, very very pretty?
(44) Was the pretest conducted in an orderly room? If so, it shows that there is no preference for label in an orderly room. Doesn't this contradict the main experiment, where a 35% vs. 17% preference was found for the classic label?

General discussion

Orderly environments promote convention and healthy choices…

(45) Is it a healthy choice if someone selects an apple and then doesn’t eat it? I guess you could call it that but it would be meaningless unless you're interested in demand effects. Have the authors/reviewers/editor considered demand effects in any of these experiments?

Our systematic investigations revealed that both kinds of settings can enable people to harness the power of these environments to achieve their goals.

(46) Did the reviewers/editor not think the authors grossly overstated their results here?
(47) Did no one chuckle when reading about the power of these environments in connection with the messy rooms?

One such person was Einstein, who is widely reported to have observed, “If a cluttered desk is a sign of a cluttered mind, of what, then, is an empty desk a sign?” (e.g.,

(48) Was it too much trouble to locate the source of this quote?

Author contributions

Data collection and analyses were overseen by all authors.

(49) How did this work if the data were collected at a university that none of the authors are affiliated with? I’m sure it can be done, but it would be important to know. And what does “overseen” mean here?
(50) And finally, does reading this article prime any thoughts of Stapel and Smeesters?

I’m sure that there are a lot more questions that could be asked about this research. My point is that they should have been asked and answered by all concerned before the research was published and before big claims about it were made in the media.

I hope no one will mind if I keep my office reasonably orderly. I’m sure the lady who cleans my office won’t appreciate me ransacking the place just so I can be more creative. And I don't think this study has convinced me that it would matter anyway. In fact, I find my daughter’s hypothesis far more compelling—and she didn’t even need imaginary smoothies, .8 effect sizes, and perfect crossover interactions to convince me: Too much clutter is distracting. Papers on messiness are a case in point.


  1. I can't imagine anything as small as changing one word in the marketing blurb producing the magnitude of that crossover effect. Do the experimenters live in some fantasy world where people really pay attention to that level of consumerist detail? (Maybe people actually do work like that, in which case, I guess we're all screwed.)

    1. I can't imagine this either. I am beginning to think that some researchers and reviewers live in a fantasy world where such subtle manipulations and huge effects are the norm.

  2. I must confess that I really liked the idea of this experiment. Unfortunately, however, after my own reading of the article I also felt disappointed. The major problem is that I do not understand how the authors concluded that a tidy or messy environment is a better predictor of healthy choices, generosity, and conventionality than individual differences, for instance (it seems the authors do not speak about individual differences/preferences at all). For example, if I were a participant in that experiment, I would not choose the chocolate simply because I don’t eat it (as you mentioned in question 8). Similarly, someone might prefer a bar of chocolate to an apple just because it (chocolate) is usually more expensive, and thus is a better reward for the experiment. In a word, there are many other factors that could explain the choice of one product over the other.
    However, the main reason I left this comment here was to thank you for the time you devote to writing your blog. The posts are indeed very helpful for many young researchers like me (and not only the ones interested in embodied cognition) as they give really useful tips on how to conduct solid experimental research.

    1. Thank you very much, Oleksandr. I am glad you find the blog useful. I very much enjoy writing it.

      My feeling with a lot of these social priming experiments is that the idea does not seem implausible a priori. However, it is usually tested in a way that you think it is never going to work. So the results seem extremely implausible in light of the method and not necessarily because the hypothesis is weird.

  3. Great post. I'd like to add that a Bayes factor analysis of those borderline p-values (.03<p<.05) is likely to reveal that the statistical evidence falls in Jeffreys' category "not worth more than a bare mention".

    1. Thanks EJ. It's a good point. My sense is that the study has bigger problems than the .03<p<.05, which are essentially meaningless anyway given all the confounds.
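
    EJ's point about borderline p-values can be illustrated without a full Bayes factor analysis. The sketch below does not compute the Bayes factor EJ has in mind; it swaps in the Vovk-Sellke bound, -1/(e · p · ln p), which gives the largest Bayes factor in favor of an effect that a given p-value can support (valid for p < 1/e). The function name and the constant are mine, chosen for illustration. Jeffreys' "bare mention" category covers Bayes factors between 1 and the square root of 10.

```python
import math

def max_bayes_factor(p):
    """Vovk-Sellke bound: largest Bayes factor (H1 over H0) compatible with p, for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return -1 / (math.e * p * math.log(p))

# Upper edge of Jeffreys' lowest evidence category (about 3.16).
BARE_MENTION_LIMIT = math.sqrt(10)

for p in (0.03, 0.04, 0.05):
    bound = max_bayes_factor(p)
    verdict = "within" if bound <= BARE_MENTION_LIMIT else "just above"
    print(f"p = {p:.2f}: Bayes factor at most {bound:.2f} ({verdict} the bare-mention range)")
```

    Because this is an upper bound, the evidence from any specific, reasonable prior would be weaker still.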

  4. This is a very good post. My enjoyment of it was spoiled by one thing, however. In question 40, the thing you call a "ghetto blaster" is more usually described as a "boom box" or "stereo." The word "ghetto" in this context evokes, at least for American readers, harshly negative racial and economic stereotypes of inner-city Black youth. You can get some sense of the range of meanings of the word at Urban Dictionary, a collaboratively-written slang dictionary:

    I am going to hazard the guess that this is an inadvertent terminological misstep. To avoid giving offense, the word "ghetto" is best avoided in English except in a historical context discussing the forced segregation imposed on Jewish people in Europe before and during WWII.

    1. Your point is well taken. I changed it to "boombox." I'm obviously familiar with that term but it is called a "ghetto blaster" in Dutch and this is probably why it made its way into the post.

      As someone whose parents were children in a city that was partly destroyed by the Nazis, I'm obviously not in need of a history lesson about WWII and as someone who has lived for 15 years below the Mason-Dixon line I am also familiar with America's racist history and present.

  5. I think the publication of this study might actually provide a data point in favor of the authors' hypothesis: "Orderly [methods in scientific papers] promote conventional wisdom and healthy choices [on the part of reviewers and editors]" ... whereas papers that are a methodological mess apparently induce reviewers and editors to be swayed by novelty and willing to make unhealthy publication-related choices!

  6. Experiment 3:

    35% + 17% + 36% + 18% = 106% ???

    Perhaps it is possible in some way to get to 104% due to rounding but +6% participants would be N=199.28. For N=188 these percentages give fractional cell counts.

    "We performed a logistic regression with choice of the health boost as the dependent measure, and environmental orderliness and label as between-subject factors. The main effects were not significant (χ2s < 0.5), whereas the expected interaction was, χ2(1, N = 188) = 7.59, p < .01, ϕ = .20"

    I'm a bit confused about this... a logistic regression was conducted but χ2 are reported? This could refer to a LR χ2 test of subsequent models, hence df=1, but it reads as if individual effects are reported... what were the parameter estimates? It is entirely possible that I missed a class, but what do the χ2 test statistics represent in the context of a logistic regression/ planned contrast?

    By the way, a rough calculation of the incredibility index (Schimmack, 2012) of this paper is (using Exp. 1 and 2):

    Number of sig. results for core predictions = 100%
    Average Power based on reported d: 0.745

    Probability of making type 1 error for the 4 tests that reported d: alpha err. prob = 0.05^4 = .00000625

    Total power: ability to detect an effect of 0.745 at err. prob. .00000625 with average N=60 is (1-beta) = 0.03

    Incredibility index = 97%
    97% of studies with these stats would have found at least 1 nonsignificant result.

    Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17(4), 551–66. doi:10.1037/a0029487
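
    For readers who want to play with this kind of calculation, here is a minimal sketch of the total-power logic as I understand it from Schimmack (2012): total power is the product of the power of each test, and the incredibility index is one minus that product. It is not a reproduction of the calculation above, which uses a much stricter joint alpha, so the numbers will differ. Power is computed with a normal approximation for a two-group comparison with equal group sizes and independent tests, both of which are my assumptions. The inputs (d = 0.745, N = 60, four tests) are simply the figures quoted in the comment, not values checked against the paper.

```python
from statistics import NormalDist

def approx_power(d, n_total, alpha=0.05):
    """Approximate power of a two-sided, two-group test with equal groups (normal approximation)."""
    z = NormalDist()
    ncp = d * (n_total / 4) ** 0.5  # d * sqrt(n1*n2/(n1+n2)) with n1 = n2 = n_total / 2
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)

def incredibility_index(powers):
    """One minus the probability that every test is significant, assuming independent tests."""
    total_power = 1.0
    for power in powers:
        total_power *= power
    return 1 - total_power

# Inputs taken from the comment above: four tests, average d = 0.745, average N = 60.
powers = [approx_power(0.745, 60)] * 4
ic_index = incredibility_index(powers)
print(f"power per test = {powers[0]:.2f}, incredibility index = {ic_index:.2f}")
```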

    1. I don't think the percentages you mention should be added up. The way I understand it (I agree it is not described very clearly in the paper), subjects could choose one of three options with only one of the options being labeled. So in the orderly room, 52% of the subjects selected the labeled option and 48% selected one of the two other options. In the disorderly room, 54% selected the labeled option and 46% selected one of the two other options. What's really puzzling about these percentages is what I describe in Question 44.

    2. Of course! I was initially trying to reconstruct the 2x2 table that I thought the χ2 was based on, that's why I lumped them together.

      So just about half of the 188 participants chose the labelled option even though it was made salient and appeared as the first choice.

      Q.44 is indeed crucial

    3. "what do the χ2 test statistics represent in the context of a logistic regression/ planned contrast?"

      They are most likely either Wald chi-squares (i.e., the square of the Wald z-statistics), or likelihood-ratio chi-squares from nested model comparisons, as you noted. My guess is that they are the former, but it's not entirely clear. Anyway, neither of these would be unusual or surprising for logistic regression. So I guess I am not sure what you think is strange about the reporting of chi-squares here.

  7. The percentages Rolf discusses above are the percentages for the pretest, where participants had to rate two differently labelled options. However, the way I understand it, the percentages shown in Figure 5 represent the percentage of participants in each group choosing the labelled option. That is, if I am not mistaken, it seems only the health boost option was labelled in the experiment and the other options were not. If this was the case, then the option that was made more salient with either a novelty or classic cue was selected by just 26.5% of participants averaged over groups. Which is also a novel and interesting finding.

  8. Hi Rolf,

    Very interesting stuff. Have you received any contact from the authors of this study? Do you know if they are aware of your critique/questions? I would love to hear their responses.

    1. Hi Chris,

      Thanks! I haven't heard from the authors. This would be interesting of course. Even more interesting might be to hear from the editor and/or reviewers; of course I don't know who they are. I suspect they are all aware of the questions, as the post has received thousands of views and has made the rounds on social media. We'll see what the future will bring. I did receive emails from various well-known social (and cognitive) psychologists who very much agreed with me.

  9. My biggest question is how you could test 3 hypotheses with only 3 experiments and publish it in PS.

  10. Here's another question: were the subjects pre-screened to determine their typical habits or preferences? I am a messy desk person. If I were put into the orderly room arm, I might very well behave quite differently than others who prefer that condition. More important, the neatniks who ended up in the messy condition must have really had a tough time focusing on their task. You can't test the psychological effects of a condition like this without recognizing that subjects arrive with their own natural context!

  11. Piggybacking on your point about possible confounds in the rooms, it seems fairly clear that "room" is a random effect, but they are treating it as a fixed effect. The effective sample size in each of the "orderly" and "disorderly" groups is thus N=1; running more participants may tell us a lot about these two rooms, but next to nothing about the populations of "orderly" versus "disorderly" rooms. The problem of treating a random effect as fixed is a well-known methodological issue that a reviewer/editor should have picked up on.
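
    The random-versus-fixed point can be made concrete with a small simulation. This is a hypothetical sketch, not a model of the actual study: every condition is represented by a single room, each room gets its own idiosyncratic offset (for example a boombox or pencils on the floor), and orderliness itself has no effect at all. The function name, the room SD of 0.5, and the 24 participants per room are illustrative choices of mine. The test is a simple z-test on the difference between rooms, ignoring the room effect, much as a standard analysis would.

```python
import random
import statistics

def false_positive_rate(room_sd, n_per_room=24, n_sims=2000, seed=1):
    """Share of simulated studies with |z| > 1.96 when orderliness has no true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # One room per condition, each with its own random offset unrelated to orderliness.
        offsets = [rng.gauss(0, room_sd) for _ in range(2)]
        room_a = [offsets[0] + rng.gauss(0, 1) for _ in range(n_per_room)]
        room_b = [offsets[1] + rng.gauss(0, 1) for _ in range(n_per_room)]
        se = ((statistics.variance(room_a) + statistics.variance(room_b)) / n_per_room) ** 0.5
        z = (statistics.mean(room_a) - statistics.mean(room_b)) / se
        if abs(z) > 1.96:
            hits += 1
    return hits / n_sims

print("no room effect:      ", false_positive_rate(room_sd=0.0))
print("room effect, SD 0.5: ", false_positive_rate(room_sd=0.5))
```

    With no room-to-room variation the false-positive rate stays near the nominal 5%, but once rooms differ for reasons other than orderliness it climbs far above that, and adding participants per room does not help. Only sampling more rooms per condition would.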