50 Questions About Messy Rooms and Clean Data

Update April 3, 2025

This blog, Drang naar Samenhang, will feature posts in Dutch from now on, but no worries, English speakers, I’ve got you covered too. I have launched a Substack newsletter called Craving Coherence: https://rolfzwaan.substack.com.

You don’t need to subscribe to read the posts—just hit “No thanks” if prompted. Of course, I’d really appreciate it if you do sign up. It’s completely free!

So what is the newsletter about?

Why do we search for patterns, craft narratives, and cling to meaning? Craving Coherence explores the psychology of understanding—the mental shortcuts, biases, and frameworks that shape how we interpret reality. From cognitive science to philosophy, this newsletter examines how our minds construct coherence in an often chaotic world—and what happens when they fail.

I hope to see you there!

Back to the original post:

About a month ago, I had a difficult conversation with my daughter. Her year in college had not gone particularly well and I asked her what she was going to do differently next year. One of the first things she was going to do, she said, was to clean up her student room. It was just too cluttered to concentrate.

Not only 20-year-old students hypothesize about the effects of environments on thought; social psychologists do too. My daughter’s hypothesis is straightforward: messy environments are distracting. The social psychologists’ hypotheses take us a little further afield. For example: Messy environments promote stereotyping. The paper describing research into this hypothesis was co-authored by Diederik Stapel and has been retracted. Another hypothesis is that messy environments promote a longing for simplicity. The paper describing research into this hypothesis was co-authored by Dirk Smeesters and has been retracted.

Now there is a new study on messiness. It is about to be published in Psychological Science and has already received a lot of press coverage. The main findings are claimed to be that neat environments promote giving to charity and healthy eating behavior whereas messy environments promote creativity.

While I was reading the article, many questions arose. Given their obviousness, I’m surprised that these questions did not occur to the researchers who wrote the paper, the reviewers who commented on the manuscript, the editor who accepted the manuscript for publication, and the journalists who wrote breathless news stories about it. So in the rest of this post I’m just going to list these questions. I will not focus on theoretical aspects of the study (or the lack thereof), which would have made the list even longer.

My questions follow the structure of the paper.

Experiment 1

Thirty-four Dutch students participated. They were randomly assigned to an orderly or a disorderly condition.

(1) Isn’t 34 a small N for a between-subjects design with a subtle manipulation?

(2) At what university were these students? The authors are at the University of Minnesota. (I learned via Twitter that the subjects were likely run at Radboud University in Nijmegen, a university that the authors are not affiliated with.)

(3) How many male and female students were in the sample?

(4) Is there any additional information on the subjects that might be relevant?
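One rough way to put a number on question (1) is a small simulation. The sketch below is not taken from the paper. It assumes 17 participants per room, normally distributed scores with equal variance, and a "medium" true effect of d = 0.5, which is purely an illustrative guess. The function name and parameters are hypothetical.

```python
import math
import random

def simulated_power(d, n_per_group, t_crit, n_sims=10000, seed=1):
    """Estimate the power of a two-sided, two-sample t test by simulation."""
    rng = random.Random(seed)

    def mean_var(xs):
        # Sample mean and unbiased sample variance.
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v

    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(d, 1.0) for _ in range(n_per_group)]    # "orderly" group
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]  # "disorderly" group
        ma, va = mean_var(a)
        mb, vb = mean_var(b)
        se = math.sqrt((va + vb) / 2 * 2 / n_per_group)  # pooled standard error
        if abs((ma - mb) / se) > t_crit:
            hits += 1
    return hits / n_sims

# Assumptions (not from the paper): equal groups of 17 and a true d of 0.5.
T_CRIT_DF32 = 2.037  # two-sided critical t for alpha = .05 with df = 32
power = simulated_power(d=0.5, n_per_group=17, t_crit=T_CRIT_DF32)
print(f"Estimated power: {power:.2f}")
```

Under these assumptions the estimate should land far below the conventional 80% target, which is the sense in which 34 is a small N for this design. Changing `d` shows how large the true effect would have to be for such a sample to detect it reliably.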

We manipulated environmental orderliness by having participants complete the study in an orderly or disorderly room (Fig. 1).

(5) Doesn’t the “orderly room” look like a testing room and the “disorderly room” like someone’s office? Is, in other words, orderliness the only thing that is varied between conditions or are there one or more confounds?

Participants wrote the amount, if any, they chose to donate on a sheet of paper, which they placed into a sealed envelope (so that self-presentation concerns would be dispelled).

(6) Did the subjects actually donate the money? If so, how was this accomplished?

Upon exiting, participants were allowed to take an apple or chocolate bar, which constituted the measure of healthy food choice.

(7) What was the motivation for using these particular snacks?

(8) Weren’t the authors worried that some people may never eat chocolate whereas others never eat apples?

(9) Did the authors have independent information on the subjects’ snack preferences?

Participants who completed the study in the orderly room donated more than twice as much as those who completed the study in the disorderly room (M = €3.19, SD = 3.01, vs. M = €1.29, SD = 1.76), t (32) = 2.24, p = .03, d = 0.73. Fully 82% of participants in the orderly room donated some money, versus 47% in the disorderly room, χ2 (1, N = 34) = 4.64, p < .04, ϕ = .37.

(10) Didn’t the authors/reviewers/editor find this a surprisingly strong effect for such a small sample and such a subtle manipulation?
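As a quick consistency check, the reported test statistics can be recomputed from the summary numbers in the quote. The sketch below assumes 17 participants per room (the quote gives only the total N = 34) and that 82% and 47% correspond to 14 and 8 donors out of 17. Both are my assumptions, and the helper names are hypothetical.

```python
import math

def t_and_d(m1, sd1, m2, sd2, n1, n2):
    """Pooled-variance t statistic and Cohen's d from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    cohen_d = (m1 - m2) / math.sqrt(pooled_var)
    t = cohen_d / math.sqrt(1 / n1 + 1 / n2)
    return t, cohen_d

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and phi for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.sqrt(chi2 / n)

# Assumption: equal groups of 17.
t, cohen_d = t_and_d(3.19, 3.01, 1.29, 1.76, 17, 17)
# Assumption: 14 of 17 donated in the orderly room, 8 of 17 in the disorderly room.
chi2, phi = chi2_2x2(14, 3, 8, 9)
print(f"t(32) = {t:.2f}, d = {cohen_d:.2f}")
print(f"chi2(1) = {chi2:.2f}, phi = {phi:.2f}")
```

With these assumed cell sizes, t, χ2, and ϕ come out close to the reported values. The recomputed d is slightly higher than the reported 0.73, which may simply reflect unequal group sizes or a different formula.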

Also as predicted, participants in the orderly room chose the apple (over the chocolate) more often than those in the disorderly room (M = 67% vs. M = 20%), χ2 (1, N = 30) = 6.65, p < .05, ϕ = .44.

(11) See previous question.

(12) How can the authors be sure that out of 30 subjects randomly assigned to two conditions the numbers who normally prefer apples over chocolate were about equal before the manipulation? Weren’t they worried that an over-representation of apple lovers in the disorderly room would destroy their hypothesized effect? If not, why were they unconcerned about this?

(13) What did the subjects do with the snacks? Eat them? Give them away? Dump them in the trash?

(14) Does it count as a healthy choice if someone selects an apple but then doesn’t eat it?

(15) How were the snacks presented? Was the chocolate in a wrapper? And how about the apple?

(16) What would have happened if more subjects in the orderly room had selected the chocolate? Would the authors have post-hoc hypothesized that some compensatory mechanism was at work? (Chocolate counteracts the effects of being in sterile environments.)

(17) Was no one concerned that the donation task might influence the snack-selection task?

(18) Was no one concerned about demand effects?

(19) Were these two tasks the only tasks that were performed?
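Question (12) can be made concrete. Suppose snack preferences were fixed before anyone entered a room. How often would random assignment alone put this many apple-choosers in the orderly room? The sketch below computes that hypergeometric tail probability, which is what a one-sided Fisher exact test reports. It assumes 15 participants per room and that 67% and 20% of 15 mean 10 and 3 apple choices. These cell counts are my inference, not figures stated in the paper, and the function name is hypothetical.

```python
from math import comb

def prob_at_least(k_min, n_apple, n_total, n_room):
    """P(at least k_min of the n_apple apple-choosers land in a room of size
    n_room) when room assignment is purely random (hypergeometric tail)."""
    total = comb(n_total, n_room)
    k_max = min(n_apple, n_room)
    favourable = sum(
        comb(n_apple, k) * comb(n_total - n_apple, n_room - k)
        for k in range(k_min, k_max + 1)
    )
    return favourable / total

# Assumptions: 15 per room; 10 + 3 = 13 apple-choosers among 30 participants.
p = prob_at_least(k_min=10, n_apple=13, n_total=30, n_room=15)
print(f"P(10 or more of 13 apple-choosers in the orderly room) = {p:.4f}")
```

Under these assumptions the probability is a little over 1%. That is small, but it is a statement about chance alone and says nothing about confounds, demand effects, or what subjects did with the snacks afterwards.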

Experiment 2

Given that orderliness is paired with valuing convention, a disorderly state should encourage breaking with convention, which is needed to be creative (Simonton, 1999). Therefore, we predicted that being in a disorderly environment would have the desirable effect of stimulating creativity.

(20) Did the reviewers/editor consider this a convincing rationale for the prediction?

Forty-eight American students participated in a two-condition (orderly vs. disorderly environment) design.

(21) What was these students’ affiliation?

(22) How many males vs. females were in the experiment?

Participants completed tasks in a room arranged to be either orderly or disorderly (Fig. 2).

(23) Are the authors/reviewers/editor/journalists serious? Is this really the same manipulation of orderliness as in Experiment 1? The room looks orderly alright in the picture on the left but on the right it looks like some errant groundskeeper had just wandered in with a leaf blower on at full blast.

(24) Do the authors/reviewers/editor seriously believe that orderliness is the only dimension along which the two rooms differ? Are there no confounds?

(25) What did the subjects say upon entering the disorderly room? Did they perchance say “Is this a practical joke?” In other words, did they take the experiment seriously? As seriously at least as those in the orderly room?

Two coders, blind to condition, rated each idea on a 3-point scale (1 = not at all creative, 3 = very creative; κ = .81, p < .01); disagreements were resolved through discussion.

(26) What were the criteria that were used by the raters? What is an example of a “very creative” idea?

Results (all effects were significant and effect sizes were large)

(27) Was nobody surprised about this? Not the authors, not the reviewers, and not the editor?

Discussion

It could be that our disorderly laboratory violated participants’ expectations

(28) Was this sentence included for comical effect? If so, it worked.

Our preferred explanation, though, is that cues of disorder can produce creativity because they inspire breaking free of convention

(29) Did the reviewers/editor consider this a satisfactory explanation? I mean, when it comes to ice cream flavors, I prefer pistachio to strawberry. Of course, the ice cream vendor doesn’t demand an explanation; he’s just as happy to sell me the pistachio as he is to sell me the strawberry. We’re talking not about ice cream flavors here, though, but about science, so shouldn't people be held to a higher standard than merely stating their preferences?

(30) Didn’t anyone find it ironic that the alternative explanation is supported with a reference whereas the preferred one is not?

Experiment 3

We measured preference for a new versus a classic option. Participants completed a task that ostensibly would help local restaurateurs create new menus. One of the options was labeled differently in the two conditions. That option was framed as either classic, or new, an unexplored option (Eidelman et al., 2009). We predicted that participants would choose the option framed as classic more when seated in an orderly (vs. disorderly) room, and, conversely, that they would choose the option framed as new more when seated in a disorderly (vs. orderly) room.

(31) Many questions could be asked at this point. I’ll just ask one: WTF? On the positive side, the sequence of experiments does bring back fond memories of Lazy Susan.

One hundred eighty-eight American adults participated in a 2 (environmental orderliness: orderly vs. disorderly) × 2 (label: classic vs. new) between-subjects design.

(32) Who were these mysterious “American adults”? I assume they were not students, unless the authors got tired of typing “students.”

(33) How were they recruited?

(34) What was their age range?

(35) How many of them were male vs. female?

(36) Where were they tested? Were the rooms on a college campus?

(37) How were they compensated?

We manipulated environmental orderliness by randomly assigning participants to complete the study in a room arranged to be orderly or disorderly (Fig. 3)

(38) What did the subjects say when they stepped into the rooms on the right? Did they say “If you want me to participate in your experiment, can you first please clean up the mess or do you want me to hopscotch to my seat?”

(39) Do the authors/reviewers/editor/journalists really believe that orderliness was the only dimension on which these rooms varied? The “disorderly” rooms look very staged for example. And the “orderly” rooms look like the “disorderly” room of Experiment 1.

(40) Was there an effect of room? For example, one orderly room has a boombox whereas the other does not. One disorderly room has a book weirdly placed behind the monitor whereas the other one has pencils strewn all over the floor.

Participants imagined that they were getting a fruit smoothie with a “boost” (i.e., additional ingredients). Three types of boosts were available: health, wellness, or vitamin.

(41) Didn’t the subjects have a problem performing this task? Did anyone care to ask? Maybe I’m a particularly unimaginative guy but I don’t think I could do a good job imagining a "fruit smoothie with a health boost." And how is that different from one with a “wellness” or “vitamin” boost anyway?

We varied the framing of the health-boost option so that it cued the concept of convention or novelty (Fig. 4). To cue novelty, we added a star with the word new superimposed. To cue convention, we added a star with the word classic superimposed. The dependent measure was choice of the health-boost option.

(42) Were the authors confident that this manipulation of room and the labels “classic” vs. “new” would yield a crossover interaction? I guess they were but I wonder if anyone else would be, besides the reviewers and editor of course.

Planned contrasts supported our predictions (Fig. 5).

(43) No kidding. Evidently, the authors’ ability to create messy rooms is matched only by their ability to obtain perfect crossover interactions. Did the reviewers/editor not think that this interaction is, indeed, very very pretty?

(44) Was the pretest conducted in an orderly room? If so, it shows that there is no preference for label in an orderly room. Doesn't this contradict the main experiment, where a 35% vs. 17% preference was found for the classic label?

General discussion

Orderly environments promote convention and healthy choices…

(45) Is it a healthy choice if someone selects an apple and then doesn’t eat it? I guess you could call it that but it would be meaningless unless you're interested in demand effects. Have the authors/reviewers/editor considered demand effects in any of these experiments?

Our systematic investigations revealed that both kinds of settings can enable people to harness the power of these environments to achieve their goals.

(46) Did the reviewers/editor not think the authors grossly overstated their results here?

(47) Did no one chuckle when reading about the power of these environments in connection with the messy rooms?

One such person was Einstein, who is widely reported to have observed, “If a cluttered desk is a sign of a cluttered mind, of what, then, is an empty desk a sign?” (e.g., www.goodreads.com)

(48) Was it too much trouble to locate the source of this quote?

Author contributions

Data collection and analyses were overseen by all authors.

(49) How did this work if the data were collected at a university that none of the authors are affiliated with? I’m sure it can be done, but it would be important to know. And what does “overseen” mean here?

(50) And finally, does reading this article prime any thoughts of Stapel and Smeesters?

I’m sure that there are a lot more questions that could be asked about this research. My point is that they should have been asked and answered by all concerned before the research was published and before big claims about it were made in the media.

I hope no one will mind if I keep my office reasonably orderly. I’m sure the lady who cleans my office won’t appreciate me ransacking the place just so I can be more creative. And I don't think this study has convinced me that it would matter anyway. In fact, I find my daughter’s hypothesis far more compelling—and she didn’t even need imaginary smoothies, .8 effect sizes, and perfect crossover interactions to convince me: Too much clutter is distracting. Papers on messiness are a case in point.

Comments

Nick Brown, 16 August 2013 at 18:51
I can't imagine anything as small as changing one word in the marketing blurb producing the magnitude of that crossover effect. Do the experimenters live in some fantasy world where people really pay attention to that level of consumerist detail? (Maybe people actually do work like that, in which case, I guess we're all screwed.)
Unknown, 16 August 2013 at 20:53
I must confess that I really liked the idea of this experiment. Unfortunately, however, after my own reading of the article I also felt disappointed. The major problem is that I do not understand how the authors concluded that a tidy or messy environment is a better predictor of healthy choices, generosity, and conventionality than individual differences, for instance (it seems the authors do not speak about individual differences/preferences at all). For example, if I were a participant in that experiment, I would not choose the chocolate simply because I don’t eat it (as you mentioned in question 8). Similarly, someone might prefer a bar of chocolate to an apple just because it (chocolate) is usually more expensive, and thus is a better reward for the experiment. In a word, there are many other factors that could explain the choice of one product over the other.
However, the main reason I left this comment here was to thank you for the time you devote to writing your blog posts. They are indeed very helpful for many young researchers like me (and not only the ones interested in embodied cognition) as they give really useful tips on how to conduct solid experimental research.
Oleksandr.
EJ, 16 August 2013 at 22:38
Great post. I'd like to add that a Bayes factor analysis of those borderline p-values (.03<p<.05) is likely to reveal that the statistical evidence falls in Jeffreys' category "not worth more than a bare mention".
E.J.
Aaron Ecay, 17 August 2013 at 00:45
This is a very good post. My enjoyment of it was spoiled by one thing, however. In question 40, the thing you call a "ghetto blaster" is more usually described as a "boom box" or "stereo." The word "ghetto" in this context evokes, at least for American readers, harshly negative racial and economic stereotypes of inner-city Black youth. You can get some sense of the range of meanings of the word at Urban Dictionary, a collaboratively-written slang dictionary: http://www.urbandictionary.com/define.php?term=ghetto

I am going to hazard the guess that this is an inadvertent terminological misstep. To avoid giving offense, the word "ghetto" is best avoided in English except in a historical context discussing the forced segregation imposed on Jewish people in Europe before and during WWII.
Kara, 17 August 2013 at 05:30
I think the publication of this study might actually provide a data point in favor of the authors' hypothesis: "Orderly [methods in scientific papers] promote conventional wisdom and healthy choices [on the part of reviewers and editors]" ... whereas papers that are a methodological mess apparently induce reviewers and editors to be swayed by novelty and willing to make unhealthy publication-related choices!
Unknown, 18 August 2013 at 03:15
Experiment 3:

35% + 17% + 36% + 18% = 106% ???

Perhaps it is possible in some way to get to 104% due to rounding but +6% participants would be N=199.28. For N=188 these percentages give fractional cell counts.

"We performed a logistic regression with choice of the health boost as the dependent measure, and environmental orderliness and label as between-subject factors. The main effects were not significant (χ2s < 0.5), whereas the expected interaction was, χ2(1, N = 188) = 7.59, p < .01, ϕ = .20"

I'm a bit confused about this... a logistic regression was conducted but χ2 are reported? This could refer to a LR χ2 test of subsequent models, hence df=1, but it reads as if individual effects are reported... what were the parameter estimates? It is entirely possible that I missed a class, but what do the χ2 test statistics represent in the context of a logistic regression/ planned contrast?

---
By the way, a rough calculation of the incredibility index (Schimmack, 2012) of this paper is (using Exp. 1 and 2):

Number of sig. results for core predictions = 100%
Average Power based on reported d: 0.745

Probability of making type 1 error for the 4 tests that reported d: alpha err. prob = 0.05^4 = .00000625

Total power: ability to detect an effect of 0.745 at err. prob. .00000625 with average N=60 is (1-beta) = 0.03

Incredibility index = 97%
97% of studies with these stats would have found at least 1 nonsignificant result.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17(4), 551–66. doi:10.1037/a0029487
Anonymous, 20 August 2013 at 11:29
The percentages Rolf discusses above are the percentages for the pretest, where participants had to rate two differently labelled options. However, the way I understand it, the percentages shown in Figure 5 represent the percentage of participants in each group choosing the labelled option. That is, if I am not mistaken, it seems only the health boost option was labelled in the experiment and the other options were not. If this was the case, then the option that was made more salient with either a novelty or classic cue was selected by just 26.5% of participants averaged over groups. Which is also a novel and interesting finding.
Schotz, 22 August 2013 at 19:55
Hi Rolf,

Very interesting stuff. Have you received any contact from the authors of this study? Do you know if they are aware of your critique/questions? I would love to hear their responses.
Unknown, 16 September 2013 at 20:54
My biggest question is how you could test 3 hypotheses with only 3 experiments and publish it in PS.
Richard Morey, 23 September 2013 at 16:21
Piggybacking on your point about possible confounds in the rooms, it seems fairly clear that "room" is a random effect, but they are treating it as a fixed effect. The effective sample size in the "orderly" versus "disorderly" groups is thus N=1 for each; thus, running more participants may tell us a lot about these two rooms, but next to nothing about the populations of "orderly" versus "disorderly" rooms. The problem of treating a random effect as fixed is a well-known methodological issue that a reviewer/editor should have picked up on.
Unknown, 7 March 2017 at 21:21
I am a researcher in a completely different field. My desk sometimes looks very messy, but it is my own mess. I think my own mess frequently helps with the thought process because the apparent mess is full of cues that are meaningful to me, and related to what I am working on at the time. A mess staged by an outsider may or may not carry information meaningful to the subjects; a mess that one has built oneself can be a very rich source of information. So a replication should, I think, include not only enough subjects but also different kinds of messy and tidy arrangements/environments to which each subject would be exposed. Or is the comparison between messy vs. tidy meaningful at all?