Tuesday, October 29, 2013

Premature Experimentation: Revaluing the Role of Essays and Thought Experiments

Some years ago I served as an outside member on a dissertation committee in a linguistics department. The graduate student in question wanted to conduct experiments, an unusual idea for a linguist. When the idea was discussed during the initial committee meeting, a colleague from the linguistics department sighed and said dismissively, “Ah, experiments. Psychologists always want to do experiments because they don’t know what’s going on.” The years had not yet mellowed me (ahem), so I had to bite back a snide comment.

But now I’m starting to wonder if that linguist didn’t have a point after all. Isn’t our field suffering from premature experimentation? Don’t we all have a tendency to design and run experiments before research questions have been really thought through?

I see four major sets of reasons why this might be the case.
  1. Institutional. Empirical articles are the principal currency of our field, so there exists an incentive structure to design and run experiments. Graduate students need experiments to be awarded their degree, postdocs need them to secure a tenure-track job, junior faculty need them to receive tenure, senior faculty need them to procure grants. 
  2. Educational. Students need to learn the trade. Designing and running experiments is a complex set of skills that takes years to master.
  3. Cognitive. Experiments are used as mental crutches. It’s hard to mentally manipulate abstract concepts. It’s easier to think about designs, variables, conditions, counterbalancing, and randomization because much of this can be grounded in perception and action (e.g., in the form of an imaginary 2 × 2 table), or offloaded onto the environment (e.g., sketched out on the back of an envelope).
  4. Temperamental. We are eager to “get our hands dirty” and curious to see results early on. When I get a new piece of software or some home appliance, I’m usually too impatient to carefully read the manual and/or do the tutorial. I take a quick look and then play it by ear. Running experiments before having things thought through is like starting without having perused the manual.
The last two reasons may not apply to everyone (they all apply to me, I’m afraid) but it is clear that there is a pressure and a drive to produce experiments.

How do we counter this pressure? I believe we should re-evaluate the importance of speculative articles, essays in which authors try out and develop thoughts about a specific topic. The essay is reviewed much like a philosophy or linguistics article would be reviewed (e.g., based on theoretical relevance, soundness of argumentation, and clarity of exposition) and is then introduced to the field, whereupon it may receive post-publication feedback, feedback that might give rise to further theory development. And then, at some point, the moment for experimentation has arrived.

Some topics are just not (yet) amenable to empirical research. But this does not mean they aren't interesting or worthwhile to discuss... (I had drafted this post up to this point about two weeks ago. Today someone on Twitter referred to an article that illustrates my point and helps me conclude the post.)
In an article that is currently in press in Perspectives on Psychological Science, the social psychologist Joseph Cesario takes a critical look at what he calls the literature on behavioral priming, which has been the focus of several posts on this blog. Cesario observes that the field is lacking theories that specify the precise contingencies under which particular priming effects can be obtained. He then asserts that therefore “failures [to replicate priming effects] are uninformative at the current time.”

I concur with Cesario’s theoretical criticism of priming research but I disagree with his statement that replications are uninformative. Moreover, if a researcher cannot state the conditions under which an effect is expected to replicate, then the study itself is uninformative. The experiment was conducted prematurely.

It is much better in such cases to just be upfront and present the idea as a thought experiment. The Gedankenexperiment has a venerable history in science. Moreover, it does not carry with it the pretense that there is empirical support for the idea.

There is more to say about the Cesario article but I’ll limit myself here to the conclusion that it nicely illustrates my main point: we should elevate essays to a higher status in our field and at the same time become warier of premature experimentation.

Just imagine what would (not) have happened if Diederik Stapel had not felt the need to produce “evidence” and had just described thought experiments in a series of essays.

Tuesday, October 22, 2013

"Effects are Public Property and not Personal Belongings": a Post-Publication Conversation

Welcome to a post-publication conversation on social-cognitive priming!  The impetus for the conversation is a social-cognitive priming article by Jostmann, Lakens, and Schubert that was published in 2009 in Psychological Science. The article is interesting enough in and of itself but what makes it an even more interesting discussion topic is that the authors themselves have performed and reported replication attempts of some of their findings. In addition, there are replication attempts by others.

The authors of the 2009 study, Nils Jostmann (NJ), Daniël Lakens (DL), and Thomas Schubert (TS), plus the author of a replication study, Hans IJzerman (HIJ), kindly agreed to respond to a series of questions I had prepared for them about the research. This allows a behind-the-scenes look at the original study, the decisions to perform replications, the evaluation of the replication attempts, and overall assessments of the main finding. The responses, which were given via email, are all the more interesting and instructive because they are remarkably open and self-critical. By way of disclosure I should note that I know Daniël Lakens, Hans IJzerman, and Thomas Schubert personally.

The basic idea behind the 2009 study is that importance is associated with weight. There are of course several expressions that associate weight with importance, like “weighty matter” and “the heavyweights of the field,” but more relevant is that weight and importance are associated in perception and action. The authors observe that heavy objects are more difficult to move or wield than light objects (and therefore are energetically more demanding). Also, being struck by a heavy object has more consequences than being struck by a light object. Jostmann et al. summarize the situation as follows: heavy objects have more impact on our bodies than light ones do. Their thesis is that the concept of importance is grounded in weight and they test this idea in several experiments.

 From their discussion of the grounding of importance in weight, Jostmann et al. derive the hypothesis that weight will influence judgments of importance. In Study 1 they test this idea by having subjects estimate the value of foreign currencies while holding a heavy or a light clipboard.

Question 1
This is a clever idea but a more direct test of your hypothesis would have been to have subjects judge the value of the clipboard itself (or more realistically of some other object). Why did you forgo this direct test?
NJ: We wanted to test the abstract implications of the assumed link between weight and importance. It seemed trivial - at least to me - that the weight of an object had an effect on its estimated value. Later we heard of studies (about the value of wine bottles) that confirmed that heavy objects are more valuable (at least under certain circumstances). 
DL: This idea seemed (and still seems) almost trivial. In general, the direct relation seems less interesting than examining how the physical experience of weight influences judgments about things that do not themselves cause the experience of weight. There are now several studies that show such direct links (for example, heavy wine bottles are perceived to be more valuable), and we have found several of these effects in student projects (e.g., heavy paper cups are valued more, and a light computer mouse is seen as less valuable than the same mouse with some lead hidden inside).
Question 2 
The inferential step you are making is that the weight of the clipboard gets transferred to the currency. This is an interesting idea. What is the mechanism you see at work here?  
NJ: Probably in the domain of money strong associations exist between weight and value, and apparently it's not so difficult to disguise that the felt weight is actually related to something irrelevant (i.e., the clipboard). 
DL: When we published the studies, my father said: “Ah, so it’s like an association, but then between a physical experience and an abstract concept?” The more time passes, the more I think his description of the mechanism was pretty accurate.
Results (values averaged across currencies and across subjects) indicate that subjects holding the heavy clipboard judged the currencies to be more valuable than those holding the light clipboard, p = .04.

In Study 2, the researchers had subjects judge (again holding the clipboards) the importance of having a voice in a decision-making process. Their goal was to assess the effect of holding a clipboard in an abstract domain.
Question 3 
This seems a big step from Study 1. Do you think the same mechanism underlies responses in the two experiments?
NJ: I tend to think no. We ran Study 2 because we wanted to see whether the link between weight and importance affects judgments on topics that have nothing to do with weight (although some people have rightfully commented that justice is associated with a weighing scale). The mechanism is probably a bit more complex than in Study 1: weight is one dimension on which one can judge potency (i.e., what are the implications of something). It's not new that participants use these kinds of dimensions to make judgments unless they can discount them (see Briñol & Petty, 2008, for an elaboration of how bodily cues can affect attitudes and persuasion on various levels).
DL: I ran this study, and was doing some other studies on morality at the time. So in terms of things I was working on, it was actually a really small step. If you consider that people were most likely performing judgments under uncertainty in Study 1, and how fair something is can also be influenced quite easily, it seemed just more of the same, but a more relevant topic to examine.
Results showed that subjects in the heavy clipboard condition found having a voice in decision making more important than did participants in the light clipboard condition, p < .05.

In Study 3, the authors reasoned that weight is associated with cognitive effort (e.g., it takes greater cognitive planning to move heavy objects than to move light ones). They tested this idea by having subjects (again with the clipboard) engage in a cognitive task and assess effort. Subjects described how much they liked the mayor of the city in which they were living, Amsterdam, and how satisfied they were with the city itself (quite sensibly, the subjects were very satisfied). The operationalization of cognitive elaboration was the correlation between the two types of statements.
Question 4
Again, this seems like a big step from the previous studies. Do you expect the same mechanism to be at work as in the previous studies? 
NJ: In this study, the outcome could have been a different one: a heavy clipboard could have made participants evaluate the mayor to be more important, powerful, valuable etc. We did not find this but judgment coherence instead. The pattern made sense to us though because the attitude literature argues that coherence can occur when people find a topic important. So, we did not predict exactly this finding (see question 6). Back then, I believed that I had a good explanation and thus did not need to mention that the findings were in fact exploratory (I now think that I should have mentioned it). As for the mechanism: participants probably already had a relatively strong pre-existing attitude towards the mayor (there was some controversy in the media on his political measures and some people found him weak while others found him strong). The heavy clipboard probably strengthened existing attitudes and made them more coherent but couldn’t change them completely. Briñol and Petty have written a very interesting chapter on how bodily cues influence attitudes and persuasion.
Results show that there was no main effect of clipboard but that there was a correlation between the mayor and city evaluations in the heavy clipboard condition (r = .42, p < .05) but not in the light clipboard condition (r = -.23, n.s.). The authors conclude that there was more cognitive elaboration in the heavy clipboard condition than in the light clipboard condition.
Question 5 
I don’t quite understand how this task measures cognitive elaboration. It seems a rather indirect way. Can you clarify?  Also, was this the pattern you had predicted?
 NJ: You are right that cognitive elaboration is just one possible explanation. A better test was done in study 4. 
Study 4 examines the effect of weight on the evaluation of strong versus weak arguments, again in an attempt to investigate the effect of weight on cognitive elaboration. The authors predict that holding the heavy clipboard will cause the subjects to assign proportionally more “weight” to strong arguments and less to weak ones, leading to more polarization in their evaluation of these arguments.
Question 6 
Again, this seems like a big step to me. What is the mechanism you think is at work here?
NJ: Probably the same mechanism as in Studies 2 and 3: weight signals potency, and if participants were looking for cues as to how important or valuable the issue at stake was for them, they used the - actually irrelevant - information of the clipboard weight.
The results show an interaction between clipboard and argument strength, p = .008. Although subjects holding the light clipboard agreed more with the strong than with the weak arguments (p = .03), this difference was larger in the heavy clipboard condition (p < .001).

The authors conclude that “weight influences how people deal with abstract issues much as it influences how people deal with concrete objects: It leads to greater investment of effort. In our studies, weight led to greater elaboration of thought, as indicated by greater consistency between related judgments, greater polarization between judgments of strong versus weak arguments, and greater confidence in one’s opinion.”
Question 7  
Your studies focused more on abstract issues than on concrete objects. However, you did not conduct tests on concrete objects. Do you expect that the effect of weight would have been larger, equally large, or smaller if you had used objects? 
NJ: the effect seems to be stronger and more robust if the value of concrete objects is judged. We did the most difficult but perhaps also more interesting studies.
DL: I think that heavy and light objects are much more strongly related to psychological value. So, the effects should be larger. I would guess it is not difficult to find these effects – we have done so a number of times, and so have others.
The article was published and a number of years later, the authors did something remarkable. They posted a “failed” (we still have to establish what it means to say that a replication attempt “failed”) replication of one of the experiments in the paper (Study 3) on the PsychFiledrawer website.
Question 8  
What was your reasoning behind (1) conducting the replication and (2) posting it on the PsychFiledrawer site? 
NJ: We conducted the replication study posted on psychfiledrawer.org at the same time as the four published studies. As I have already said on psychfiledrawer, I believed back then that there were good reasons why the study failed (noise, changing attitudes, lack of power etc.) and it didn't occur to me that not mentioning it would do any harm. Only later, when I learned that people were interested in the replicability of our finding (we received several requests to help with meta-analyses), did we decide that we should publish the results of the study.
DL: First of all, the ‘replication’ was performed together with the initial studies, but because it was not significant, it was not submitted for publication. We now understand all our studies were underpowered, and not all studies should have been expected to work. When we published the paper, we did the normal thing and did not mention the non-significant finding. Now, with our increased understanding and thoughts about how you should do science, we wanted to do the right thing.
 Question 9  
How do you evaluate the result of your replication in the context of the original experiment?
 NJ: I still believe that there are good reasons why the study failed: noisy environment and a topic on which public attitudes were changing rapidly at that time. Lack of power was also a problem. 
DL: The robustness of the effect in that study remains uncertain.
There are two other replication attempts on the site performed by other researchers. One is a success (replicating the original Study 2) and the other a failure (not replicating the original Study 2).
Question 10  
How do you evaluate the result of these attempts?  
 NJ: Apparently, it's not so easy to replicate our effects but at least some independent researchers were successful. I think that there are some parameters that we still don't understand that are necessary to take into account to find the weight-importance effect. It would be cool if someone published a paper on when our findings replicate and when not (hopefully experimenter demand or other artefacts are not an issue but if so, I'll be able to live with it). 
DL: In the failed replication, there might have been a ceiling effect, as those authors note. Or, the effect might not be robust. We need meta-analyses to know more (and these are being performed).
IJzerman and colleagues attempted to replicate (Study 2 in their paper) the original Study 2 (apparently the most popular among replicators). They found that subjects holding the heavy clipboard gave higher importance ratings than subjects holding the light clipboard but this difference was not significant, p = .12.
Question 11
What were your reasons for performing this replication attempt? And how do you evaluate the results? Does significance matter in a replication result? 
HIJ: Initially, a former student of mine (Justin Saddlemyer), Sander Koole, and I wanted to investigate an individual difference variable that is both relevant to my earlier work (on warmth) and to Jostmann et al.'s (on weight). We started this project prior to the entire replication debate in psychology. So, the project started as a replication+extension project. Given the entire discussion on replication, we wanted to do an "intermediate reporting" of what we do know (the other results are promising, but we simply don't have enough answers yet to report in a publication).
Also, initially we did not evaluate the replication as properly as we probably should have. I think Uri Simonsohn's method is a useful one. We had submitted the project to PLOS ONE, but for some odd reason PLOS ONE is doubting the ethical procedures. We hope to get that sorted out and will do the rewrite, including Uri's method of evaluating the effect sizes. So no, we don't think the p value is necessarily the crucial way of evaluating a replication.
 Question 12 
How do the original authors evaluate this replication attempt and its result? 
NJ: Hans does not provide detailed information about how the study was run. The weight was different and the study was underpowered (as were ours). It's difficult to say why it didn't work. 
DL: I don’t know the sample size in that study, but significance per se is less interesting when all studies are underpowered. Again, we have to wait for the meta-analysis.
HIJ: Agreed on the underpowered. We still think it is useful to report these studies, but agree that if we were to run another study, we would probably do a registered report, examining all the details that we present in the Replication Recipe. If the present study were to be published, by the way, all details of the study will be uploaded to Dataverse, so the amount of detail will probably be greater than what we currently include in our research summaries (i.e., publication).   
 Question 13 
How do you evaluate the usefulness of replications in general? Should researchers try to replicate their own results? 
NJ: yes, they should whenever possible. 
HIJ: Agreed, but ideally another lab should be able to replicate our studies. For this to happen, we do need to start reporting more detail of our studies.   
 Question 14 
Taking all the empirical evidence together, how much support is there for the notion that weight influences judgments of importance?
NJ: there is some support and I still believe that the link exists. Too many independent researchers (see M. Hafner, Experimental psychology) have successfully replicated our effects (even close replications) to make me think that we are dealing with a false positive. The effect might not be as strong as we thought though.
HIJ: I think more so than for most social psychology studies. That said, many studies - including ours - are underpowered. Given what Nils mentions and the general theoretical premise, I agree that it is unlikely that this is a false positive.
DL: Well, there are some successful replications, many of the studies show effects in the right direction, and we seem to have a rather nice number of studies for a meta-analysis. I think overall there might be an indication something is going on, but we don’t have a good grasp on the size of the effect, and the factors that influence the effect size. Still, it seems an interesting candidate to examine further. The effect of weight on concrete objects seems pretty large; whether it extends to abstract concepts requires further examination. (Other replication attempts of the weight-importance study are listed at the bottom of this page at Lakens' site.)
Question 15 
Do you have any additional comments?  
NJ: When I was a grad student I was surprised to see some researchers being very defensive regarding “their” effect. I’d like to thank all the people who contacted us about the reliability and validity of the weight-importance effect. They reminded me that effects are public property and not personal belongings. 
HIJ: I think for researchers it may sometimes be nerve-wracking if people try to replicate, in particular if they truly believe in certain effects. After all, your world view in a way is being violated. But, then again, I think these are exciting times, in that we get to know much more about effect sizes, how different effects scale up to one another, and what contextual factors are important in reproducing effects.  
TS: It was interesting though to see how you summarized the paper, and it made me realize something. Our paper was a combination of the embodiment idea that abstract concepts are grounded in concrete experience, and work on persuasion. Studies 1 and 2 were about the embodiment notion - the importance of abstract issues, like monetary value and voice, was influenced by concrete experiences. Then, Studies 3 and 4 combined this with work on persuasion. Although we did not actually study persuasion, we measured outcomes of social cognition processes identified in work on attitudes in the persuasion literature. Our interpretation of these results was that the patterns (alignment of related attitudes and polarization of strong ones vs. weak ones) reflected effortful processing. It may have been pretty oblique in the paper, but there is a large body of work on attitudes in social cognition behind this notion.
In hindsight, it might have made more sense to write two separate papers about these two parts, and to elaborate on each one more. Interestingly, the people who have followed up on this clearly were more interested in the first idea. 
Given that this is "a candid blog," it will surprise no one that I very much appreciate the candor and lack of defensiveness that is evident in these responses. I couldn't have summarized this discussion any better than Nils Jostmann just did: "effects are public property and not personal belongings." I look forward to hearing more about the meta-analyses of the weight-importance effect.

Tuesday, October 8, 2013

David Sedaris and the Power of the Spoken Word

Last week, David Sedaris gave a reading in Amsterdam as part of his latest book tour. When the performance was over, it dawned on me that something remarkable had happened. More than a thousand Dutch people had just stared for almost two hours at a soft-spoken and not physically imposing American man who was reading from sheets of paper. What was going on?

In our modern culture we don't seem to be able to get by without visuals. Schoolbooks are littered with photographs, diagrams, and figures. Most professors are incapable of lecturing without PowerPoint. News programs feature a plethora of graphs, pie charts, and animations. Heck, there even is a photograph on the left of this paragraph!

David Sedaris didn’t strut and prance across the stage while gesturing maniacally like a stand-up comedian or an overly excited TED-talker. He didn’t bring any visual props with him and certainly no PowerPoint presentation. Instead he was standing rather motionlessly behind a lectern, reading from a piece of paper in a deadpan manner with a slightly high-pitched voice.

Nevertheless, the audience seemed spellbound and rounds of laughter echoed through the theater. The show was over before I knew it. And I realized that along the way I, and probably the others in the audience as well, had been effectively transported to a taxidermist’s store in North London, a quiet village in Normandy, and an American hotel. And no visual aids were needed to accomplish this. So what was going on?

There are two types of situations that play a role in linguistic communication. One is the situation in which the communication takes place, in this case the Amsterdam theater Carré with a 1000+ audience; the other is the situation that the communication is about, let’s say the taxidermy store in North London. We can call these the communicative and the referential situation (the situation that we want to understand), respectively. In linguistic communication, these two types of situations have an elastic bond. Sometimes they overlap almost completely and at other times there is hardly any overlap at all.

Let’s first look at an example of where there is almost perfect overlap: a cooking show. Here the speaker talks about the situation he is acting in. The Portobello mushroom that Jamie Oliver is referring to is right there in front of him; it is not some fictional fungus. The actions that he is describing—slicing and seasoning the mushroom—are the actions he is performing at this very moment. The person he is calling “I” is the person who is simultaneously speaking and performing the actions: Jamie Oliver. The role of language is to direct attention across the visual scene. Naming an object prompts the eye to fixate that aspect of the scene and to encode whatever is there. The ingredients for understanding are readily available and language points us to them.

A moderate level of overlap occurs when a past or future state of the environment is projected onto the current environment. A friend who has recently remodeled his house might point to a kitchen island and explain that a wall used to stand there and that where the breakfast nook is right now there used to be the back door. Or the reverse might happen. Our friend might tell us about his remodeling plans. The wall between the kitchen and living room will be torn out to make room for a kitchen island. To understand the past or future situation, the listener can make use of various cues in the communicative situation. Eye movements serve to mark locations where objects or individuals were in the past or are expected to be in the future. All the listener needs to do is to imagine an object, person, or action in that location (presumably after having suppressed the object that’s actually there).

Finally, there are cases where there is practically no overlap between the communicative and referential situation. And this is the case that concerns us here: David Sedaris and the 1000 to 1200 Dutchmen.

There is no information in the communicative environment to focus attention on (no visual information at least), so language has to do the heavy lifting. The referential situation cannot be piggybacked onto the communicative situation (although there is some evidence that even in the absence of relevant cues people make meaningful eye movements). Language cannot be used to point to things that are already there. This probably explains why Sedaris’ prose is a lot more intricate than Jamie Oliver’s. The latter doesn’t have to put much thought into the composition of a sentence; he can make do with a simple “I’m putting the garlic into the pan.” Sedaris, on the other hand, has to craft a sentence with exquisite precision. He has to create a situation just from words. Of course, this is a problem faced by all novelists and some of them are quite successful at conjuring up fictional worlds.

What, then, is the added value of going to the theater to hear somebody read from his own work? (And why are people paying good money to do so?) In interviews Sedaris explains that when he writes a story, he reads it aloud to himself and makes changes until it reads well. His stories are designed to be read aloud. In the theater it was clear that Sedaris keenly anticipates and monitors the responses from his audience and, like a stand-up comedian, times and intonates his utterances accordingly for maximum effect.

The pacing of the telling of the story facilitates the mental transportation of the audience from the Amsterdam theater over to the North London taxidermy shop. It facilitates the audience’s ability to resonate to the narrator’s combination of fascination and politely suppressed horror when he is successively shown the skeleton of a Pygmy, a severed arm with a tattoo on it, and the head of a 13-year-old Peruvian girl. It also facilitates understanding the narrator’s feeling of wonder that the shopkeeper instantly knew him for what he was: the type who’d actually love a Pygmy, and could easily get over the fact that he’d been murdered for sport, thinking, breezily, “Well, it was a long time ago.”

The audience signals its understanding (and appreciation) of the story by emitting gales of laughter. These, in turn, determine the speaker’s highly skilled timing and intonation of upcoming phrases. This heightens the audience’s involvement in the story world. This way, an effective feedback loop is created. It is a subtle form of alignment between speaker and listeners.

The lack of visual props probably heightens this effect. There is nothing in the communicative situation to exert what Andy Clark calls a “gravitational pull” on the audience—pulling it back into the communicative situation—so that all attention can be devoted to the story as it unfolds at the pace determined by the speaker, expertly tailored to the audience’s immediate responses.

At least, that’s what I think was going on that night in Amsterdam.