Tuesday, December 10, 2013

Time, Money, and Morality

A paper that is in press in Psychological Science tests the hypothesis that priming someone with the concept of time makes them cheat less than someone who is not so primed. Or, as the authors articulate the idea in the abstract: “Across four experiments, we examined whether shifting focus onto time can salvage individuals’ ethicality.”

I've said a lot already about the kind of theorizing and experimenting found in this type of priming research, so I just want to keep it simple this time and concentrate on something that is currently under fire in the literature, even on the pages of Psychological Science itself: the p-value.

As the abstract indicates, there are four experiments. In each experiment, the key prediction is that exposure to the "time-prime" causes people to cheat less. Each prediction is evaluated on the basis of a p-value. In Experiment 1, the prediction was that subjects would cheat less in the "time-prime" condition than in the control condition. (There also was a money-prime condition, but this was not germane to the key hypothesis.) I've highlighted the key result.

In Experiment 2 the key hypothesis was that if priming time decreases cheating by making people reflect on who they are, cheating behavior in the latter condition would not differ between participants primed with money and those primed with time. However, participants who were told that the game was a test of intelligence would show the same effect observed in Experiment 1. So the authors predicted an interaction between reflection (reflection vs. no reflection) and type of prime (time vs. money). Here are the results.

In Experiment 3 the authors manipulated self-reflection in a literal way: subjects were or were not seated in front of a mirror and this was crossed with prime condition (money vs. time). Again, the key prediction involved an interaction. 

Finally, in Experiment 4 the three priming conditions of Experiment 1 were used (money, time, control), which produced the following results.

So we have four experiments, each with its key prediction supported by a p-value between .04 and .05. How likely are these results?

This question can be answered with a method developed by Simonsohn, Simmons, and Nelson (in press). To quote from the abstract: Because scientists tend to report only studies (publication bias) or analyses (p-hacking) that “work”, readers must ask, “Are these effects true, or do they merely reflect selective reporting?” We introduce p-curve as a way to answer this question. P-curve is the distribution of statistically significant p-values for a set of studies (ps < .05).

Simonsohn and colleagues have developed a web app that makes it very easy to compute p-curves. I used that app to compute the p-curve for the four experiments, using the p-values for the key hypotheses.

So if I did everything correctly, the app concludes that the experiments in this study had no evidential value and were intensely p-hacked.
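To get a feel for why four p-values between .04 and .05 look suspicious, here is a back-of-the-envelope sketch in Python. It is not the computation the p-curve app performs. It assumes the simplest benchmark: if there were no true effect and no selective analysis, significant p-values would be spread uniformly between 0 and .05, so the .04 to .05 band would hold one fifth of them. The function name and its parameters are my own, chosen for illustration.

```python
from math import comb

def prob_at_least_k_in_band(n_studies, k, band_low=0.04, band_high=0.05, alpha=0.05):
    """Chance that at least k of n_studies significant p-values land in
    [band_low, band_high], ASSUMING significant p-values are uniform on (0, alpha)."""
    share = (band_high - band_low) / alpha  # fraction of the significant range covered by the band
    return sum(
        comb(n_studies, i) * share**i * (1 - share) ** (n_studies - i)
        for i in range(k, n_studies + 1)
    )

# Four key tests, all four with p between .04 and .05
p_all_four = prob_at_least_k_in_band(4, 4)
print(f"{p_all_four:.4f}")  # prints 0.0016
```

Under this uniform benchmark the chance is 0.2 to the fourth power, or about 1 in 625. If the effects were real, p-values just under .05 would be even rarer relative to smaller ones, so the benchmark is, if anything, generous.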

It is somewhat ironic that the second author of the Psych Science paper and the first author of the p-curve paper are at the same institution. This is illustrative of the current state of methodological flux that our field is in: radically different views of what constitutes evidence co-exist in institutions and journals (e.g., Psychological Science). 

Thursday, November 28, 2013

What Can we Learn from the Many Labs Replication Project?

The first massive replication project in psychology has just reached completion (several others are to follow). A large group of researchers, which I will refer to as ManyLabs, has attempted to replicate 15 findings from the psychological literature in various labs across the world. The paper is posted on the Open Science Framework (along with the data) and Ed Yong has authored a very accessible write-up. [Update May 20, 2014, the article is out now and is open access.]

What can we learn from the ManyLabs project? The results here show the effect sizes for the replication efforts (in green and grey) as well as the original studies (in blue). The 99% confidence intervals are for the meta-analysis of the effect size (the green dots); the studies are ordered by effect size.

Let’s first consider what we canNOT learn from these data. Of the 13 replication attempts (when the first four are taken together), 11 succeeded and 2 did not (in fact, at some point ManyLabs suggests that a third one, Imagined Contact, also doesn’t really replicate). We cannot learn from this that the vast majority of psychological findings will replicate, contrary to this Science headline, which states that these findings “offer reassurance” about the reproducibility of psychological findings. As Ed Yong (@edyong209) joked on Twitter, perhaps ManyLabs has stumbled on the only 12 or 13 psychological findings that replicate! Because the 15 experiments were not a random sample of all psychology findings and it’s a small sample anyway, the percentage is not informative, as ManyLabs duly notes.

But even if we had an accurate estimate of the percentage of findings that replicate, how useful would that be? Rather than trying to arrive at a more precise estimate, it might be more informative to follow up the ManyLabs projects with projects that focus on a specific research area or topic, as I proposed in my first-ever post, as this might lead to theory advancement.

So what DO we learn from the ManyLabs project? We learn that for some experiments, the replications actually yield much larger effects than the original studies, a highly intriguing finding that warrants further analysis.

We also learn that the two social priming studies in the sample, dangling at the bottom of the list in the figure, were resoundingly nonreplicated. One study found that exposure to the United States flag increases conservatism among Americans; the other study found that exposure to money increases endorsement of the current social system. The replications show that there essentially is no effect whatsoever for either of these exposures.

It is striking how far the effect sizes of the original studies (indicated by an x) are away from the rest of the experiments. There they are, by their lone selves at the bottom right of the figure. Given that all of the data from the replication studies have been posted online, it would be fascinating to get the data from the original studies. Comparisons of the various data sets might shed light on why these studies are such outliers.

We also learn that the online experiments in the project yielded results that are highly similar to those produced by lab experiments. This does not mean, of course, that any experiment can be transferred to an online environment, but it certainly inspires confidence in the utility of online experiments in replication research.

Most importantly, we learn that several labs working together yield data that have enormous evidentiary power. At the same time, it is clear that such large-scale replication projects will have diminishing returns (for example, the field cannot afford to devote countless massive replication efforts to not replicating all the social priming experiments that are out there). However, rather than using the ManyLabs approach retrospectively, we can also use it prospectively: to test novel hypotheses.

Here is how this might go.

(1) A group of researchers form a hypothesis (not by pulling it out of thin air but by deriving it from a theory, obviously).
(2) They design—perhaps via crowd sourcing—the best possible experiment.
(3) They preregister the experiment.
(4) They post the protocol online.
(5) They simultaneously carry out the experiment in multiple labs.
(6) They analyze and meta-analyze the data.
(7) They post the data online.
(8) They write a kick-ass paper.
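
Step (6) can be sketched in a few lines of Python. This is a minimal fixed-effect (inverse-variance) meta-analysis, not necessarily the model ManyLabs used. The lab effect sizes and standard errors below are hypothetical placeholders, and the function name is my own. The 99% interval mirrors the confidence level shown in the ManyLabs figure.

```python
from math import sqrt
from statistics import NormalDist

def fixed_effect_meta(effects, std_errors, confidence=0.99):
    """Pool per-lab effect sizes with inverse-variance weights (fixed-effect model).
    Returns (pooled effect, lower CI bound, upper CI bound)."""
    weights = [1.0 / se**2 for se in std_errors]  # more precise labs count for more
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for a 99% interval
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se

# HYPOTHETICAL data: standardized effects and standard errors from three labs
lab_effects = [0.10, 0.25, 0.18]
lab_std_errors = [0.10, 0.10, 0.10]

pooled, ci_low, ci_high = fixed_effect_meta(lab_effects, lab_std_errors)
print(f"pooled d = {pooled:.3f}, 99% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With equal standard errors the pooled estimate is simply the mean of the lab estimates. The pooled interval is narrower than any single lab's interval would be, which is where the evidentiary power of the multi-lab approach comes from.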

And so I agree with the ManyLabs authors when they conclude that a consortium of laboratories could provide mutual support for each other by conducting similar large-scale investigations on original research questions, not just replications. Among the many accomplishments of the ManyLabs project, showing us the feasibility of this approach might be its major one.

Tuesday, October 29, 2013

Premature Experimentation: Revaluing the Role of Essays and Thought Experiments

Some years ago I served as an outside member on a dissertation committee in a linguistics department. The graduate student in question wanted to conduct experiments, an unusual idea for a linguist. When the idea was discussed during the initial committee meeting, a colleague from the linguistics department sighed and said dismissively, “Ah, experiments. Psychologists always want to do experiments because they don’t know what’s going on.” The years had not yet mellowed me (ahem), so I had to bite back a snide comment.

But now I’m starting to wonder if that linguist didn’t have a point after all. Isn’t our field suffering from premature experimentation? Don’t we all have a tendency to design and run experiments before research questions have been really thought through?

I see four major sets of reasons why this might be the case.
  1. Institutional. Empirical articles are the principal currency of our field, so there exists an incentive structure to design and run experiments. Graduate students need experiments to be awarded their degree, postdocs need them to secure a tenure-track job, junior faculty need them to receive tenure, senior faculty need them to procure grants. 
  2. Educational. Students need to learn the trade. Designing and running experiments is a complex set of skills that take years to master.
  3. Cognitive. Experiments are used as mental crutches. It’s hard to mentally manipulate abstract concepts. It’s easier to think about designs, variables, conditions, counterbalancing, and randomization because much of this can be grounded in perception and action (e.g., in the form of an imaginary 2 X 2 table), or offloaded on the environment (e.g., sketched out on the back of an envelope).
  4. Temperamental. We are eager to “get our hands dirty” and curious to see results early on. When I get a new piece of software or some home appliance, I’m usually too impatient to carefully read the manual and/or do the tutorial. I take a quick look and then play it by ear. Running experiments before having things thought through is like starting without having perused the manual.
The last two reasons may not apply to everyone (they all apply to me, I’m afraid) but it is clear that there is a pressure and a drive to produce experiments.

How do we counter this pressure? I believe we should re-evaluate the importance of speculative articles, essays in which authors try out and develop thoughts about a specific topic. The essay is reviewed much like a philosophy or linguistics article would be reviewed (e.g., based on theoretical relevance, soundness of argumentation, and clarity of exposition) and is then introduced to the field, whereupon it may receive post-publication feedback, feedback that might give rise to further theory development. And then, at some point, the moment for experimentation has arrived.

Some topics are just not (yet) amenable to empirical research. But this does not mean they aren't interesting or worthwhile to discuss... (I had drafted this post up to this point about two weeks ago. Today someone on Twitter referred to an article that illustrates my point and helps me conclude the post.)

In an article that is currently in press in Perspectives on Psychological Science, the social psychologist Joseph Cesario takes a critical look at what he calls the literature on behavioral priming, which has been the focus of several posts in this blog. Cesario observes that the field is lacking theories that specify the precise contingencies under which particular priming effects can be obtained. He then asserts that therefore failures [to replicate priming effects] are uninformative at the current time.

I concur with Cesario’s theoretical criticism of priming research but I disagree with his statement that replications are uninformative. Moreover, if a researcher cannot state the conditions under which an effect is expected to replicate, then the study itself is uninformative. The experiment was conducted prematurely.

It is much better in such cases to just be upfront and present the idea as a thought experiment. The Gedankenexperiment has a venerable history in science. Moreover, it does not carry with it the pretense that there is empirical support for the idea.

There is more to say about the Cesario article but I’ll limit myself here to the conclusion that it nicely illustrates my main point: we should elevate essays to a higher status in our field and at the same time become warier of premature experimentation.

Just imagine what would (not) have happened if Diederik Stapel had not felt the need to produce “evidence” and had just described thought experiments in a series of essays.

Tuesday, October 22, 2013

"Effects are Public Property and not Personal Belongings": a Post-Publication Conversation

Welcome to a post-publication conversation on social-cognitive priming!  The impetus for the conversation is a social-cognitive priming article by Jostmann, Lakens, and Schubert that was published in 2009 in Psychological Science. The article is interesting enough in and of itself but what makes it an even more interesting discussion topic is that the authors themselves have performed and reported replication attempts of some of their findings. In addition, there are replication attempts by others.

The authors of the 2009 study, Nils Jostmann (NJ), Daniël Lakens (DL), and Thomas Schubert (TS), plus the author of a replication study, Hans IJzerman (HIJ), kindly agreed to respond to a series of questions I had prepared for them about the research. This allows a behind-the-scenes look at the original study, the decisions to perform replications, the evaluation of the replication attempts, and overall assessments of the main finding. The responses, which were given via email, are all the more interesting and instructive because they are remarkably open and self-critical. By way of disclosure I should note that I know Daniël Lakens, Hans IJzerman, and Thomas Schubert personally.

The basic idea behind the 2009 study is that importance is associated with weight. There are of course several expressions that associate weight with importance, like weighty matter and the heavyweights of the field, but more relevant is that weight and importance are associated in perception and action. The authors observe that heavy objects are more difficult to move or yield than light objects (and therefore are energetically more demanding). Also, being struck by a heavy object has more consequences than being struck by a light object. Jostmann et al. summarize the situation as follows: heavy objects have more impact on our bodies than light ones do. Their thesis is that the concept of importance is grounded in weight and they test this idea in several experiments.

From their discussion of the grounding of importance in weight, Jostmann et al. derive the hypothesis that weight will influence judgments of importance. In Study 1 they test this idea by having subjects estimate the value of foreign currencies while holding a heavy or a light clipboard.

Question 1
This is a clever idea but a more direct test of your hypothesis would have been to have subjects judge the value of the clipboard itself (or more realistically of some other object). Why did you forgo this direct test?
NJ: We wanted to test the abstract implications of the assumed link between weight and importance. It seemed trivial - at least to me - that the weight of an object had an effect on its estimated value. Later we heard of studies (about the value of wine bottles) that confirmed that heavy objects are more valuable (at least under certain circumstances). 
DL: This idea seemed (and still seems) almost trivial. In general, it seems the direct relation is less interesting than examining how the physical experience of weight influences judgments of things that do not cause the experience of weight. There are now several studies that show such direct links (for example, heavy wine bottles that are perceived to be more valuable) and we have found several of these effects in student projects (e.g., heavy paper cups are valued more, a light computer mouse is seen as less valuable compared to the same mouse with some lead hidden inside).
Question 2 
The inferential step you are making is that the weight of the clipboard gets transferred to the currency. This is an interesting idea. What is the mechanism you see at work here?  
NJ: Probably in the domain of money strong associations exist between weight and value, and apparently it's not so difficult to disguise that the felt weight is actually related to something irrelevant (i.e., the clipboard). 
DL: When we published the studies, my father said: “Ah, so it’s like an association, but then between a physical experience and an abstract concept?” The more time passes, the more I think his description of the mechanism was pretty accurate.
Results (values averaged across currencies and across subjects) indicate that subjects holding the heavy clipboard judged the currencies to be more valuable than those holding the light clipboard, p = .04.

In Study 2, the researchers had subjects judge (again holding the clipboards) the importance of having a voice in a decision-making process. Their goal was to assess the effect of holding a clipboard in an abstract domain.
Question 3 
This seems a big step from Study 1. Do you think the same mechanism underlies responses in the two experiments?
NJ: I tend to think no. We ran Study 2 because we wanted to see whether the link between weight and importance affects judgments on topics that have nothing to do with weight (although some people have rightfully commented that justice is associated with a weighing scale). The mechanism is probably a bit more complex than in Study 1: weight is one dimension on which one can judge potency (i.e., what are the implications of something). It's not new that participants use this kind of dimensions to make judgments unless they can discount them (see Briñol & Petty, 2008, on an elaboration of how bodily cues can affect attitudes and persuasion on various levels).  
DL: I ran this study, and was doing some other studies on morality at the time. So in terms of things I was working on, it was actually a really small step. If you consider that people were most likely performing judgments under uncertainty in Study 1, and how fair something is can also be influenced quite easily, it seemed just more of the same, but a more relevant topic to examine.
Results showed that subjects in the heavy clipboard condition found having a voice in decision making more important than did participants in the light clipboard condition, p < .05.

In Study 3, the authors reasoned that weight is associated with cognitive effort (e.g., it takes greater cognitive planning to move heavy objects than to move light ones). They tested this idea by having subjects (again with the clipboard) engage in a cognitive task and assess effort. Subjects described how much they liked the mayor of the city in which they were living, Amsterdam, and how satisfied they were with the city itself (quite sensibly, the subjects were very satisfied). The operationalization of cognitive elaboration was the correlation between the two types of statements.
Question 4
Again, this seems like a big step from the previous studies. Do you expect the same mechanism to be at work as in the previous studies? 
 NJ: In this study, the outcome could have been a different one: a heavy clipboard could have made participants evaluate the mayor to be more important, powerful, valuable etc. We did not find this but judgment coherence instead. The pattern made sense to us though because the attitude literature argues that coherence can occur when people find a topic important. So, we did not predict exactly this finding (see question 6). Back then, I believed that I had a good explanation and thus did not need to mention that the findings were in fact exploratory (I now think that I should have mentioned it). As for the mechanism: participants probably already had a relatively strong pre-existing attitude towards the mayor (there was some controversy in the media on his political measurements and some people found him weak while other found him strong). The heavy clipboard probably strengthened existing attitudes and made them more coherent but they couldn’t change them completely. Briñol and Petty have written a very interesting chapter on how bodily cues influence attitudes and persuasion. 
Results show that there was no main effect of clipboard but that there was a correlation between the mayor and city evaluations in the heavy clipboard condition (r = .42, p < .05) but not in the light clipboard condition (r = -.23, n.s.). The authors conclude that there was more cognitive elaboration in the heavy clipboard condition than in the light clipboard condition.
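For readers who want to compare the two correlations directly rather than one at a time, a standard option is Fisher's r-to-z transformation for independent samples. The Python sketch below shows the calculation. The per-condition sample size is a hypothetical placeholder, because the sample sizes are not reported in this post, so the printed numbers illustrate the method and say nothing about the actual study. The function name is my own.

```python
from math import atanh, erfc, sqrt

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two correlations
    from independent samples. Returns (z statistic, two-sided p-value)."""
    z1, z2 = atanh(r1), atanh(r2)  # Fisher transformation of each r
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    p_two_sided = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided

# HYPOTHETICAL sample size per condition; replace with the real values
HYPOTHETICAL_N = 25

z_stat, p_value = compare_independent_correlations(0.42, HYPOTHETICAL_N, -0.23, HYPOTHETICAL_N)
print(f"z = {z_stat:.2f}, two-sided p = {p_value:.3f} (with hypothetical n = {HYPOTHETICAL_N})")
```

The result depends heavily on the sample sizes, which is why they have to be filled in before drawing any conclusion.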
Question 5 
I don’t quite understand how this task measures cognitive elaboration. It seems a rather indirect way. Can you clarify?  Also, was this the pattern you had predicted?
 NJ: You are right that cognitive elaboration is just one possible explanation. A better test was done in study 4. 
Study 4 examines the effect of weight on the evaluation of strong versus weak arguments, again in an attempt to investigate the effect of weight on cognitive elaboration. The authors predict that holding the heavy clipboard will cause the subjects to assign proportionally more “weight” to strong arguments and less to weak ones, leading to more polarization in their evaluation of these arguments.
Question 6 
Again, this seems like a big step to me. What is the mechanism you think is at work here?
NJ: probably the same mechanism as in Study 2 and 3: weight signals potency, and if participants were looking for cues how important or valuable the issue at stake was for them, they used the - actually irrelevant - information of the clipboard weight.
The results show an interaction between clipboard and argument strength, p = .008. Although subjects holding the heavy clipboard agreed more with the strong than with the weak arguments (p = .03), this difference was larger in the light clipboard condition (p < .001).

The authors conclude that “weight influences how people deal with abstract issues much as it influences how people deal with concrete objects: It leads to greater investment of effort. In our studies, weight led to greater elaboration of thought, as indicated by greater consistency between related judgments, greater polarization between judgments of strong versus weak arguments, and greater confidence in one’s opinion.”
Question 7  
Your studies focused more on abstract issues than on concrete objects. However, you did not conduct tests on concrete objects. Do you expect that the effect of weight would have been larger, equally large, or smaller if you had used objects? 
NJ: the effect seems to be stronger and more robust if the value of concrete objects is judged. We did the most difficult but perhaps also more interesting studies.
DL: I think that heavy and light objects are much more strongly related to psychological value. So, the effects should be larger. I would guess it is not difficult to find these effects – we have done so a number of times, and so have others.
The article was published and a number of years later, the authors did something remarkable. They posted a “failed” (we still have to establish what it means to say that a replication attempt “failed”) replication of one of the experiments in the paper (Study 3) on the PsychFiledrawer website.
Question 8  
What was your reasoning behind (1) conducting the replication and (2) posting it on the PsychFiledrawer site? 
NJ: we conducted the replication study on psychfiledrawer.org at the same time as the four published studies. As I have already said on psychfiledrawer, I believed back then that there were good reasons why the study failed (noise, changing attitudes, lack of power etc.) and it didn't occur to me that not mentioning it would do any harm. Only later when I learned that people were interested in the replicability of our finding (we received several requests to help with meta analyses) we decided that we should publish the results of the study.
DL: First of all, the ‘replication’ was performed together with the initial studies, but because it was not significant, it was not submitted for publication. We now understand all our studies were underpowered, and not all studies should have been expected to work. When we published the paper, we did the normal thing, and did not mention the non-significant finding. Now, with our increased understanding and thoughts about how you should do science, we wanted to do the right thing.
 Question 9  
How do you evaluate the result of your replication in the context of the original experiment?
 NJ: I still believe that there are good reasons why the study failed: noisy environment and a topic on which public attitudes were changing rapidly at that time. Lack of power was also a problem. 
DL: The robustness of the effect in that study remains uncertain.
There are two other replication attempts on the site performed by other researchers. One is a success (replicating the original Study 2) and the other a failure (not replicating the original Study 2).
Question 10  
How do you evaluate the result of these attempts?  
 NJ: Apparently, it's not so easy to replicate our effects but at least some independent researchers were successful. I think that there are some parameters that we still don't understand that are necessary to take into account to find the weight-importance effect. It would be cool if someone published a paper on when our findings replicate and when not (hopefully experimenter demand or other artefacts are not an issue but if so, I'll be able to live with it). 
DL: In the failed replication, there might have been a ceiling effect, as those authors note. Or, the effect might not be robust. We need meta-analyses to know more (and these are being performed).
IJzerman and colleagues attempted to replicate (Study 2 in their paper) the original Study 2 (apparently the most popular among replicators). They found that subjects holding the heavy clipboard gave higher importance ratings than subjects holding the light clipboard but this difference was not significant, p = .12.
Question 11
What were your reasons for performing this replication attempt? And how do you evaluate the results? Does significance matter in a replication result? 
HIJ: Initially, a former student of mine (Justin Saddlemyer), Sander Koole and I wanted to investigate an individual difference variable that is both relevant to my earlier work (on warmth) and to Jostmann et al's (on weight). We started this project prior to the entire replication debate in psychology. So, the project started as a replication+extension project. Given the entire discussion on replication, we wanted to do an "intermediate reporting" of what we do know (the other results are promising, but we simply don't have enough answers yet to report in a publication).
Also initially, we did not evaluate the replication as properly as we probably should have. I think Uri Simonsohn's method is a useful one. We had submitted the project to PloSOne, but for some odd reason PloSOne is doubting the ethical procedures. We hope to get that sorted out and will do the rewrite, including Uri's method of evaluating the effect sizes. So no, we don't necessarily think the p value is the crucial way of evaluating.
 Question 12 
How do the original authors evaluate this replication attempt and its result? 
NJ: Hans does not provide detailed information about how the study was run. The weight was different and the study was underpowered (as were ours). It's difficult to say why it didn't work. 
DL: I don’t know the sample size in that study, but significance per se is less interesting when all studies are underpowered. Again, we have to wait for the meta-analysis.
HIJ: Agreed on the underpowered. We still think it is useful to report these studies, but agree that if we were to run another study, we would probably do a registered report, examining all the details that we present in the Replication Recipe. If the present study were to be published, by the way, all details of the study will be uploaded to Dataverse, so the amount of detail will probably be greater than what we currently include in our research summaries (i.e., publication).   
 Question 13 
How do you evaluate the usefulness of replications in general? Should researchers try to replicate their own results? 
NJ: yes, they should whenever possible. 
HIJ: Agreed, but ideally another lab should be able to replicate our studies. For this to happen, we do need to start reporting more detail of our studies.   
 Question 14 
Taking all the empirical evidence together, how much support is there for the notion that weight influences judgments of importance?
NJ: there is some support and I still believe that the link exists. Too many independent researchers (see M. Hafner, Experimental psychology) have successfully replicated our effects (even close replications) to make me think that we are dealing with a false positive. The effect might not be as strong as we thought though.
HIJ: I think more so than most social psychology studies. That said, many studies - including ours - are underpowered. Given what Nils mentions and the general theoretical premise, I agree that it is unlikely that this is a false positive.
DL: Well, there are some successful replications, many of the studies show effects in the right direction, and we seem to have a rather nice number of studies for a meta-analysis. I think overall there might be an indication something is going on, but we don’t have a good grasp on the size of the effect, and the factors that influence the effect size. Still, it seems an interesting candidate to examine further. The effect of weight on concrete objects seems pretty large – the question of whether it extends to abstract concepts requires further examination. (Other replication attempts of the weight-importance study are listed at the bottom of this page at Lakens' site.)
Question 15 
Do you have any additional comments?  
NJ: When I was a grad student I was surprised to see some researchers being very defensive regarding “their” effect. I’d like to thank all the people who contacted us about the reliability and validity of the weight-importance effect. They reminded me that effects are public property and not personal belongings. 
HIJ: I think for researchers it may sometimes be nerve-wracking if people try to replicate, in particular if they truly believe in certain effects. After all, your world view in a way is being violated. But, then again, I think these are exciting times, in that we get to know much more about effect sizes, how different effects scale up to one another, and what contextual factors are important in reproducing effects.  
TS: It was interesting though to see how you summarized the paper, and it made me realize something. Our paper was a combination of the embodiment idea that abstract concepts are grounded in concrete experience, and work on persuasion. Studies 1 and 2 were about the embodiment notion - importance of abstract issues, like monetary value and voice, were influenced by concrete experiences. Then, Studies 3 and 4 combined this with work on persuasion. Although we did not actually study persuasion, we measured outcomes of social cognition processes identified in work on attitudes in the persuasion literature. Our interpretation of these results was that the patterns (alignment of related attitudes and polarization of strong ones vs. weak ones) reflected effortful processing. It may have been pretty oblique in the paper, but there is a large body of work on attitudes in social cognition behind this notion. 
In hindsight, it might have made more sense to write two separate papers about these two parts, and to elaborate on each one more. Interestingly, the people who have followed up on this clearly were more interested in the first idea. 
Given that this is "a candid blog," it will surprise no one that I very much appreciate the candor and lack of defensiveness that is evident in these responses. I couldn't have summarized this discussion any better than Nils Jostmann just did: "effects are public property and not personal belongings." I look forward to hearing more about the meta-analyses of the weight-importance effect.

Tuesday, October 8, 2013

David Sedaris and the Power of the Spoken Word

Last week, David Sedaris gave a reading in Amsterdam as part of his latest book tour. When the performance was over, it dawned on me that something remarkable had happened. More than a thousand Dutch people had just stared for almost two hours at a soft-spoken and not physically imposing American man who was reading from sheets of paper. What was going on?

In our modern culture we don't seem to be able to get by without visuals. Schoolbooks are littered with photographs, diagrams, and figures. Most professors are incapable of lecturing without PowerPoint. News programs feature a plethora of graphs, pie charts, and animations. Heck, there even is a photograph on the left of this paragraph!

David Sedaris didn’t strut and prance across the stage while gesturing maniacally like a stand-up comedian or an overly excited TED-talker. He didn’t bring any visual props with him and certainly no PowerPoint presentation. Instead he was standing rather motionlessly behind a lectern, reading from a piece of paper in a deadpan manner with a slightly high-pitched voice.

Nevertheless, the audience seemed spellbound and rounds of laughter echoed through the theater. The show was over before I knew it. And I realized that along the way I, and probably the others in the audience as well, had been effectively transported to a taxidermist’s store in North London, a quiet village in Normandy, and an American hotel. And no visual aids were needed to accomplish this. So what was going on?

There are two types of situations that play a role in linguistic communication. One is the situation in which the communication takes place, in this case the Amsterdam theater Carré with a 1000+ audience; the other is the situation that the communication is about, let’s say the taxidermy store in North London. We can call these the communicative and the referential situation (the situation that we want to understand), respectively. In linguistic communication, these two types of situations have an elastic bond. Sometimes they overlap almost completely and at other times there is hardly any overlap at all.

Let’s first look at an example of where there is almost perfect overlap: a cooking show. Here the speaker talks about the situation he is acting in. The Portobello mushroom that Jamie Oliver is referring to is right there in front of him; it is not some fictional fungus. The actions that he is describing—slicing and seasoning the mushroom—are the actions he is performing at this very moment. The person he is calling “I” is the person who is simultaneously speaking and performing the actions: Jamie Oliver. The role of language is to direct attention across the visual scene. Naming an object prompts the eye to fixate that aspect of the scene and to encode whatever is there. The ingredients for understanding are readily available and language points us to them.

A moderate level of overlap occurs when a past or future state of the environment is projected onto the current environment. A friend who has recently remodeled his house might point to a kitchen island and explain that a wall used to stand there and that where the breakfast nook is right now there used to be the back door. Or the reverse might happen. Our friend might tell us about his remodeling plans. The wall between the kitchen and living room will be torn out to make room for a kitchen island. To understand the past or future situation, the listener can make use of various cues in the communicative situation. Eye movements serve to mark locations where objects or individuals were in the past or are expected to be in the future. All the listener needs to do is to imagine an object, person, or action in that location (presumably after having suppressed the object that’s actually there).

Finally, there are cases where there is practically no overlap between the communicative and referential situation. And this is the case that concerns us here: David Sedaris and the 1000 to 1200 Dutchmen.

There is no information in the communicative environment to focus attention on (no visual information at least), so language has to do the heavy lifting. The referential situation cannot be piggybacked onto the communicative situation (although there is some evidence that even in the absence of relevant cues people make meaningful eye movements). Language cannot be used to point to things that are already there. This probably explains why Sedaris’ prose is a lot more intricate than Jamie Oliver’s. The latter doesn’t have to put much thought into the composition of a sentence; he can make do with a simple I’m putting the garlic into the pan. Sedaris, on the other hand, has to craft a sentence with exquisite precision. He has to create a situation just from words. Of course, this is a problem faced by all novelists and some of them are quite successful at conjuring up fictional worlds.

What, then, is the added value of going to the theater to hear somebody read from his own work? (And why are people paying good money to do so?) In interviews Sedaris explains that when he writes a story, he reads it aloud to himself and makes changes until it reads well. His stories are designed to be read aloud. In the theater it was clear that Sedaris keenly anticipates and monitors the responses from his audience and, like a stand-up comedian, times and intonates his utterances accordingly for maximum effect.

The pacing of the telling of the story facilitates the mental transportation of the audience from the Amsterdam theater over to the North London taxidermy shop. It facilitates the audience’s ability to resonate to the narrator’s combination of fascination and politely suppressed horror when he is successively shown the skeleton of a Pygmy, a severed arm with a tattoo on it, and the head of a 13-year-old Peruvian girl. It also facilitates understanding the narrator’s feeling of wonder that the shopkeeper instantly knew him for what he was: the type who’d actually love a Pygmy, and could easily get over the fact that he’d been murdered for sport, thinking, breezily, “Well, it was a long time ago.”

The audience signals its understanding (and appreciation) of the story by emitting gales of laughter. These, in turn, determine the speaker’s highly skilled timing and intonation of upcoming phrases. This heightens the audience’s involvement in the story world. This way, an effective feedback loop is created. It is a subtle form of alignment between speaker and listeners.

The lack of visual props probably heightens this effect. There is nothing in the communicative situation to exert what Andy Clark calls a “gravitational pull” on the audience—pulling it back into the communicative situation—so that all attention can be devoted to the story as it unfolds at the pace determined by the speaker, expertly tailored to the audience’s immediate responses.

At least, that’s what I think was going on that night in Amsterdam. 

Friday, September 27, 2013

30 Questions about Priming with Science and the Department of Corrections

We know about claims that priming with “professor” makes you perform better on a general knowledge test but apparently the benefits of science don’t stop there. A study published earlier this year reports findings that priming with science-related words (logical, theory, laboratory, hypothesis, scientists) makes you more moral. Aren’t we scientists great or what? But before popping the cork on a bottle of champagne, we might want to ask some questions, not just about the research itself but also about the review and publishing process involving this paper. So here goes.

(1) The authors note (without boring the reader with details) that philosophers and historians have argued that science plays a key role in the moral vision of a society of “mutual benefit.” From this they derive the prediction that this notion of science facilitates moral and prosocial judgments. Isn’t this a little fast?
(2) Images of the “evil scientist” (in movies usually portrayed by an actor with a vaguely European accent) pervade modern culture. So if it takes only a cursory discussion of some literature to form a prediction, couldn’t one just as easily predict that priming with science makes you less moral? I’m not saying it does of course; I’m merely questioning the theoretical basis for the prediction.
(3) In Study 1, subjects read a date rape vignette (a little story about a date rape). The vignette is not included in the paper. Why not? There is a reference to a book chapter from 2001 in which that vignette was apparently used in some form (was it the only one by the way?) but most readers will not have direct access to it, which makes it difficult to evaluate the experiment. In other disciplines, such as cognitive psychology, it has been common for decades to include (examples of) stimuli with articles. Did the reviewers see the vignette? If not, how could they evaluate the experiments?
(4) The subjects (university students from a variety of fields) were to judge the morality of the male character’s actions (date rape) on a scale from 1 (completely right) to 100 (completely wrong). Afterwards, they received the question “How much do you believe in science?” For this a 7-point scale was used. Why a 100-point scale in one case and a 7-point scale in the other? The authors may have good reasons for this but they play it close to the vest on this one.
(5) In analyzing the results, the authors classify the students’ field of study as a science or a non-science. Psychology was ranked among the sciences (with physics, chemistry, and biology) but sociology was deemed a non-science. Why? I hope the authors have no friends in the sociology department. Communication was also classified as a non-science. Why? I know many communication researchers who would take issue with this. The point is, this division seems rather arbitrary and provides the researchers with several degrees of freedom.
(6) The authors report a correlation of r=.36, p=.011. What happens to the correlation if, for example, sociology is ranked among the sciences?
(7) Why were no averages per field reported, or at least a scatterplot? Without all this relevant information, the correlation seems meaningless at best. Weren't the reviewers interested in this information? And how about the editor?
(8) Isn’t it ironic that the historians and philosophers, who in the introduction were credited with having introduced the notion of science as a moral force in society, are now hypothesized to be less moral than others (after all, they were ranked among the non-scientists)? This may seem like a trivial point but it really is not when you think about it.
(9) Study 2 uses the vaunted “sentence-unscrambling task” to prime the concept of “science.” You could devote an entire blog post to this task but I will move on only to make a brief observation. The prime words were laboratory, scientists, hypothesis, theory, and logical. The control words were…. Well what were they? The paper isn’t clear about it but it looks like paper and shoes were two of them (there’s no way to tell for sure and apparently no one was interested in finding out). 
(10) Why were the control words (assuming shoes and paper are representative of this category) not long, low-frequency words that are low in imageability, like the primes? Now the primes stick out like a sore thumb among the other words from which a sentence has to be formed whereas the control words are a much closer fit.
(11) Doesn’t this make the task easier in the control condition? If so, there is another confound.
(12) Were the control words thematically related, like the primes obviously were?
(13) If so, what was the theme? If not, doesn’t it create a confound to have salient words in the prime condition that are thematically related and can never be used in the sentence and to have non-salient words in the control condition that are not thematically related?
(14) Did the researchers inquire after the subjects’ perceptions of the task? Weren't the reviewers and editor curious about this?
(15) Wouldn’t these subjects have picked up on the scientific theme of the primes?
(16) Wouldn’t this have affected their perceptions of the experiment in any way?
(17) What about the results? What about them indeed? Before we can proceed, we need to clear up a tiny issue. It turns out that there are a few booboos in the article. An astute commenter on the paper had noticed anomalies in the results of the study and some impossibly large effect sizes. The first author responded with a string of corrections. In fact, no fewer than 18 of the values reported in the paper were incorrect. Here, I’ve indicated them for you.

You will not find them in the article itself. The corrections can be found in the comment section.
(18) It is a good thing that PLoS ONE has a comment section of course. But the question is this. Shouldn’t such extensive corrections have been incorporated in the paper itself? People who download the pdf version of the article will not know that pretty much all the numbers that are reported in the paper are wrong. That these numbers are wrong is the author’s fault but at least she was forthcoming in providing the corrections. It would seem to be the editor's and publisher's responsibility to make sure the reader has easy access to the correct information. The authors would also be served well by this.
(19) In her correction (which is about 25% the size of the original paper), the first author explains that the first three studies were rerun because the reviewer requested different, more straightforward dependent variables that directly assessed morality judgments, rather than the dependent variables used in the original submission, which assessed related judgments about punitiveness or blame or were too closely tied to the domain of science. Apparently, many of the errors occurred because the manuscript was not properly updated with the new information. Why did the reviewers and editor miss all of these inconsistencies, though?
(20) And what happened to the discarded experiments? Surely they could have been included along with the new experiments? There are no word limitations at PLoS ONE.  Having authored a 14-experiment paper that was recently published in this journal, I'm pretty sure I'm right on this one.

Let’s return to the paper armed with the correct (or so we assume) results.

(21) The subjects in Study 2 were primed with “science” or read the neutral words (which were not provided to the reader) and then read the date rape vignette (which was not provided to the reader) and made moral judgments about the actions in the vignette (whatever they were). The corrected data show that the subjects in the experimental condition rated the actions as more immoral than did those in the control condition. However, as the correction also states, the standard deviation was much higher in the control condition (28.02) than in the experimental condition (7.96). These variances are highly unequal; doesn’t this compromise the t-test that was reported?
(22) The corrections mention that the high variance in the neutral condition is caused by two subjects, one giving the date rape a 10 on the 100-point scale (in other words, finding it highly acceptable) and the other a 40. The average for that condition is 81.57, so aren’t these outliers, at least the 10 score? (By the way, was this date-rape approving subject reported to the relevant authorities?)
(23) In Study 3 subjects received the same priming manipulation as in Study 2 and they rated the likelihood that they would engage in each of several activities in the next month, some of which were prosocial and some of which were not. The prosocial actions listed were giving to charity, giving blood, and volunteering. Were these all the actions that were used in the experiment? It is not clear from the paper.
(24) Were the values that were used in the statistical test the averages of the responses to the categories of items (e.g., the average rating for the three prosocial actions)?
(25) And what happened to the non-prosocial activities? Shouldn't a proper analysis have included those in a 2 (prime) by 2 (type of activity) ANOVA? 
(26) If this analysis is performed, is the interaction significant?
(27) In the corrected data the effect size is .85. Doesn’t this seem huge? Readers of my previous post already know the answer: Yes, to the untrained eye perhaps but it is the industry standard (Step 7 in that post).
(28) The corrections state that Study 4 originally contained a third condition but that it was left out at the behest of a reviewer who felt that it muddles rather than clarifies the findings (yes, we wouldn’t want the findings to be muddled, would we?). I appreciate the honesty but was everyone, including the editor, on board with this serious amputation?
(29) The initial version of the corrections (yes, I forgot to mention that there were two versions of corrections) mentioned that there were 26 participants in the control condition and 17 in the experimental condition. Where does this huge discrepancy come from? And does it affect the analyses?
(30) In the discussion it is mentioned that Study 2 investigated academic dishonesty. This was one of the experiments that were dropped, right? Another (minor) addition for the corrections perhaps.
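
For readers who want to put numbers on questions (21) and (22), here is a minimal sketch in Python that uses only the values quoted above from the corrections (control mean 81.57, control SD 28.02, experimental SD 7.96). The function names are my own, and `welch_t` is a standard unequal-variance alternative that I am swapping in for illustration; it is not the test reported in the paper. I do not call it here because the experimental mean and the group sizes for Study 2 are not given above.

```python
import math


def variance_ratio(sd_a, sd_b):
    """Ratio of the larger variance to the smaller one."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


def z_score(value, mean, sd):
    """Distance of a single score from the group mean, in SD units."""
    return (value - mean) / sd


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t and Welch-Satterthwaite df from summary statistics.

    Illustrative helper (not from the paper); it does not assume
    equal variances in the two groups.
    """
    se2_a = sd_a ** 2 / n_a
    se2_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Values quoted in questions (21) and (22).
CONTROL_MEAN, CONTROL_SD, PRIME_SD = 81.57, 28.02, 7.96

print(round(variance_ratio(CONTROL_SD, PRIME_SD), 2))        # about 12.39
print(round(z_score(10, CONTROL_MEAN, CONTROL_SD), 2))       # about -2.55
print(round(z_score(40, CONTROL_MEAN, CONTROL_SD), 2))       # about -1.48
```

So the control variance is roughly twelve times the experimental variance. Note that the z-scores are computed with an SD that the two low scores themselves inflate, so if anything they understate how far the score of 10 sits from the rest of the group.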

I guess there are a great many more questions to ask but let me stop here. The article uses logical, hypothesis, theory, laboratory, and scientist as primes. I can make a sentence out of those: Absent a theory, it is logical that there is no basis for the hypothesis that was tested in the laboratory and (sloppily) reported by the scientist.

[Update, April 10, 2014. As I found out only recently (if you're forming a rapid response team, don't forget not to invite me), back in September of last year, the first author of the PLoS ONE article addressed (most of) these questions in the comments section of that article. The response provides more information and acknowledges some weaknesses of the study.]