Thursday, April 24, 2014

Why do We Make Gestures (Even when No One Can See Them)?

[Image caption: The gesture doesn't work all the time]
Why do we gesture? An obvious answer is that we gesture to communicate. After having just taken down his opponent in a two-legged flying tackle, the soccer player puts on his most innocent face while making a perfect sphere with his hands. This gesture conveys the following thought: “Ref, I was playing the ball! Sure, my opponent may be lying there writhing in pain and will soon be carried off on a stretcher but that’s beside the point. I do not deserve a red card.”

But we also gesture when our conversation partner cannot see us. Years ago I saw a madwoman walking in the Atlanta airport. She seemed to be talking to no one in particular while gesticulating vehemently. For a moment I was worried she might pull out a machine gun and mow us all down. But when she got closer I noticed she was speaking into a little microphone that was connected to her mobile phone (a novelty at the time). Evidently, the person who was on the receiving end of her tirade could not see her maniacal gestures.

So why do we gesture when no one can see our hands? According to one explanation, such gestures are merely habitual. We’re used to making gestures, so we keep making them even when nobody can see them. It’s a bit like a child on a tricycle. He lifts his legs but the pedals keep rotating. There is motion but it is not functional. The problem with this explanation is that it implies that we expend a lot of energy on a useless activity.

An alternative explanation proposes that gesturing is functional. It helps us retrieve words from our mental lexicon. The speaker says “At the fair we went into a uh…” He falls silent and makes a circular motion with his hand. He then looks relieved and finishes the sentence with “Ferris wheel.” The idea here is that the motoric information that drives our gestures is somehow connected with our word representations in our mental lexicon. The latter get activated when the gesture is made. Though plausible, a problem with this explanation is that it does not specify why the gesture is needed in the first place. If the motor program that drives the gesture is already present in the brain, then why loop out of the brain to make the gesture?

In a paper coming out today in Frontiers in Cognitive Science, a group of us—spearheaded by graduate student Wim Pouw—ventures an answer to this question.* People make noncommunicative gestures to reduce memory load and to externally manipulate information. We need to keep concepts active across stretches of time while performing a task, for instance solving a problem or conversing over the telephone. Rather than relying on memory to keep this information active, we outsource this task to the hands. They provide proprioceptive and visuospatial information that sustains access to a concept over time and allow us to perform operations on it (for instance manual rotation).

Support for this proposal comes from several sources. One is a classic paper by David Kirsh and Paul Maglio. Kirsh and Maglio observed that expert Tetris players often rotate objects on the screen before inserting them into their slots. They could have used mental rotation but instead preferred to rely on what Kirsh and Maglio call epistemic actions: operations on the environment that support thought.

Another line of support for our proposal comes from research on abacus users. People who learned arithmetic on an abacus make bead-moving finger gestures during mental arithmetic, when there is no abacus available. The better they are at mental arithmetic, the fewer gestures they make. This is consistent with the notion that noncommunicative gestures are epistemic actions that serve to reduce memory load. When you’re better at a task, performing it requires fewer memory resources, so you need to rely less on gestures.

So the next time you see people make gestures to no one in particular, you know that they’re just giving their memory a little hand. And if you want to know more about this topic, just read our paper.

* Our proposal was inspired by work by Richard Wesp and colleagues and by Andy Clark (see paper for references).

Tuesday, April 8, 2014

The Undead Findings are Among Us

A few months ago, I was asked to review a manuscript on social-behavioral priming. There were many things to be surprised about in this manuscript, not the least of which was that it cited several papers by Diederik Stapel. These papers had already been retracted, of course, which I duly mentioned in my review. It has been said that psychology is A Vast Graveyard of Undead Theories. These post-retraction Stapel citations suggest that this cemetery might be haunted by various undead findings (actually, if they were fabricated, they weren’t really alive in the first place but let's not split semantic hairs).

There are several reasons why someone might cite a retracted paper. The most obvious reason is that they don’t know the paper has been retracted. Although the word RETRACTED is splashed across the first page of the journal version of the article, it will likely be absent on other versions that can still be found on the internet. Researchers working with such a version might be forgiven for being unaware of the retraction.

But citing Stapel??? It is not as if the press, bloggers, colleagues at the water cooler, the guy next to you on the plane, not to mention Retractionwatch, haven’t been all over this case!

A second reason for citing a retracted article is, obviously, to point out the very fact that the paper has been retracted. Nevertheless, a large proportion of citations to retracted papers are still favorable, just like the Stapel citations.

[Image caption: "Don't expect any help from us."]
Does this imply that retracted findings have a lasting pollutive effect on our thinking? A recent study suggests they do. The Austrian researcher Tobias Greitemeyer presented subjects* with findings from a now-retracted study by Lawrence Sanna (remember him?). Sanna reported that elevated height (e.g., riding up escalators) led to more prosocial (prosocial being the antonym of antisocial) behavior than lowered height (e.g., riding down escalators). The data turned out to be fabricated, which is why the paper was retracted.

Greitemeyer formed three groups of subjects. He told the first two groups about the Sanna study but not the third group. All subjects then rated the relationship between physical height and prosocial behavior.

Next the subjects wrote down all their ideas about this relationship. At the end of the experiment, half of the subjects who had received the summary (the debriefing condition) learned that the article had been retracted because of fabricated data and that there was no scientific evidence for the relation between height and prosocial behavior. Subjects in the no-debriefing and the control condition did not receive this information. Finally, all three groups of subjects responded to the same two items about height and prosocial behavior that they had responded to earlier.

As you might expect, the subjects in the debriefing and no-debriefing conditions made stronger estimates about the relation on the initial test than did those in the control condition. More interesting are the responses on the second test, after the debriefing condition (but not the other two conditions) had heard about the retraction. On this test the subjects in the no-debriefing condition had the highest score.  But the crucial finding was that the debriefing condition still exhibited a stronger belief in the relation between height and prosocial behavior than did the control condition. So, the debriefing lowered belief in the relation but not sufficiently.

Greitemeyer provides an explanation for these effects. It turns out that the number of explanations that subjects gave for the relationship between height and prosocial behavior correlated significantly with post-debriefing beliefs. A subsequent analysis showed that belief perseverance in the debriefing condition appeared to be attributable to causal explanations. So retraction does not lead to forgetting, and this cognitive inertia occurs because people have generated explanations of the purported effect, which presumably lead to a more entrenched memory representation of the effect.

But we need to be cautious in interpreting these results. First, it is only one experiment. A direct replication of these findings (plus a meta-analysis that includes the two experiments) seems in order. Second, some of the effects are rather small, particularly the important contrast between the control and the no-debriefing condition. In other words, this study is a perfect candidate for replicating up.

After a successful direct replication, conceptual replications would also be informative. As Greitemeyer himself notes, a limitation of this study is that the subjects only read a summary of the results and not the actual paper. Another is that the subjects were psychology students rather than active researchers. Having researchers read the entire paper might produce a stronger perseverance effect, as the entire paper likely provides more opportunities to generate explanations and the researchers are presumably more capable of generating such explanations than the students in the original experiment were. On the other hand, researchers might be more sensitive to retraction information than students, which would lead us to expect a smaller perseverance effect.

Greitemeyer makes another interesting point. The relation between height and prosocial behavior seems implausible to begin with. If an effect has some initial plausibility (e.g., meat eaters are jerks) retraction might not go very far in reducing belief in the relation.

So if Greitemeyer’s findings are to be believed, a retraction is no safeguard against undead findings. The wights are among us...

*The article is unfortunately paywalled

Thursday, April 3, 2014

Replicating Down vs. Replicating Up

More and more people are involved in replication research. This is a good thing.

Why conduct replication experiments? A major motivation for recent replication attempts appears to have been serious doubt about certain findings. On that view, unsuccessful replications serve to reduce the initially observed effect size into oblivion. I call this replicating down. Meta-analytically speaking, the aggregate effect size becomes smaller with each replication attempt and confidence in the original finding will dwindle accordingly (or so we would like to think). But the original finding will not disappear from the literature.

[Image caption: No, I'm not Noam Chomsky]
Replicating down is definitely a useful endeavor but it can be quite discouraging. You’re conducting an experiment that you are convinced doesn’t make any sense at all. Suppose someone conducted a priming study inspired by a famous quote from Woody Allen’s Husbands and Wives: I can't listen to that much Wagner. I start getting the urge to conquer Poland. Subjects were primed with Wagner or a control composer (Debussy?) and then completed an Urge-to-Conquer-Poland scale. The researchers found that the urge to conquer Poland was much greater in the Wagner than in the Debussy condition (in that condition, however, people scored remarkably higher on the Desire-to-Walk-Around-with-Baguettes scale). The effect size was large, d=1. If you are going to replicate this and think the result is bogus, then you’re using valuable time and resources that could have been spent toward novel experiments. Plus you might feel silly performing the experiment. The whole enterprise might feel all the more discouraging because you are running the experiment with twice or more the number of subjects that were used in the original study: an exercise in futility but with double the effort.

Other replication attempts are conducted because replicators have at least some confidence in the original finding (and in the method that produced it) but want to establish how robust it is. We might call this replicating up. A successful replication attempt shores up the original finding by yielding similar results and providing a more robust estimate of the effect size. But how is this replicating up? Glad you asked. Up doesn’t mean enlarging the effect size; it means raising the confidence we can have in the effect.

So while replicating down is certainly a noble and useful enterprise, a case could be made for replicating up as well. A recent nice example appears in a special topics section of Frontiers in Cognition that I’m co-editing. My colleagues Peter Verkoeijen and Samantha Bouwmeester performed a replication of an experiment by Kornell and Bjork (2008) that was published in Psychological Science. This experiment compared spaced (or actually “interleaved”) and massed practice in learning painting styles. In the massed practice condition, subjects saw blocks of six paintings by the same artist. In the spaced condition, each block contained six paintings by six different artists. Afterwards, the subjects participated in a recognition test. Intuitively you would think that massed practice would be more effective. Kornell and Bjork thought so initially, as did the subjects in the experiments. Kornell and Bjork were therefore surprised to find that interleaved practice was actually more effective.

Verkoeijen and Bouwmeester replicated one of Kornell and Bjork’s experiments. One difference from the original experiment, which was run in the lab, was that the replication was run on Mechanical Turk. However, given that several other replication projects had shown no major differences between MTurk  experiments and lab experiments, there was no reason to think the effect could not be found in an online experiment. As Verkoeijen and Bouwmeester note:

For one, nowhere in their original paper do Kornell and Bjork (2008) indicate that specific sample characteristics are required to obtain a spacing effect in inductive learning. Secondly, replicating the effect with a sample from a more heterogeneous population than the relatively homogeneous undergraduate population would constitute evidence for the robustness and generality of the spacing effect in inductive learning and, therefore, would rule out that the effect is restricted to a rather specific and narrow population.

To cut to the chase, the replication attempt was successful (read the paper for a thoughtful discussion on this). Just as in the original study, the replication found a significant benefit for interleaved over massed practice. The effect sizes for the two experiments were quite similar. As the authors put it:

Our results clearly buttress those of Kornell and Bjork (2008) and taken together they suggest that spacing is indeed beneficial in inductive learning.

This is a nice example of replicating up. Moreover, the experiment has now been brought to a platform (MTurk) where any researcher can easily and quickly run replication attempts.

It seems that I’ve basically extolled the virtues of successful replication. After all, isn’t any successful replication an upward replication? Of course it is. But I’m not talking about the outcome of the replication project. I’m talking about the motivation for initiating it. Replicating down and replicating up are both useful but in the long run upward replication is going to prove more useful (and less frustrating).

Perhaps a top-tier of journals should be created for solid findings in psychology (see Lakens & Koole, 2012 for a similar proposal). This type of journal would only publish findings that have been thoroughly replicated. The fairest way to go about this would be to have the original authors as first authors and the replicators as co-authors. Rather than trying to remove nonreplicable findings from the literature via downward replication, upward replication basically creates a new level in the literature, entrance to which can only be gained via upward replication.

(I thank Peter Verkoeijen for pointing me toward the Woody Allen quote)

[update April 22, 2014: in my next post I discuss a study that would be a good candidate for replicating up.]

Friday, February 7, 2014

Back to the Future

One of my last actions as Editor-in-Chief of Acta Psychologica was to accept a manuscript for publication that is very timely given the current “crisis of confidence” in psychology. One of the paper's key points is that it is crucial to distinguish between hypothesis testing and exploratory research.

The paper chimes in with many critics of current practices in psychology when it asserts that [i]t is essential that these hypotheses have been precisely formulated and that the details of the testing procedure (which should be as objective as possible) have been registered in advance. Several journals, such as Cortex, as well as special issues of journals already require authors to preregister their submissions and the Open Science Framework offers an extremely user-friendly platform to do just this.

The paper makes a clear distinction between hypothesis-generating and hypothesis-testing research and argues that researchers regularly conflate the two, passing off exploratory research as confirmatory—a very apt description of current research practices. In exploring the data, the paper continues, researchers try to extract from the material what is in it but necessarily also what is accidentally in it. And thus the researcher proceeds by trying and selecting, whereby the selection is based on whether it promises to produce interesting (i.e., significant) results. By operating in this fashion the researcher is capitalizing on coincidences. The paper then goes on to explain what the problem with this practice is by using the example of rolling a die (hmm, where have we seen this example before?).

The paper ends with a rather stern conclusion: If the processing of empirically obtained material has in any way an “exploratory character”, i.e. if the attempt to let the material speak leads to ad hoc decisions in terms of processing, as described above, then this precludes the exact interpretability of possible outcomes of statistical tests. This conclusion resonates well with comments made by other researchers in the current debate. Perhaps it is not surprising then that the paper modestly acknowledges: this conclusion is not new.

[Image caption: Adriaan Dingemans de Groot (1914-2006)]
So if the conclusion is not new, then what’s so special about this paper? Well, it was published in 1956, the year that the Soviets invaded Hungary, Elvis Presley entered the music charts for the first time, and Dwight D. Eisenhower was re-elected as President of the United States. The current one, Barack Obama, hadn’t been born yet, John Lennon and Paul McCartney had not even met, and Donald Trump (probably) still had normal hair. The internet was decades into the future and so was the “crisis of confidence” in psychology.

The article that I'm talking about here was written by Adriaan de Groot, a Dutch psychologist, who became internationally famous for his research on thought in chess, which had a major influence on Nobel Prize winner Herb Simon (as well as on my former colleagues Anders Ericsson and Neil Charness). De Groot also developed an intelligence test that all Dutch children are required to take at the end of elementary school and which selects them for various tracks of higher education (I vividly remember taking that test when I was eleven).

A group of researchers from the University of Amsterdam, led by Eric-Jan Wagenmakers, has now provided an annotated English translation of De Groot’s article. This is fitting because De Groot held a professorship in research methods in the Department of Psychology at the University of Amsterdam. As I said at the beginning of this post, this article is currently in press in Acta Psychologica, which is also apt given the Dutch origin of that journal, but the author version can be downloaded here legally and for free from Wagenmakers’ site (also appropriate, given the Dutch stereotypical thriftiness).

In their annotations Wagenmakers and his colleagues make the sad observation that De Groot’s original article has been cited only twice (!) to date. I expect that this translated version will receive the number of citations that the original already deserved.

Monday, January 20, 2014

Why Social-Behavioral Primers Might Want to be More Self-critical

During the investigation into the scientific conduct of Dirk Smeesters, I expressed my incredulity about some of his results to a priming expert. His response was: You don’t understand these experiments. You just have to run them a number of times before they work. I am convinced he was completely sincere.

What underlies this comment is what I’ll call the shy-animal mental model of experimentation. The effect is there; you just need to create the right circumstances to coax it out of its hiding place. But there is a more appropriate model: the 20-sided-die model (I admit, that’s pretty spherical for a die but bear with me).

A social-behavioral priming experiment is like rolling a 20-sided die, an icosahedron. If you roll the die a number of times, 20 will turn up at some point. Bingo! You have a significant effect. In fact, given what we now know about questionable and not so questionable research practices, it is fair to assume that the researchers are actually rolling a 20-sided die where maybe as many as six sides have a 20 on them. So the chances of rolling a 20 are quite high.
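The arithmetic behind the analogy is easy to check. A fair d20 gives a .05 chance per roll, exactly the conventional alpha level; a die with six "20" faces gives .30 per roll. A minimal sketch (Python; purely illustrative numbers, not from any study):

```python
# Chance of rolling at least one "significant" 20 in n attempts.
# A fair d20 mimics alpha = .05; a die with six "20" faces mimics
# the boost from questionable research practices (QRPs).

def p_at_least_one_20(n_rolls, p_per_roll):
    """Probability of seeing at least one 20 in n_rolls independent rolls."""
    return 1 - (1 - p_per_roll) ** n_rolls

for n in (1, 5, 10, 20):
    fair = p_at_least_one_20(n, 1 / 20)    # honest experiment
    loaded = p_at_least_one_20(n, 6 / 20)  # experiment plus QRPs
    print(f"{n:2d} rolls: fair die {fair:.2f}, loaded die {loaded:.2f}")
```

With the fair die, ten attempts already give about a .40 chance of a hit; with the loaded die, ten attempts make a hit nearly certain (about .97). That is the whole point of the analogy.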

[Image caption: I didn't know they existed but a student who read this post brought this specimen to class; she uses it for Gatherer games.]
Once the researchers have rolled a 20, their next interpretive move is to treat the circumstances that happened to coincide with the roll as instrumental in producing the 20. The only problem is that they don't know what those circumstances were. Was it the physical condition of the roller? Was it the weather? Was it the time of day? Was it the color of the roller's sweater? Was it the type of microbrew he had the night before? Was it the bout of road rage he experienced that morning? Was it the cubicle in which the rolling experiment took place? Was it the fact that the roller was a 23-year-old male from Michigan? And so on.

Now suppose that someone else tries to faithfully recreate the circumstances that co-occurred with the rolling of the 20, from the information that was provided by the original rollers. They recruit a 23-year-old male roller from Michigan, wait until the outside temperature is exactly 17 degrees Celsius, make the experimenter wear a green sweater, have him drink the same IPA on the night before, and so on.

Then comes the big moment. He rolls the die. Unfortunately, a different number comes up— a disappointing 11. Sadly, he did not replicate the original roll. He tells this to the first roller, who replies: Yes, you got a different number than we did but that’s because of all kinds of extraneous factors that we didn’t tell you about because we don’t know what they are. So it doesn’t make sense for you to try to replicate our roll because we don’t know why we got the 20 in the first place! Nevertheless, our 20 stands and counts as an important scientific finding.

That is pretty much the tenor of some contributions in a recent issue of Perspectives on Psychological Science that downplay the replication crisis in social-behavioral priming. This kind of reasoning seems to motivate recent attempts by social-behavioral priming researchers to explain away an increasing number of non-replications of their experiments.

Joe Cesario, for example, claims that replications of social-behavioral priming experiments by other researchers are uninformative because any failed replication could result from moderation, although a theory of the moderators is lacking. Cesario argues that initially only the originating lab should try to replicate its findings. Self-replication is in and of itself a good idea (we have started doing it regularly in our own lab) but as Dan Simons rightly remarks in his contribution to the special section: The idea that only the originating lab can meaningfully replicate an effect limits the scope of our findings to the point of being uninteresting and unfalsifiable.

[Image caption: Show-off! You're still a "false positive."]
Ap Dijksterhuis also mounts a defense of priming research, downplaying the number of non-replicated findings. He talks about the odd false positive, which sounds a little like saying that a penguin colony contains the odd flightless bird (I know, I know, I'm exaggerating here). Dijksterhuis claims that it is not surprising that social priming experiments yield larger effects than semantic priming experiments because the manipulations are bolder. But if this were true, wouldn’t we expect social priming effects to replicate more often? After all, semantic priming effects do; they are weatherproof, whereas the supposedly bold social-behavioral effects appear sensitive to such things as weather conditions (which Dijksterhuis lists as a moderator).

Andrew Gelman made an excellent point in response to my previous post: that false positive is actually not a very appropriate term. He suggests an alternative phrasing: overestimating the effect size. This seems a constructive perspective on social-behavioral priming without any negative connotations. Earlier studies provide inflated estimations of the size of social-behavioral priming effects.

A less defensive and more constructive response by priming researchers might therefore be: “Yes, the critics have a point. Our earlier studies may have indeed overestimated the effect sizes. Nevertheless, the notion of social-behavioral priming is theoretically plausible, so we need to develop better experiments, pre-register our experiments, and perform cross-lab replications to convince ourselves and our critics of the viability of social-behavioral priming as a theoretical construct.“

In his description of Cargo Cult Science, Richard Feynman stresses the need for researchers to be self-critical: We've learned from experience that the truth will come out. Other experimenters will repeat your experiment and find out whether you were wrong or right. Nature's phenomena will agree or they'll disagree with your theory. And, although you may gain some temporary fame and excitement, you will not gain a good reputation as a scientist if you haven't tried to be very careful in this kind of work. And it's this type of integrity, this kind of care not to fool yourself, that is missing to a large extent in much of the research in Cargo Cult Science.

It is in the interest of the next generation of priming researchers (just to mention one important group) to be concerned about the many nonreplications (coupled with the large effect sizes and small samples that are characteristic of social-behavioral priming experiments). The lesson is that the existing paradigms are not going to yield further insight and ought to be abandoned. After all, they may have led to overestimated priming effects.

I’m reminded of the Smeesters case again. Smeesters had published a paper in which he had performed variations on the professor-prime effect, reporting large effects (the effects that prompted my incredulity). This paper has now been retracted. One of his graduate students had performed yet another variation on the professor-prime experiment; she found complete noise. When we examined her raw data, the pattern was nothing like the pattern Uri Simonsohn had uncovered in Smeesters’ own data. When confronted with the discrepancy between the two data sets, Smeesters gave the defense we see echoed in the social-behavioral priming defense discussed here: that experiment was completely different from my experiments (he did not specify how), so of course no effect was found.

There is reason to worry that defensive responses about replication failures will harm the next generation of social-behavioral priming researchers because these young researchers will be misled into placing much more confidence in a research paradigm than is warranted. Along the way they will probably waste a lot of valuable time, face lots of disappointments, and might even face the temptation of questionable research practices. They deserve better. 

Sunday, January 12, 2014

Escaping from the Garden of Forking Paths

My previous post was prompted by a new paper by Andrew Gelman and Eric Loken (GL) but it did not discuss its main thrust because I had planned to defer that discussion to the present post. However, several comments on the previous post (by Chris Chambers and Andrew Gelman himself) leapt ahead of the game and so there already is an entire discussion in the comment section of the previous post about the topic of our story here. But I’m putting the pedal to the metal to come out in front again.

Simply put, GL’s basic claim is that researchers often unknowingly create false positives. Or, in their words: it is possible to have multiple potential comparisons, in the sense of a data analysis whose details are highly contingent on data, without the researcher performing any conscious procedure of fishing or examining multiple p-values.

[Image caption: My copy of the Dutch Translation]
Here is one way in which this might work. Suppose we have a hypothesis that two groups differ from each other and we have two dependent measures. What constitutes evidence for our hypothesis? If the hypothesis is not more specific than that, we could be tempted to interpret a main effect as evidence for the hypothesis. If we find an interaction with the two groups differing on only one of the two measures, then we would also count that as evidence. So now we actually had three bites at the apple but we’re working under the assumption that we only had one. And this is all because our hypothesis was rather unspecific.

GL characterize the problem succinctly: There is a one-to-many mapping from scientific to statistical hypotheses. I would venture to guess that this form of inadvertent p-hacking is extremely common in psychology, perhaps especially in applied areas, where the research is less theory-driven than in basic research. The researchers may not be deliberately p-hacking, but they’re increasing the incidence of false positives nonetheless.
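The inflation that comes with this one-to-many mapping can be made concrete with a small simulation (Python with NumPy and SciPy; the design and sample size are my own illustrative choices, not taken from GL's paper). Two groups, two dependent measures, no true effect anywhere; the vague hypothesis counts as "supported" if the group difference is significant on the composite score or on either individual measure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def forked_test(n_per_group=30):
    """One null experiment: two groups, two measures, no true effect.
    Returns True if ANY of the three 'bites at the apple' reaches p < .05:
    the group difference on the composite, on measure 1, or on measure 2."""
    g1 = rng.standard_normal((n_per_group, 2))
    g2 = rng.standard_normal((n_per_group, 2))
    pvals = [
        stats.ttest_ind(g1.mean(axis=1), g2.mean(axis=1)).pvalue,  # "main effect"
        stats.ttest_ind(g1[:, 0], g2[:, 0]).pvalue,                # measure 1 only
        stats.ttest_ind(g1[:, 1], g2[:, 1]).pvalue,                # measure 2 only
    ]
    return min(pvals) < 0.05

n_sims = 2000
rate = sum(forked_test() for _ in range(n_sims)) / n_sims
print(f"False positive rate with three bites at the apple: {rate:.3f}")
```

Each individual test honors alpha = .05, yet the family-wise rate comes out well above .05 (roughly twice as high in this particular setup), even though the researcher only ever "ran one study" and never consciously fished.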

In his comment on the previous post, Chris Chambers argues that this constitutes a form of HARKing (Hypothesizing After the Results are Known). This is true. However, this is a very subtle form of HARKing. The researcher isn’t thinking well, I really didn’t predict this but Daryl Bem has told me (pp. 2-3) that I need to go on a fishing expedition in the data, so I’ll make it look like I’d predicted this pattern all along. The researcher is simply working with a hypothesis that is consistent with several potential patterns in the data.

GL noted that articles that they had previously characterized as the product of fishing expeditions might actually have a more innocuous explanation, namely inadvertent p-hacking. In the comments on my previous post, Chris Chambers took issue with this conclusion. He argued that GL looked at the study, and the field in general, through rose-tinted glasses.

The point of my previous post was that, from the published results of a single study, we often cannot reverse-engineer the processes that generated them. We cannot know for sure whether the authors of the studies initially accused by Gelman of having gone on a fishing expedition really cast out their nets or whether they arrived at their results in the innocuous way GL describe in their paper, although GL now assume it was the latter. Chris Chambers may be right when he says this picture is on the rosy side. My point, however, is that we cannot know given the information provided to us. There often simply aren’t enough constraints to make inferences about the procedures that have led to the results of a single study.

However, I take something different from the GL paper. Even though we cannot know for sure whether a particular set of published results was the product of deliberate or inadvertent p-hacking, it seems extremely likely that, overall, many researchers fall prey to inadvertent p-hacking. This is a source of false positives that we as researchers, reviewers, editors, and post-publication reviewers need to guard against. Even if researchers are on their best behavior, they still might produce false positives. GL provide a suggestion to remedy the problem, namely pre-registration, but point out that this may not always be an option in applied research. It is, however, an option in experimental research.

GL have very aptly named their article after a story by the Argentinean writer Jorge Luis Borges (who happens to be one of my favorite authors): The Garden of Forking Paths. As is characteristic of Borges, the story contains the description of another story. The embedded story describes a world where an event does not lead to a single outcome; rather, all of its possible outcomes materialize at the same time. And then the events multiply at an alarming rate as each new event spawns a plethora of other ones.

I found myself in a kind of garden of forking paths of my own when my previous post drew both responses to that post itself and responses I had expected to come only after this one. I'm not sure it will be as easy for the field to escape from the garden as it was for me here, but we should definitely try.

Thursday, January 9, 2014

Donald Trump’s Hair and Implausible Patterns of Results

In the past few years, a set of new terms has become common parlance in post-publication discourse in psychology and other social sciences: sloppy science, questionable research practices, researcher degrees of freedom, fishing expeditions, and data that are too-good-to-be-true. An excellent new paper by Andrew Gelman and Eric Loken takes a critical look at this development. The authors point out that they regret having used the term fishing expedition in a previous article that contained critical analyses of published work.

The problem with such terminology, they assert, is that it implies conscious actions on the part of the researchers even though—as they are careful to point out—the people who have coined, or are using, those terms (this includes me) may not think in terms of conscious agency. The main point Gelman and Loken make in the article is that there are various ways in which researchers can unconsciously inflate effects. I will write more about this in a later post. I want to focus on the nomenclature issue here. Gelman and Loken are right that despite the post-publication reviewers’ best intentions, the terms they use do evoke conscious agency.

We need to distinguish between post-publication review and ethics investigations in this regard, as these activities have different goals. Scientific integrity committees are charged with investigating the potential wrongdoings of scientists; they need to reverse-engineer behavior from the information at their disposal (published data, raw data, interviews with the researcher, their collaborators, and so on). Post-publication review is not about research practices. It is about published results and the conclusions that can or cannot be drawn from them.

If we accept this division of labor, then we need to agree with Gelman and Loken that the current nomenclature is not well suited for post-publication review. Actions cannot be unambiguously reverse-engineered from the published data. Let me give a linguistic example to illustrate. Take the sentence “Visiting relatives can be frustrating.” Without further context, it is impossible to know which process has given rise to this utterance. The sentence contains a standing ambiguity: any Chomskyan linguist will tell you that it has one surface structure (the actual sentence) and two deep structures (meanings). The sentence can mean that it is frustrating to visit relatives or that it is frustrating when they are visiting you. There is no way to tell which deep structure has given rise to this surface structure.

It is the same with published data. Are the results the outcome of a stroke of luck, optional stopping, selective removal of data, selective reporting, an honest error, or outright fraud? This is often difficult to tell and probably not something that ought to be discussed in post-publication discourse anyway.

So the problem is that the current nomenclature generally brings to mind agency. Take sloppy science. It implies that the researcher has failed to exert an appropriate amount of care and attention; science itself cannot be sloppy. As Gelman and Loken point out, p-hacking is not necessarily intended to mean that someone deliberately bent the rules (and, in fact, their article is about how researchers unwittingly inflate the effects they report; more about this interesting idea in a later post). However, the verb implies actions on the part of the researcher; it is not a description of the results of a study. The same is true, of course, of fishing expedition. It is the researchers who are going on a fishing expedition; it is not the data that have cast their lines. Questionable research practices is obviously a statement about the researcher, as is researcher degrees of freedom.

But how about too-good-to-be-true? Clearly this qualifies as a statement about the data and not about the researcher. Uri Simonsohn used it to describe the data of Dirk Smeesters, and the Scientific Integrity Committee I chaired adopted this characterization as well. Still, it has a distinctly negative connotation. Frankly, the first thing I think of when I hear too-good-to-be-true is Donald Trump’s hair. And let’s face it: no researcher on this planet wants to be associated—however remotely—with Donald Trump’s hair.

What we need for post-publication review is a term that does not imply agency or refer to the researcher—we cannot reverse engineer behavior from the published data—and that does not have a negative connotation. A candidate is implausible pattern of results (IPR). Granted, researchers will not be overjoyed when someone calls their results implausible but the term does not imply any wrongdoing on their part and yet does express a concern about the data.

But who am I to propose a new nomenclature? If readers of this blog have better suggestions, I’d love to hear them.