Thursday, June 25, 2015

Diederik Stapel and the Effort After Meaning

Sir Frederic, back when professors still looked like professors.
Take a look at these sentences:

A burning cigarette was carelessly discarded.
Several acres of virgin forest were destroyed.

You could let them stand as two unrelated utterances. But that’s not what you did, right? You inferred that the cigarette caused a fire, which destroyed the forest. We interpret new information based on what we know (that burning cigarettes can cause fires) to form a coherent representation of a situation. Rather than leaving the sentences unconnected, we impose a causal connection between the events described by the sentences.

George W. Bush exploited this tendency to create coherence by continuously juxtaposing Saddam and 9-11, thus fooling three-quarters of the American public into believing that Saddam was behind the attacks, without stating this explicitly.

Sir Frederic Bartlett proposed that we are continuously engaged in an effort after meaning. This is what remembering, imagining, thinking, reasoning, and understanding are: efforts to establish coherence. We try to forge connections between what we see and what we know. Often, we encounter obstacles to coherence and we strive mightily to overcome them. 


Take, for example, the most recent episode of Game of Thrones. One of the characters, Stannis Baratheon, barely survives a battle and is shown wounded and slumped against a tree. Another character strikes at him with a sword. But right before the sword hits, there is a cut to a different scene. So is Stannis dead or not? This question is hotly debated in newsgroups (e.g., in this thread). The vigor of the debate is a testament to people's intolerance for ambiguity and their effort after meaning.

Stannis Baratheon, will he make it or not?
The arguments pro or contra Stannis being dead are made at different levels. Some people try to resolve the ambiguity at the level of the scene. No, Stannis could not have been killed: the positioning of the characters and the tree suggests that the sword would have struck the tree rather than Stannis. Other people jump up to the level of the story world. No, Stannis cannot be dead because his arc is not complete yet. Or: yes, he is dead because there is nothing anymore for him to accomplish in the story—let’s face it, he even sacrificed his own daughter, so what does he have left to live for! Yet other people take the perspective of the show. No, he is not dead because so far every major character on the show that is dead has been shown to have been killed; there are no off-screen deaths. Finally, some people take a very practical view. No, Stannis cannot be dead because the actor, Stephen Dillane, is still under contract at HBO.

The internet is replete with discussions of this type, on countless topics, from interpretations of Beatles lyrics to conspiracy theories about 9-11. All are manifestations of the effort after meaning.

Science is another case in point. In a recent interview in the Chronicle of Higher Education, Diederik Stapel tries to shed light on his own fraud by appealing to the effort after meaning:

I think the problem with some scientists […], is you’re really genuinely interested. You really want to understand what’s going on. Understanding means I want to understand, I want an answer. When reality gives you back something that’s chaos and is not easy to understand, the idea of being a scientist is that you need to dig deeper, you need to find an answer. Karl Popper says that’s what you need to be happy with — uncertainty — maybe that’s the answer. Yet we’re trained, and society expects us to give an answer.

You don’t have to sympathize with Stapel to see that he has a point here. Questionable research practices are ways to establish coherence between hypothesis and data, between different experiments, and between data and hypothesis. Omitting nonsignificant findings is a way to establish coherence between hypothesis and data and among experiments. You can also establish coherence between data and hypothesis simply by inventing a new hypothesis in light of the data and pretending it was your hypothesis all along (HARKing). And if you don’t do any of these things and submit a paper with data that don’t allow you to tell a completely coherent story, your manuscript is likely to get rejected.

So the effort after meaning is systemic in science. As Stapel says, when nature does not cooperate, there is a perception that we have failed as scientists. We have failed to come up with a coherent story and we feel the need to rectify this. Because if we don't, our work may never see the light of day.

Granted, data fabrication is taking the effort after meaning to the extreme--let’s call it the scientific equivalent of sacrificing your own daughter. Nevertheless, we would do well to acknowledge that as scientists we are beholden to the effort after meaning. The simple solution is to arrange our science such that we let the effort after meaning roam free where it is needed—in theorizing and in exploratory research—and curb it where it has no place, in confirmatory research. Preregistration is an important step toward accomplishing this.

Meanwhile, if you want to give your effort after meaning a workout, don’t hesitate to weigh in on the Stannis debate.





Thursday, May 7, 2015

p=.20, what now? Adventures of the Good Ship DataPoint

You’ve dutifully conducted a power analysis, determined your sample size, and run your experiment. Alas, p=.20. What now? Let’s find out.

The Good Ship DataPoint*
Perspectives on Psychological Science’s first registered replication project, RRR1, targeted verbal overshadowing, the phenomenon that describing a visual stimulus, in this case a human face, is detrimental to later recognition of that face compared to not describing the stimulus. A meta-analysis of 31 direct replications of the original finding provided evidence of verbal overshadowing: subjects who described the suspect were 16% less likely to make a correct identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually replicate) the verbal overshadowing effect for her master’s thesis by using different stimuli and a different distractor task. I’m not going to talk about the contents of the research here. I simply want to address the question that’s posed in the title of this post: p=.20, what now? Because p=.20 is what we found after having run 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1's 16%.** 

Option 1. The effect is not significant, so this conceptual replication “did not work,” let’s file drawer the sucker. This response is probably still very common but it contributes to publication bias.

Option 2. We treat this as a pilot study, perform a power analysis based on it, and run a new (and much larger) batch of subjects. The old data are now meaningless for hypothesis testing. This is better than option 1 but is rather wasteful. Why throw away a perfectly good data set?

Option 3. Our method wasn’t sensitive enough. Let’s improve it and then run a new study. Probably a very common response. But it may be premature and is not guaranteed to lead to a more decisive result. And you’re still throwing away the old data (see option 1).

Liverpool FC, victorious in the 2005 Champions League final in Istanbul after overcoming a 3-0 deficit against AC Milan
Option 4. The effect is not significant, but if we also report the Bayes factor, we can at least say something meaningful about the null hypothesis and maybe get it published. This seems to be becoming more common nowadays. It is not a bad idea as such, but it is likely to be misinterpreted (even by the researchers themselves) as: H0 is true. The Bayes factor tells us something about the support for a hypothesis relative to some other hypothesis given the data such as they are. And what the data are here is: too few. We found BF10 = .21, which translates to about 5 times more evidence for H0 than for H1, but this is about as meaningful as the score in a soccer match after 30 minutes of play. Sure, H0 is ahead, but H1 might well score a come-from-behind victory. There are, after all, 60 more minutes to play!

Option 5. The effect is not significant, but we’ll keep on testing until it is. Simmons et al. have provided a memorable illustration of how problematic optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate the false positive rate from 5% to 20% (the simulation sketch below makes the same point). Still, the intuition that it would be useful to test more subjects is a good one. And that leads us to…

Option 6. The result is ambiguous, so let’s continue testing—in a way that does not inflate the Type I error rate—until we have decisive information or we've run out of resources. Researchers have proposed several ways of sequential testing that do preserve the nominal error rate. Eric-Jan Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian framework, and Daniël Lakens has described sequential testing as it is performed in the medical sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998), which to date has been cited only 17 times in Google Scholar. I will report Bayes factors as well. The method described by Lakens could not be used in this case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly simple: if your p-value is >.01 and <.36, keep on testing until it crosses one of these limits.*** Frick’s simulations show that this procedure keeps the overall alpha level under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We decided to add subjects in batches of 60 (barring exclusions) so as not to overshoot and yet make our additions substantial. If DataPoint failed to reach shore before we'd reached 500 subjects, we would abandon ship.
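To make the contrast between naive optional stopping (option 5) and COAST concrete, here is a minimal Monte Carlo sketch in R under the null hypothesis. The design details (a 2 x 2 identification table, 74 subjects per condition at the first look, 30 per condition per batch, a ceiling of 500 subjects in total) are assumptions loosely modeled on our setup, not the analysis we actually ran.

```r
# Monte Carlo sketch under H0: a chi-square test on a 2 x 2 identification table,
# checked after every batch. All numbers are illustrative assumptions.
set.seed(2015)
sequential_run <- function(lower, upper, n_start = 74, n_step = 30,
                           n_max = 250, p_correct = .5) {
  n <- n_start                               # subjects per condition so far
  hits_a <- rbinom(1, n, p_correct)          # correct IDs, describe condition
  hits_b <- rbinom(1, n, p_correct)          # correct IDs, control condition
  repeat {
    tab <- rbind(c(hits_a, n - hits_a), c(hits_b, n - hits_b))
    p <- suppressWarnings(chisq.test(tab)$p.value)
    if (p < lower) return("significant")     # reached the lower coast
    if (p > upper) return("not significant") # reached the upper coast
    if (n >= n_max) return("undecided")      # resources exhausted: abandon ship
    hits_a <- hits_a + rbinom(1, n_step, p_correct)
    hits_b <- hits_b + rbinom(1, n_step, p_correct)
    n <- n + n_step
  }
}
# Option 5, naive optional stopping: stop as soon as p < .05 at any look.
mean(replicate(5000, sequential_run(lower = .05, upper = 1)) == "significant")
# COAST: keep testing while .01 < p < .36.
mean(replicate(5000, sequential_run(lower = .01, upper = .36)) == "significant")
```

If all goes as theory predicts, the first proportion lands noticeably above the nominal .05 and the second at or below it, which is precisely Frick's point.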

Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability

Batch 2: Ntotal=202, p=.047. People who use optional stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already peeked.) We’re using COAST, however, and although the good ship DataPoint is in the shallows of the Rectangular Sea of Probability, it has not reached the coast. And BF10=0.6, still leaning toward H0.

Batch 3: Ntotal=258, p=.013, BF10=1.95. We’re getting encouraging reports from the crow’s nest. The DataPoint crew will likely not succumb to scurvy after all! And the BF10 now favors H1.

Batch 4: Ntotal=306, p=.058, BF10=.40. What’s this??? The wind has taken a treacherous turn and we’ve drifted away from shore. Rations are getting low--mutiny looms. And if that wasn’t bad enough, BF10 is <1 again. Discouraged but not defeated, DataPoint sails on.

Batch 5: Ntotal=359, p=.016, BF10=1.10. Heading back in the right direction again.

Batch 6: Ntotal=421, p=.015, BF10=1.17. Barely closer. Will we reach shore before we all die? We have to ration the food.

Batch 7: Ntotal=479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain would have been keelhauled. The taverns will be busy tonight.

Some lessons from this nautical exercise:

(1) More data=better.

(2) We have now successfully extended the verbal overshadowing effect, although we found a smaller effect than RRR1's 16%: 9% after 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is encouraging but not very strong. And who knows if it will hold up? Up to this point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk. Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature but that so few people are aware of them.

Coda

In this post, I have focused on the application of COAST and largely ignored, for didactic purposes, that this study was a conceptual replication. More about this in the next post.



Footnotes

Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.
*Starring in the role of DataPoint is the Batavia, a replica of a 17th century Dutch East Indies ship, well worth a visit.
** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects, and the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we thought these were two different kinds of responses; separating false alarms from misses in this way precluded us, however, from using one-sided tests. I computed Bayes factors using the BayesFactor package in R, with the contingencyTableBF function and sampleType = "indepMulti", fixedMargin = "rows", priorConcentration = 1 (see the sketch after these footnotes).
*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.
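For readers who want to see the computations from footnote ** spelled out, here is a minimal sketch; the counts in the table are hypothetical and stand in for the real data, which I am not reporting here.

```r
# Sketch of the analyses described in footnote **; the counts are hypothetical.
library(BayesFactor)

# 2 x 3 table: rows = condition, columns = type of response
tab <- matrix(c(40, 24, 10,    # describe condition: correct ID, false ID, "not present"
                52, 14,  8),   # control condition:  correct ID, false ID, "not present"
              nrow = 2, byrow = TRUE,
              dimnames = list(condition = c("describe", "control"),
                              response  = c("correct", "false ID", "not present")))

chisq.test(tab)                # the frequentist p-value

bf <- contingencyTableBF(tab, sampleType = "indepMulti",
                         fixedMargin = "rows", priorConcentration = 1)
bf                             # BF10: evidence for an association over independence
1 / extractBF(bf)$bf           # its reciprocal: evidence for H0 over H1
```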

Wednesday, March 11, 2015

The End-of-Semester Effect Fallacy: Some Thoughts on Many Labs 3

The Many Labs enterprise is on a roll. This week, a manuscript reporting Many Labs 3 materialized on the already invaluable Open Science Framework. The manuscript reports a large-scale investigation, involving 20 American and Canadian research teams, into the “end-of-semester effect.”

The lore among researchers is that subjects run at the end of the semester provide useless data. Effects that are found at the beginning of the semester somehow disappear or become smaller at the end. Often this is attributed to the notion that less-motivated/less-intelligent students procrastinate and postpone participation in experiments until the very last moment. Many Labs 3 notes that there is very little empirical evidence pertaining to the end-of-semester effect.

To address this shortcoming in the literature, Many Labs 3 set out to conduct 10 replications of known effects to examine the end-of-semester effect. Each experiment was performed twice by each of the 20 participating teams: once at the beginning of the semester and once at the end of the semester, each time with different subjects, of course.

It must have been a disappointment to the researchers involved that only 3 of the 10 effects replicated (maybe more about this in a later post), but Many Labs 3 remained undeterred and went ahead and examined the evidence for an end-of-semester effect. Long story short, there was none. Or in the words of the researchers:

It is possible that there are some conditions under which the time of semester impacts observed effects. However, it is unknown whether that impact is ever big enough to be meaningful

This made me wonder about the reasons for expecting an end-of-semester effect in the first place. Isn’t this just a fallacy born out of research practices that most of us now frown upon: running small samples, shelving studies with null effects, and optional stopping?

New projects are usually started at the beginning of a semester. Suppose the first (underpowered) study produces a significant effect. This can happen for any of several reasons:
(1) the effect is genuine;
(2) the researchers stopped when the effect was significant;
(3) the researchers massaged the data such that the effect was significant;
(4) it was a lucky shot;
(5) any combination of the above.

How the end-of-semester effect might come about
With this shot in the arm, the researchers are motivated to conduct a second study, perhaps with the same N and exclusionary and outlier-removal criteria as the first study but with a somewhat different independent variable. Let’s call it a conceptual replication. If this study, for whatever reason, yields a significant effect, the researchers might congratulate themselves on a job well done and submit the manuscript.

But what if the first study does not produce a significant effect? The authors probably conclude that the idea is not worth pursuing after all, shelve the study, and move on to a new idea. If it’s still early in the semester, they could run a study to test the new idea and the process might repeat itself.

Now let’s assume the second study yields a null effect, certainly not a remote possibility. At this juncture, the authors are the proud owners of a Study 1 with an effect but are saddled with a Study 2 without an effect. How did they get this lemon? Well, of course because of those good-for-nothing numbskulled students who wait until the end of the semester before signing up for an experiment! And thus the “end-of-semester fallacy” is born.
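A quick simulation shows how little is needed to produce this pattern. The true effect size and the per-cell sample size below are numbers I made up for illustration; crucially, the effect is exactly the same at both ends of the semester.

```r
# Sketch: with low power, a significant Study 1 is often followed by a
# nonsignificant Study 2 even though the true effect never changes.
set.seed(3)
d      <- 0.3      # assumed true standardized effect, constant all semester
n      <- 25       # assumed subjects per cell in each study
n_sims <- 10000
p1 <- replicate(n_sims, t.test(rnorm(n, mean = d), rnorm(n))$p.value)  # Study 1
p2 <- replicate(n_sims, t.test(rnorm(n, mean = d), rnorm(n))$p.value)  # Study 2
mean(p1 < .05)                               # power of Study 1: low (around .17)
mean(p1 < .05 & p2 >= .05) / mean(p1 < .05)  # P(Study 2 "fails" | Study 1 "worked")
```

That conditional probability comes out above .80 here: a lemon of a Study 2 is the expected outcome, no numbskulled late-semester students required.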




Thursday, February 26, 2015

Can we Live without Inferential Statistics?

The journal Basic and Applied Social Psychology (BASP) has taken a resolute and bold step. A recent editorial announces that it has banned the reporting of inferential statistics. F-values, t-values, p-values and the like have all been declared personae non gratae. And so have confidence intervals. Bayes factors are not exactly banned but aren’t welcomed with open arms either; they are eyed with suspicion, like a mysterious traveler in a tavern.

There is a vigorous debate in the scientific literature and in the social media about the pros and cons of Null Hypothesis Significance Testing (NHST), confidence intervals, and Bayesian statistics (making researchers in some frontier towns quite nervous). The editors at BASP have seen enough of this debate and have decided to do away with inferential statistics altogether. Sure, you're allowed to submit a manuscript that’s loaded with p-values and statements about significance or the lack thereof, but they will be rigorously removed, like lice from a schoolchild’s head.

The question is whether we can live with what remains. Can we really conduct science without summary statements? Because what does the journal offer in their place? It requires strong descriptive statistics, distributional information, and larger samples. These are all good things, but we need a way to summarize our results: not just so we can comprehend and interpret them better ourselves and communicate them, but also because we need to make decisions based on them as researchers, reviewers, editors, and users. Effect sizes are not banned and so will provide summary information that will be used to answer questions like:
--what will the next experiment be?
--do the findings support the hypothesis?
--has or hasn’t the finding been replicated?
--can I cite finding X as support for theory Y?*

As to that last question, you can hardly cite a result by saying "this finding supports (or does not support) the hypothesis, but here are the descriptives." The reader will want more in the way of a statistical argument or an intersubjective criterion to decide one way or the other. I have no idea how researchers, reviewers, and editors are going to cope with the new freedoms (from inferential statistics) and constraints (from not being able to use inferential statistics). But that’s actually what I like about BASP's ban. It gives rise to a very interesting real-world experiment in meta-science.

Sneaky Bayes
There are a lot of unknowns at this point. Can we really live without inferential statistics? Will Bayes sneak in through the half-open door and occupy the premises? Will no one dare to submit to the journal? Will authors balk at having their manuscripts shorn of inferential statistics? Will the interactions among authors, reviewers, and editors yield novel and promising ways of interpreting and communicating scientific results? Will the editors in a few years be BASPing in the glory of their radical decision? And how will we measure the success of the ban on inferential statistics? The wrong way to go about this would be to see whether the policy is adopted by other journals or whether the journal's impact factor rises. So how will we determine whether the ban will improve our science?

Questions, questions. But this is why we conduct experiments and this is why BASP's brave decision should be given the benefit of the doubt.

--------
Footnotes

I thank Samantha Bouwmeester and Anita Eerland for feedback on a previous version and Dermot Lynott for the Strider picture.

* Note that I’m not saying: “will the paper be accepted?” or “does the researcher deserve tenure?” 






Wednesday, January 28, 2015

The Dripping Stone Fallacy: Confirmation Bias in the Roman Empire and Beyond



What to do when the crops are failing because of a drought? Why, we persuade the Gods to send rain of course! I'll let the fourth Roman Emperor, Claudius, explain:

Derek Jacobi stuttering away as Claudius in the TV series I, Claudius
There is a black stone called the Dripping Stone, captured originally from the Etruscans and stored in a temple of Mars outside the city. We go in solemn procession and fetch it within the walls, where we pour water on it, singing incantations and sacrificing. Rain always follows--unless there has been a slight mistake in the ritual, as is frequently the case.*
                                                                
It sounds an awful lot as if Claudius is weighing in on the replication debate, coming down squarely on the side of replication critics, researchers who raise the specter of hidden moderators as soon as a non-replication materializes. Obviously, when a replication attempt omits a component that is integral to the original study (and was explicitly mentioned in the original paper), that omission borders on scientific malpractice. But hidden moderators are only invoked after the fact--they are "hidden" after all and so could by definition not have been omitted. Hidden moderators are slight mistakes or imperfections in the ritual that are only detected when the ritual does not produce the desired outcome. As Claudius would have us believe, if the ritual is performed correctly, then rain always follows. Similarly, if there are no hidden moderators, then the effect will always occur, so if the effect does not occur, there must have been a hidden moderator.**

And of course nobody bothers to look for small errors in the ritual when it is raining cats and dogs, or for hidden moderators when p<.05.

I call this the Dripping Stone Fallacy.

Reviewers (and readers) of scientific manuscripts fall prey to a mild(er) version of the Dripping Stone Fallacy. They scrutinize the method and results sections of a paper if they disagree with its conclusions and tend to give these same sections a more cursory treatment if they agree with the conclusions. Someone surely must have investigated this already. If not, it would be rather straightforward to design an experiment and test the hypothesis. One could measure the amount of time spent reading the method section and memory for it in subjects who are known to agree or disagree with the conclusions of an empirical study.

Even the greatest minds fall prey to the Dripping Stone Fallacy. As Raymond Nickerson describes: Louis Pasteur refused to accept or publish results of his experiments that seemed to tell against his position that life did not generate spontaneously, being sufficiently convinced of his hypothesis to consider any experiment that produced counterindicative evidence to be necessarily flawed.

Confirmation bias comes in many guises and the Dripping Stone Fallacy is one of them. It makes a frequent appearance in the replication debate. Granted, the Dripping Stone Fallacy didn't prevent the Romans from conquering half the world but it is likely to be more debilitating to the replication debate.


Footnotes

* Robert Graves, Claudius the God, Penguin Books, 2006, p. 172.
** This is an informal fallacy; it is formally correct (modus tollens) but is based on a false premise.