Tuesday, December 8, 2015

Stepping in as Reviewers

Some years ago, when I served on the Academic Integrity Committee investigating allegations of fraud against Dirk Smeesters, it fell upon me to examine Word documents of some of his manuscripts (the few that were not “lost”). The “track changes” feature afforded me a glimpse of earlier versions of the manuscripts as well as of comments made by Smeesters and his co-authors. One thing that became immediately obvious was that while all authors had contributed to the introduction and discussion sections, Smeesters alone had pasted in the results sections. Sometimes, the results elicited comments from his co-authors: “Oh, I didn’t know we also collected these measures” to which Smeesters replied something like “Yeah, that’s what I routinely do.” Another comment I vividly remember is: “Wow, these results look even better than we expected. We’re smarter than we thought!” More than a little ironic in retrospect.

On the one hand I found these discoveries reassuring. I had spent many hours talking in person or via Skype with some of Smeesters’ co-authors. Their anguish was palpable and had given even me a few sleepless nights. The Word documents seemed to exonerate them. We had asked Smeesters to indicate for each study who had had access to the data, which he had dutifully done. For each study deemed problematic, he indicated he had sole access to the data and the Word documents confirmed this. I was relieved on behalf of the co-authors.

On the other hand, I found the co-authors’ lack of access to the data disturbing. You could fault them for apparently being uninterested in seeing the data and Smeesters for not sharing them. But how common is it to share data among co-authors anyway, I wondered? Smeesters obviously had his reasons for not sharing the data, but there are also far more innocent reasons why co-authors may not share data. For example, researchers may find it unpleasant to have somebody looking over their shoulder, as it might imply a perceived lack of competence on their part. “I’m a Ph.D. now, I can analyze my own data, thank you very much.” A desire not to cause offense may make co-authors reluctant to ask for the data. Sometimes, the researcher analyzing the data may have used idiosyncratic steps in the process that are not easy for others to follow. Sharing the data would be onerous for such a person because it would require making every step that is second nature to them explicit for the benefit of someone else. The perceived burdensomeness of the task could be another barrier against sharing data.

If there are barriers against sharing data among co-authors, then one might expect the barriers against sharing data with third parties, such as reviewers and other interested researchers, to be substantially higher. Indeed, this turns out to be the case in psychology, even after the turmoil the field has recently gone through.

It seems that we like to play it close to the vest where our data are concerned. But science is not a poker game. When we take a few steps back from our own concerns, this becomes clear. We need to back up our claims with data, not with a poker face. We also have a responsibility towards our fellow researchers. Sure, they may be our competitors in some respects, but together we’re in the business of knowledge acquisition. This process is greatly facilitated when there is open access to data. And finally, we have a responsibility to society at large, which funds our research.

For these reasons, I’m proud to be part of the Peer Reviewers’ Openness Initiative. The basic idea behind the Initiative is that reviewers can step in to enhance the openness of our science. They do this by pledging not to offer comprehensive review for, nor recommend the publication of, any manuscript that does not meet several minimal requirements, which you can find on the website. I’ll just highlight three of them here.

(1) The data should be made publicly available.

We just discussed this.

(2) The stimuli should be made publicly available.

Just as we all benefit from access to data, we also benefit from access to stimulus materials. I cannot speak for other areas in psychology but in cognitive psychology, the sharing of stimuli has been common for decades. Back in the day it was not possible to have a printed journal article with a 10-page appendix with stimulus materials. Authors would provide a few sample stimuli and there would be a note that the complete materials were available from the corresponding author upon request. In my experience, the stimuli were always promptly sent when requested. Since the advent of the internet, there are no physical or financial limits to posting stimuli. At least for cognitive psychologists, therefore, this second PRO requirement should not be different from what is already common practice in the area.

(3) If there are reasons why data and/or stimuli cannot be shared, these should be specified.

It is important to note here that under the PRO Initiative, reviewers provide no evaluation of these reasons. In other words, reviewers are by no means arbiters of what counts as a valid reason and what does not. The only requirement is that the reasons become part of the scientific record.

My father was a chain smoker for most of his life until he declared at one point:  “smoking is a filthy habit!” (“You can say that again!” I remember replying.) After this epiphany, my father never touched a cigarette again. I hope that the PRO Initiative will contribute to the field reaching a similar epiphany about lack of openness.  

If you’ve already had this epiphany, you may wish to sign the Initiative here.

For other views related to the Initiative, see blog posts by Richard Morey and Candice Morey.


Thursday, June 25, 2015

Diederik Stapel and the Effort After Meaning

Sir Frederic, back when professors still looked like professors.
Take a look at these sentences:

A burning cigarette was carelessly discarded.
Several acres of virgin forest were destroyed.

You could let them stand as two unrelated utterances. But that’s not what you did, right? You inferred that the cigarette caused a fire, which destroyed the forest. We interpret new information based on what we know (that burning cigarettes can cause fires) to form a coherent representation of a situation. Rather than leaving the sentences unconnected, we impose a causal connection between the events described by the sentences.

George W. Bush exploited this tendency to create coherence by continuously juxtaposing Saddam and 9-11, thus fooling three-quarters of the American public into believing that Saddam was behind the attacks, without stating this explicitly.

Sir Frederic Bartlett proposed that we are continuously engaged in an effort after meaning. This is what remembering, imagining, thinking, reasoning, and understanding are: efforts to establish coherence. We try to forge connections between what we see and what we know. Often, we encounter obstacles to coherence and we strive mightily to overcome them. 


Take for example the last episode of Game of Thrones. One of the characters, Stannis Baratheon, barely survives a battle and is shown wounded and slumped against a tree. Another character strikes at him with a sword. But right before the sword hits, there is a cut to a different scene. So is Stannis dead or not? This question is hotly debated in news groups (e.g., in this thread). The vigor of the debate is testament to people's intolerance for ambiguity and their effort after meaning.

Stannis Baratheon, will he make it or not?
The arguments pro and contra Stannis being dead are made at different levels. Some people try to resolve the ambiguity at the level of the scene. No, Stannis could not have been killed: the positioning of the characters and the tree suggests that the sword would have struck the tree rather than Stannis. Other people jump up to the level of the story world. No, Stannis cannot be dead because his arc is not complete yet. Or: yes, he is dead because there is nothing left for him to accomplish in the story—let’s face it, he even sacrificed his own daughter, so what does he have left to live for! Yet other people take the perspective of the show. No, he is not dead because so far every major character who has died on the show has been shown being killed; there are no off-screen deaths. Finally, some people take a very practical view. No, Stannis cannot be dead because the actor, Stephen Dillane, is still under contract with HBO.

The internet is replete with discussions of this type, on countless topics, from interpretations of Beatles lyrics to conspiracy theories about 9-11. All are manifestations of the effort after meaning.

Science is another case in point. In a recent interview in the Chronicle of Higher Education, Diederik Stapel tries to shed light on his own fraud by appealing to the effort after meaning:

I think the problem with some scientists […], is you’re really genuinely interested. You really want to understand what’s going on. Understanding means I want to understand, I want an answer. When reality gives you back something that’s chaos and is not easy to understand, the idea of being a scientist is that you need to dig deeper, you need to find an answer. Karl Popper says that’s what you need to be happy with — uncertainty — maybe that’s the answer. Yet we’re trained, and society expects us to give an answer.

You don’t have to sympathize with Stapel to see that he has a point here. Questionable research practices are ways to establish coherence between hypothesis and data, between different experiments, and between data and hypothesis. Omitting nonsignificant findings is a way to establish coherence between hypothesis and data and among experiments. You can also establish coherence between data and hypothesis simply by inventing a new hypothesis in light of the data and pretending it was your hypothesis all along (HARKing). And if you don’t do any of these things and submit a paper with data that don’t allow you to tell a completely coherent story, your manuscript is likely to get rejected.

So the effort after meaning is systemic in science. As Stapel says, when nature does not cooperate, there is a perception that we have failed as scientists. We have failed to come up with a coherent story and we feel the need to rectify this. Because if we don't, our work may never see the light of day.

Granted, data fabrication is taking the effort after meaning to the extreme--let’s call it the scientific equivalent of sacrificing your own daughter. Nevertheless, we would do well to acknowledge that as scientists we are beholden to the effort after meaning. The simple solution is to arrange our science such that we let the effort after meaning roam free where it is needed—in theorizing and in exploratory research—and curb it where it has no place, in confirmatory research. Preregistration is an important step toward accomplishing this.

Meanwhile, if you want to give your effort after meaning a workout, don’t hesitate to weigh in on the Stannis debate.





Thursday, May 7, 2015

p=.20, what now? Adventures of the Good Ship DataPoint

You’ve dutifully conducted a power analysis, defined your sample size, and conducted your experiment. Alas, p=.20. What now? Let’s find out.

The Good Ship DataPoint*
Perspectives on Psychological Science’s first registered replication project, RRR1, targeted verbal overshadowing, the phenomenon that describing a visual stimulus, in this case a human face, is detrimental to later recognition of that face compared to not describing it. A meta-analysis of 31 direct replications of the original finding provided evidence of verbal overshadowing: subjects who described the suspect were 16% less likely to make a correct identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually replicate) the verbal overshadowing effect for her master’s thesis by using different stimuli and a different distractor task. I’m not going to talk about the contents of the research here. I simply want to address the question that’s posed in the title of this post: p=.20, what now? Because p=.20 is what we found after having run 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1's 16%.** 

Option 1. The effect is not significant, so this conceptual replication “did not work,” let’s file drawer the sucker. This response is probably still very common but it contributes to publication bias.

Option 2. We consider this a pilot study and now perform a power analysis based on it and run a new (and much larger) batch of subjects. The old data are now meaningless for hypothesis testing. This is better than option 1 but is rather wasteful. Why throw away a perfectly good data set?

Option 3. Our method wasn’t sensitive enough. Let’s improve it and then run a new study. Probably a very common response. But it may be premature and is not guaranteed to lead to a more decisive result. And you’re still throwing away the old data (see option 1).

Liverpool FC, victorious in the 2005 Champions League final in Istanbul after overcoming a 3-0 deficit against AC Milan
Option 4. The effect is not significant, but if we also report the Bayes factor, we can at least say something meaningful about the null hypothesis and maybe get it published. This seems to be becoming more common nowadays. It is not a bad idea as such, but it is likely to get misinterpreted as: H0 is true (even by the researchers themselves). The Bayes factor tells us something about the support for a hypothesis relative to some other hypothesis given the data such as they are. And what the data are here is: too few. We found BF10 = .21, which translates to about 5 times more evidence for H0 than for H1, but this is about as meaningful as the score in a soccer match after 30 minutes of play. Sure, H0 is ahead, but H1 might well score a come-from-behind victory. There are, after all, 60 more minutes to play!

Option 5.  The effect is not significant but we’ll keep on testing until it is. Simmons et al. have provided a memorable illustration of how problematic optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate the false positive rate from 5% to 20%. Still, the intuition that it would be useful to test more subjects is a good one. And that leads us to…
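To get a feel for how quickly peeking erodes the nominal alpha level, here is a minimal simulation sketch in R. It is not Sherman’s original code; the starting sample size, step size, and maximum N are my own illustrative choices.

```r
# Minimal sketch of optional stopping with NO true effect: test after every
# batch, stop as soon as p < .05 or the maximum sample size is reached.
set.seed(1)

run_one_experiment <- function(n_start = 20, n_max = 100, step = 10) {
  x <- rnorm(n_start)  # control group, true effect = 0
  y <- rnorm(n_start)  # experimental group, true effect = 0
  repeat {
    p <- t.test(x, y)$p.value
    if (p < .05 || length(x) >= n_max) return(p < .05)
    x <- c(x, rnorm(step))  # "just a few more subjects..."
    y <- c(y, rnorm(step))
  }
}

# Proportion of "significant" results across many such experiments;
# it comes out well above the nominal 5%.
mean(replicate(5000, run_one_experiment()))
```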

Option 6. The result is ambiguous, so let’s continue testing—in a way that does not inflate the Type I error rate—until we have decisive information or we’ve run out of resources. Researchers have proposed several ways of sequential testing that do preserve the nominal error rate. Eric-Jan Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian framework, and Daniël Lakens has described sequential testing as it is performed in the medical sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998), which to date has been cited only 17 times in Google Scholar. I will report Bayes factors as well. The method described by Lakens could not be used in this case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly simple: if your p-value is >.01 and <.36, keep on testing until it crosses one of these limits.*** Frick’s simulations show that this procedure keeps the overall alpha level under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We decided to add subjects in batches of 60 (barring exclusions) so as not to overshoot and yet make our additions substantial. If DataPoint failed to reach shore before we’d reached 500 subjects, we would abandon ship.
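In code, the COAST stopping rule is just a loop with two exits. The sketch below is my own paraphrase of Frick’s rule; `get_p` and `collect_batch` are hypothetical placeholders for the statistical test and the data collection procedure of a given study.

```r
# Sketch of the COAST stopping rule (Frick, 1998): keep collecting data while
# .01 < p < .36; stop and reject H0 when p < .01, stop without rejecting when
# p > .36, and give up when the maximum sample size is reached.
coast <- function(get_p, collect_batch, data, n_max = 500,
                  lower = .01, upper = .36) {
  repeat {
    p <- get_p(data)
    if (p < lower) return(list(decision = "stop: reject H0", p = p, n = nrow(data)))
    if (p > upper) return(list(decision = "stop: do not reject H0", p = p, n = nrow(data)))
    if (nrow(data) >= n_max) return(list(decision = "abandon ship", p = p, n = nrow(data)))
    data <- rbind(data, collect_batch())  # p still between the limits: keep testing
  }
}
```

In our case, `get_p` would run the chi-square test on the identification counts accumulated so far, `collect_batch` would post another batch of roughly 60 Mechanical Turk subjects, and `n_max` would be the 500-subject point at which we had decided to abandon ship.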

Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability

Batch 2: Ntotal=202, p=.047. People who use optional stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already peeked.) We’re using COAST, however, and although the Good Ship DataPoint is in the shallows of the Rectangular Sea of Probability, it has not reached the coast. And BF10=0.6, still leaning toward H0.

Batch 3: Ntotal=258, p=.013, BF10=1.95. We’re getting encouraging reports from the crow’s nest. The DataPoint crew will likely not succumb to scurvy after all! And the BF10 now favors H1.

Batch 4: Ntotal=306, p=.058, BF10=.40. What’s this??? The wind has taken a treacherous turn and we’ve drifted away from shore. Rations are getting low--mutiny looms. And if that wasn’t bad enough, BF10 is <1 again. Discouraged but not defeated, DataPoint sails on.

Batch 5: Ntotal=359, p=.016, BF10=1.10. Heading back in the right direction again.

Batch 6: Ntotal=421, p=.015, BF10=1.17. Barely closer. Will we reach shore before we all die? We have to ration the food.

Batch 7: Ntotal=479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain would have been keelhauled. The taverns will be busy tonight.

Some lessons from this nautical exercise:

(1) More data=better.

(2) We have now successfully extended the verbal overshadowing effect, although with a smaller effect than RRR1’s: 9% after 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is encouraging but not very strong. And who knows if it will hold up? Up to this point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk. Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature but that so few people are aware of them.

Coda

In this post, I have focused on the application of COAST and have largely ignored, for didactic purposes, that this study was a conceptual replication. More about this in the next post.



Footnotes

Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.
*Starring in the role of DataPoint is the Batavia, a replica of a 17th century Dutch East Indies ship, well worth a visit.
** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects, and the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we thought these were two different kinds of responses. I computed Bayes factors using the BayesFactor package in R, using the contingencyTableBF function with sampleType = "indepMulti", fixedMargin = "rows", and priorConcentration = 1. Keeping false identifications and "not in lineup" responses separate did, however, preclude us from using one-sided tests.
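For concreteness, here is the shape of that computation in R. The counts below are invented for illustration; they are not the actual data from the experiment.

```r
# Illustrative (made-up) 2 x 3 table: rows are the describe vs. filler-task
# conditions; columns are correct identification, false identification, and
# "suspect not in lineup" responses. These are NOT the experiment's real counts.
library(BayesFactor)

counts <- matrix(c(30, 25, 19,
                   37, 22, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("describe", "filler"),
                                 c("correct", "false_id", "not_present")))

chisq.test(counts)  # the p-value tracked after each batch

contingencyTableBF(counts,
                   sampleType = "indepMulti",   # rows are independent multinomials
                   fixedMargin = "rows",
                   priorConcentration = 1)      # BF10: association vs. independence
```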
*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.

Wednesday, March 11, 2015

The End-of-Semester Effect Fallacy: Some Thoughts on Many Labs 3

The Many Labs enterprise is on a roll. This week, a manuscript reporting Many Labs 3 materialized on the already invaluable Open Science Framework. The manuscript reports a large-scale investigation, involving 20 American and Canadian research teams, into the “end-of-semester effect.”

The lore among researchers is that subjects run at the end of the semester provide useless data. Effects that are found at the beginning of the semester somehow disappear or become smaller at the end. Often this is attributed to the notion that less-motivated/less-intelligent students procrastinate and postpone participation in experiments until the very last moment. Many Labs 3 notes that there is very little empirical evidence pertaining to the end-of-semester effect.

To address this shortcoming in the literature, Many Labs 3 set out to conduct 10 replications of known effects to examine the end-of-semester effect. Each experiment was performed twice by each of the 20 participating teams: once at the beginning of the semester and once at the end of the semester, each time with different subjects, of course.

It must have been a disappointment to the researchers involved that only 3 of the 10 effects replicated (maybe more about this in a later post), but Many Labs 3 remained undeterred and went ahead to examine the evidence for an end-of-semester effect. Long story short, there was none. Or, in the words of the researchers:

It is possible that there are some conditions under which the time of semester impacts observed effects. However, it is unknown whether that impact is ever big enough to be meaningful

This made me wonder about the reasons for expecting an end-of-semester effect in the first place. Isn’t this just a fallacy born out of research practices that most of us now frown upon: running small samples, shelving studies with null effects, and optional stopping?

New projects are usually started at the beginning of a semester. Suppose the first (underpowered) study produces a significant effect. This can have multiple reasons:
(1) the effect is genuine;
(2) the researchers stopped when the effect was significant;
(3) the researchers massaged the data such that the effect was significant;
(4) it was a lucky shot;
(5) any combination of the above.

How the end-of-semester effect might come about
With this shot in the arm, the researchers are motivated to conduct a second study, perhaps with the same N and the same exclusion and outlier-removal criteria as the first study but with a somewhat different independent variable. Let’s call it a conceptual replication. If this study, for whatever reason, yields a significant effect, the researchers might congratulate themselves on a job well done and submit the manuscript.

But what if the first study does not produce a significant effect? The authors probably conclude that the idea is not worth pursuing after all, shelve the study, and move on to a new idea. If it’s still early in the semester, they could run a study to test the new idea and the process might repeat itself.

Now let’s assume the second study yields a null effect, certainly not a remote possibility. At this juncture, the authors are the proud owners of a Study 1 with an effect but are saddled with a Study 2 without an effect. How did they get this lemon? Well, of course because of those good-for-nothing numbskulled students who wait until the end of the semester before signing up for an experiment! And thus the “end-of-semester fallacy” is born.
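To see how easily this pattern arises without any real end-of-semester effect, here is a minimal simulation sketch in R. The true effect size (d = 0.3) and the sample size per cell (20) are my own illustrative assumptions, not estimates from Many Labs 3 or any other source; the effect is identical at both ends of the semester.

```r
# Two small studies per "semester", both drawn from the same population:
# the true effect does not change over the semester.
set.seed(1)
d <- 0.3   # true standardized effect, constant across the semester (assumed)
n <- 20    # subjects per cell in each underpowered study (assumed)

one_semester <- function() {
  p1 <- t.test(rnorm(n, mean = d), rnorm(n))$p.value  # "beginning of semester"
  p2 <- t.test(rnorm(n, mean = d), rnorm(n))$p.value  # "end of semester"
  c(study1_sig = p1 < .05, study2_sig = p2 < .05)
}

res    <- replicate(10000, one_semester())
study1 <- res["study1_sig", ]
study2 <- res["study2_sig", ]

# Among semesters with a "successful" Study 1 (the only ones that get pursued),
# how often does Study 2 come up empty, inviting the end-of-semester story?
mean(!study2[study1])
```

With these illustrative assumptions, the answer is “most of the time,” even though nothing about the subjects changed between the beginning and the end of the semester.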