Tuesday, May 16, 2017

Sometimes You Can Step into the Same River Twice

A recurring theme in the replication debate is the argument that certain findings don’t replicate or cannot be expected to replicate because the context in which the replication is carried out differs from the one in which the original study was performed. This argument is usually made after a failed replication.

In most such cases, the original study did not specify the conditions under which the effect was predicted to hold, although the original paper often did make grandiose claims about the effect’s relevance to a variety of contexts, including industry, politics, education, and beyond. If you fail to replicate such an effect, it's a bit like having bought a car that the salesman touted as an "all-terrain vehicle," only to have the wheels come off as soon as you drive it off the lot.*

As this automotive analogy suggests, the field has two problems: many effects (1) do not replicate and (2) are grandiosely oversold. Dan Simons, Yuichi Shoda, and Steve Lindsay have recently made a proposal that offers a practical solution to the overselling problem: researchers need to include in their paper a statement that explicitly identifies and justifies the target populations for the reported findings, a constraints on generality (COG) statement. Researchers also need to state whether they think the results are specific to the stimuli that were used and to the time and location of the experiment. Requiring authors to be specific about the constraints on generality is a good idea. You probably wouldn't have bought the car if the salesman had told you its performance did not extend beyond the lot.

A converging idea is to systematically examine which contextual changes might affect which (types of) findings. Here is one example. We always assume that subjects are completely naïve with regard to an experiment, but how can we be sure? On the surface, this is primarily a problem that vexes online research using platforms such as Mechanical Turk, where subjects discuss experiments on dedicated forums. But even with the good old lab experiment we cannot always be sure that our subjects are naïve to the experiment, especially when we try to replicate a famous experiment. If subjects are not blank slates with regard to an experiment, the population has changed relative to the original experiment: we've gone from sampling from a population of completely naïve subjects to sampling from one with an unknown percentage of repeat subjects.

Jesse Chandler and colleagues recently examined whether prior participation in experiments affects effect sizes. They tested subjects on a number of behavioral economics tasks (such as sunk cost and anchoring and adjustment) and then retested these same individuals a few days later. Chandler et al. found an estimated 25% reduction in effect size, suggesting that the subjects’ prior experience with the experiment did indeed affect their performance in the second wave. A typical characteristic of these experiments is that they require reasoning, which is a controlled process. What about tasks that tap more into automatic processing?

To examine this question, my colleagues and I tested nine well-known effects in cognitive psychology: three from the domain of perception/action, three from memory, and three from language. We tested our subjects in two waves, the second wave three days after the first. In addition, we used either the exact same stimulus set or a different set (with the same characteristics, of course).

As we expected, all effects replicated easily in an online environment. More importantly, in contrast to Chandler and colleagues' findings, repeated participation did not lead to a reduction in effect size in our experiments. Nor did it make a difference whether the exact same stimuli or a different set was used.

Maybe you think that this is not a surprising set of findings. All I can say is that before running the experiments, our preregistered prediction was that we would obtain a small reduction in effect sizes (smaller than the 25% of Chandler et al.). So we were at least a little surprised to find no reduction.

A couple of questions are worth considering. First, do the results indicate that the initial participation left no impression whatsoever on the subjects? No, we cannot say this. In some of the response-time experiments, for example, responses were faster in wave 2 than in wave 1. However, because the responses also became less variable, the effect size did not change appreciably. A simple way to put it would be to say that the subjects became better at performing the task (as they perceived it) but remained equally sensitive to the manipulation. In other cases, such as the simple perception/action tasks, responses did not speed up, presumably because subjects were already performing at asymptote.
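
Why faster but less variable responses can leave a standardized effect size untouched is easy to see with a bit of arithmetic. The sketch below assumes a within-subject effect size (d_z: the mean of the per-subject condition differences divided by their standard deviation) and uses invented difference scores, not data from our study. If every difference shrinks by the same factor in wave 2, numerator and denominator shrink together and d_z stays put.

```python
import statistics


def cohens_dz(differences):
    """Within-subject effect size d_z: mean of the per-subject condition
    differences divided by the standard deviation of those differences."""
    return statistics.mean(differences) / statistics.stdev(differences)


# Hypothetical per-subject RT differences (e.g., slow minus fast condition, in ms)
# for six subjects in wave 1. These numbers are invented for illustration.
wave1 = [55.0, 20.0, 70.0, 35.0, 10.0, 50.0]

# Hypothetical wave 2: subjects are faster and less variable, modelled here
# (an assumption) as every difference shrinking to 75% of its wave-1 value.
wave2 = [0.75 * d for d in wave1]

print("raw effect (ms):", statistics.mean(wave1), "->", statistics.mean(wave2))
print("d_z:", round(cohens_dz(wave1), 3), "->", round(cohens_dz(wave2), 3))
```

The raw effect drops by a quarter, yet d_z is identical in both waves. Real data will not scale this neatly, but the example shows how practice can change response times without changing sensitivity to the manipulation.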

Second, how non-naïve were our subjects in wave 1? We have no guarantee that they were completely naïve with regard to our experiments. What our data do show, though, is that the nine effects replicate in an online environment (wave 1) and that repeating the experiment a mere three days later (wave 2), by the same research group, does not reduce the effect size.

So, in this sense, you can step into the same river twice.

* Automotive metaphors are popular in the replication debate; see also this opinion piece in Collabra: Psychology by Simine Vazire.


Monday, May 8, 2017

Concurrent Replication

I’m working on a paper with Alex Etz, Rich Lucas, and Brent Donnellan. We had to cut 2,000 words and the text below is one of the darlings we killed. I’m reviving it as a blog post here because even though it made sense to cut the segment from the manuscript (I cut it myself, the others didn’t make me), the notion of concurrent replication is an important one.

The current replication debate has, for various reasons, construed replication as a retrospective process: a research group decides to replicate a finding that is already in the published literature. Some of the most high-profile replication studies have focused on findings published decades earlier, for example the registered replication projects on verbal overshadowing (Alogna et al., 2014) and facial feedback (Wagenmakers et al., 2016). This retrospective approach, however timely and important, might be partly responsible for the controversial reputation that replication currently enjoys.
A form of replication that has so far received little attention is what I will call concurrent replication. The basic idea is this. A research group formulates a hypothesis that they want to test. At the same time, they want some reassurance about the reliability of the finding they expect to obtain, so they team up with another research group. They provide this group with a protocol for the experiment, the program and stimuli to run the experiment, and the code for the statistical analysis of the data. The experiment is preregistered. Both groups then run the experiment and analyze the data independently. The results of both studies are included in the article, along with a meta-analysis of the results. This is the simplest variant; a concurrent replication effort can involve more groups of researchers.
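The post does not specify which meta-analytic model the two groups should use. A minimal sketch, assuming each lab reports a standardized effect size with its sampling variance, is the inverse-variance weighted (fixed-effect) average below; the function name and all numbers are hypothetical.

```python
import math


def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted (fixed-effect) pooled estimate with a 95% CI.

    effects   -- one standardized effect size per lab (e.g., Cohen's d)
    variances -- the sampling variance of each of those effect sizes
    """
    weights = [1.0 / v for v in variances]  # more precise labs count more
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))      # standard error of the pooled effect
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical results: the initiating lab and the replicating lab.
effects = [0.45, 0.30]
variances = [0.02, 0.025]
pooled, (lo, hi) = fixed_effect_meta(effects, variances)
print(f"pooled d = {pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A fixed-effect model fits the simplest variant, where both labs run the identical preregistered protocol. When many labs take part, a random-effects model, which allows the true effect to differ somewhat between labs, may be the more defensible choice.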
A direct exchange of experiments (a straight “study swap”) is the simplest model of concurrent replication. Such study swaps can also be organized on a larger scale, with participating labs offering and requesting subject hours. This would likely result in a network of labs, each potentially engaged simultaneously in forming and testing novel hypotheses and in concurrently replicating hypotheses formed by other labs. The Open Science Framework hosts a site that was recently developed to facilitate concurrent replication, StudySwap; see also this article. At the time of this writing, there are four projects listed on StudySwap. We hope this number will increase soon.
Aside from this, there are already several large-scale concurrent replication efforts. An example is the Pipeline Project, a systematic effort to conduct prepublication replications, independently performed by separate labs. The first instalment was recently published (Schweinsberg et al., 2016) and a second project is underway.
Concurrent replication has several advantages. First, researchers have a better sense of the reliability of their findings prior to publication. After all, the results have been independently replicated before the article is submitted. Likewise, journal editors and reviewers can have more confidence in the findings reported in the manuscripts they are asked to evaluate. Journals have the luxury of publishing findings that have already been independently replicated. As a result, the reproducibility of the findings in the literature should start to increase. The Schweinsberg et al. (2016) study demonstrates that concurrent replication is not only possible but also useful.
Concurrent replication forces researchers to be explicit about the procedure by which they expect to obtain the effect. If they do indeed obtain the finding both in the original study and in an independent replication, they have what amounts to a scientific finding according to the criteria established by Popper: they can describe a procedure by which the finding can reliably be produced. It will be easy and natural to include the protocol in the method section of the article. A positive side effect will be a marked improvement in the quality of method sections in the literature. As a result, researchers who want to build on these findings will have two advantages that researchers currently do not enjoy. First, they can build on a firmer foundation, because the reported finding has already been independently replicated. Second, they do not have to laboriously reconstruct a replication recipe; it is readily available in the article.
Of course, concurrent replication is not without challenges. For instance, how should authorship be determined under such an arrangement? A flexible approach is best here. At one extreme, the original group’s hypothesis might be very close to the replicating group’s own interests. In this case it would be logical to make members of both groups co-authors; each group may have something to add to the paper, both in terms of data and analysis and in terms of theory. At the other extreme, the second group has no direct interest in the hypothesis but may be willing to run a replication, perhaps in exchange for a replication of one of their own experiments. In this case it might be sufficient to acknowledge the other group’s involvement without offering co-authorship.
So far, the discussion has only involved a scenario in which the hypothesis is supported in both the initiating and the replicating lab. However, other scenarios are also possible. The second scenario is one in which the hypothesis is supported in one of the labs but not in the other. If the meta-analysis shows heterogeneity among the findings, researchers might hypothesize about a potential difference between the experiments, preregister that hypothesis, and test it, again with a direct replication. If the meta-analysis does not show heterogeneity, they might decide that it is sufficient to report the meta-analytic effect. In the third scenario, neither lab shows the effect. The research groups might then report the results without engaging in follow-up studies. Alternatively, they might decide that the experimental procedure was suboptimal, revise it, preregister the new experiment, and run it, along with one or more concurrent replications.
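The text leaves open how heterogeneity would be assessed. One common option, assumed here rather than prescribed by the post, is Cochran's Q together with the I² statistic. The sketch below handles the two-lab case; the function name and the effect sizes are hypothetical.

```python
import math


def two_lab_heterogeneity(effects, variances):
    """Cochran's Q, its p-value, and I^2 for exactly two studies.

    With two studies Q has 1 degree of freedom, and the chi-square(1)
    upper-tail probability equals erfc(sqrt(Q / 2)), so the standard
    library is enough.
    """
    if len(effects) != 2 or len(variances) != 2:
        raise ValueError("this sketch handles exactly two labs")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each lab's effect from the pooled effect
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    p_value = math.erfc(math.sqrt(q / 2.0))
    df = 1
    # I^2: share of the observed variation beyond what sampling error predicts
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, p_value, i_squared


# Hypothetical outcome: the initiating lab finds d = 0.50, the replicating
# lab finds d = 0.05, and both sampling variances are 0.02.
q, p_value, i_squared = two_lab_heterogeneity([0.50, 0.05], [0.02, 0.02])
print(f"Q = {q:.2f}, p = {p_value:.3f}, I^2 = {i_squared:.0%}")
```

In this invented example Q is significant and I² is large, which would point toward the route of hypothesizing about a difference between the experiments and testing it. With only two labs the Q test has little power, so a non-significant Q is weak evidence that the labs agree.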
To summarize, concurrent replication is an underrepresented but potentially extremely valuable form of replication. Several large-scale concurrent replication efforts are currently underway, and a platform that also facilitates smaller-scale projects is available for use. The fact that concurrent replications are often viewed positively by the field is further evidence of the importance of replication for scientific endeavors.


Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., ... Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Schweinsberg, M., et al. (2016). The pipeline project: Pre-publication independent replications of a single laboratory's research pipeline. Journal of Experimental Social Psychology, 66, 55–67.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., . . . Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.