
What Can We Learn from the Many Labs Replication Project?

The first massive replication project in psychology has just reached completion (several others are to follow). A large group of researchers, which I will refer to as ManyLabs, has attempted to replicate 15 findings from the psychological literature in various labs across the world. The paper is posted on the Open Science Framework (along with the data) and Ed Yong has authored a very accessible write-up. [Update May 20, 2014, the article is out now and is open access.]

What can we learn from the ManyLabs project? The results here show the effect sizes for the replication efforts (in green and grey) as well as the original studies (in blue). The 99% confidence intervals are for the meta-analysis of the effect size (the green dots); the studies are ordered by effect size.

Let’s first consider what we canNOT learn from these data. Of the 13 replication attempts (when the first four are taken together), 11 succeeded and 2 did not (in fact, at some point ManyLabs suggests that a third one, Imagined Contact, also doesn’t really replicate). We cannot learn from this that the vast majority of psychological findings will replicate, contrary to this Science headline, which states that these findings “offer reassurance” about the reproducibility of psychological findings. As Ed Yong (@edyong209) joked on Twitter, perhaps ManyLabs has stumbled on the only 12 or 13 psychological findings that replicate! Because the 15 experiments were not a random sample of all psychology findings, and it’s a small sample anyway, the percentage is not informative, as ManyLabs duly notes.

But even if we had an accurate estimate of the percentage of findings that replicate, how useful would that be? Rather than trying to arrive at a more precise estimate, it might be more informative to follow up the ManyLabs project with projects that focus on a specific research area or topic, as I proposed in my first-ever post, as this might lead to theory advancement.

So what DO we learn from the ManyLabs project? We learn that for some experiments, the replications actually yield much larger effects than the original studies, a highly intriguing finding that warrants further analysis.

We also learn that the two social priming studies in the sample, dangling at the bottom of the list in the figure, resoundingly failed to replicate. One study found that exposure to the United States flag increases conservatism among Americans; the other study found that exposure to money increases endorsement of the current social system. The replications show that there is essentially no effect whatsoever for either of these exposures.

It is striking how far the effect sizes of the original studies (indicated by an x) are from the rest of the experiments. There they are, by their lone selves at the bottom right of the figure. Given that all of the data from the replication studies have been posted online, it would be fascinating to get the data from the original studies. Comparisons of the various data sets might shed light on why these studies are such outliers.

We also learn that the online experiments in the project yielded results that are highly similar to those produced by lab experiments. This does not mean, of course, that any experiment can be transferred to an online environment, but it certainly inspires confidence in the utility of online experiments in replication research.

Most importantly, we learn that several labs working together yield data that have an enormous evidentiary power. At the same time, it is clear that such large-scale replication projects will have diminishing returns (for example, the field cannot afford to devote countless massive replication efforts to not replicating all the social priming experiments that are out there). However, rather than using the ManyLabs approach retrospectively, we can also use it prospectively: to test novel hypotheses.

Here is how this might go.

(1) A group of researchers form a hypothesis (not by pulling it out of thin air but by deriving it from a theory, obviously).
(2) They design—perhaps via crowdsourcing—the best possible experiment.
(3) They preregister the experiment.
(4) They post the protocol online.
(5) They simultaneously carry out the experiment in multiple labs.
(6) They analyze and meta-analyze the data.
(7) They post the data online.
(8) They write a kick-ass paper.
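Step (6) can be sketched concretely. A minimal version of the meta-analysis is inverse-variance (fixed-effect) pooling of the per-lab effect sizes, with a 99% confidence interval like the ones in the ManyLabs figure. The lab effect sizes and variances below are hypothetical, and ManyLabs' own analysis is more elaborate than this sketch:

```python
import math

def fixed_effect_meta(effects, variances, z=2.576):
    """Inverse-variance weighted (fixed-effect) meta-analysis.

    effects: per-lab effect sizes (e.g., Cohen's d)
    variances: sampling variance of each lab's effect size
    z: normal quantile; 2.576 gives a 99% confidence interval
    Returns the pooled effect and its (lower, upper) CI.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled effect
    return pooled, (pooled - z * se, pooled + z * se)

# Hypothetical results from four labs running the same preregistered protocol
labs_d = [0.45, 0.52, 0.38, 0.60]
labs_var = [0.02, 0.03, 0.025, 0.04]

d_pooled, (lo, hi) = fixed_effect_meta(labs_d, labs_var)
print(f"pooled d = {d_pooled:.3f}, 99% CI [{lo:.3f}, {hi:.3f}]")
```

Because labs differ in sample size (and hence in variance), the weighting matters: larger labs pull the pooled estimate toward their result, and the combined confidence interval is far narrower than any single lab's, which is exactly the "enormous evidentiary power" of the consortium approach.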

And so I agree with the ManyLabs authors when they conclude that a consortium of laboratories could provide mutual support for each other by conducting similar large-scale investigations on original research questions, not just replications. Among the many accomplishments of the ManyLabs project, showing us the feasibility of this approach might be its major one.


  1. Thanks for the post on this, Rolf. What I was wondering when I read about the replications in Nature was whether or not these were really replications. Doesn't the very fact that they "combined tests from earlier experiments into a single questionnaire — meant to take 15 minutes to complete" mean that they did not, technically, "replicate" the original studies? They essentially created a new study (survey instrument) that contained items from prior studies. That, then, created a set of new contextual factors surrounding these questionnaire items.

    Anyway, since you've thought a lot more about this issue, I'd be interested in your interpretation.

  2. Lykken (1968) distinguished 3 types of replication:

    1. LITERAL. Exact, only the subjects and time changes (e.g., in-lab replication)
    2. OPERATIONAL. Reproduce the methods as best as possible.
    3. CONSTRUCTIVE. Replicate the theoretical construct.

    The scientific credibility awarded to a successful constructive replication is the largest of all, followed by an operational replication; least impressive, in terms of credibility awarded to a theory, is a literal replication.

    I think ManyLabs shows it is possible to conduct constructive type replications and therefore am not surprised to see variation.

    However, I do wonder about the following: There were original studies that had a power of ~99% to detect the original effect, as well as the replicated effect... there's more to power than sample size!

    By the way... The idea to use this for novel predictions... Where do I sign up? :)

  3. At the risk of sounding like a broken record, why are the two failed priming studies described as "social priming" studies? What is social about priming money or a flag?


    1. I am not an expert, but

      If you subliminally prime a semantic category in order to examine its effect on the response latency in a lexical decision task so you can build a better computational model of reading in adults, I don't think anyone would call it social priming, but semantic priming or something similar.

      If you study the effect of a prime that has acquired meaning as a symbol at the level of a nation, society or culture on the behaviour or attitudes of individuals that concern concepts that are meaningful with respect to a similar aggregate level of nation or society (voting behaviour, position in a political debate, attitude towards conservative or liberal), I think a lot of people would call that social priming.

      The effect, by any other name, would still be 0 on average in this sample.

    2. "The effect, by any other name, would still be 0 on average in this sample." :) Yes indeed! I was not arguing that point at all.

      To take your reply and push the point further, if you allow that any concept that has "acquired meaning as a symbol" to be called "social priming," then actually, your first example is social priming. In fact, all priming is social priming to the extent that language is a social activity wherein symbols come to acquire meaning within a "nation, society, or culture." That describes language perfectly and therefore any priming involving language is social priming.


  4. I guess that's what they are referred to as (by others). It's true that it's not easy to label these kinds of studies:

    1. Seems fairly easy to label them, as far as I can tell. They are, respectively, "flag priming" and "currency priming," as labeled in the original paper.

    2. Still a bunch of misnomers I guess. Neither flag nor currency was found to prime anything.

  5. One point I haven’t seen discussed but I’m wondering about: how come the effect sizes for the original studies mostly fall within a relatively narrow range? Much narrower than the range of effect sizes from the ManyLabs replication, but centered on about the same grand mean. Is that just happenstance? Is there some obvious explanation I’m missing?

