Thursday, November 28, 2013

What Can We Learn from the Many Labs Replication Project?


The first massive replication project in psychology has just reached completion (several others are to follow). A large group of researchers, which I will refer to as ManyLabs, has attempted to replicate 15 findings from the psychological literature in various labs across the world. The paper is posted on the Open Science Framework (along with the data) and Ed Yong has authored a very accessible write-up. [Update May 20, 2014: the article is out now and is open access.]

What can we learn from the ManyLabs project? The results here show the effect sizes for the replication efforts (in green and grey) as well as the original studies (in blue). The 99% confidence intervals are for the meta-analysis of the effect size (the green dots); the studies are ordered by effect size.
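
The green dots, in other words, are meta-analytic estimates obtained by pooling the per-lab results. As a rough illustration of how such a pooled estimate and its 99% confidence interval come about (a minimal sketch with made-up effect sizes and standard errors, not the ManyLabs analysis code), consider a fixed-effect, inverse-variance-weighted meta-analysis:

```python
# Minimal sketch: inverse-variance (fixed-effect) pooling of per-lab effect
# sizes and a 99% confidence interval. All numbers are hypothetical.
import numpy as np
from scipy import stats

d = np.array([0.45, 0.30, 0.52, 0.38])    # per-lab effect sizes (Cohen's d), made up
se = np.array([0.10, 0.12, 0.09, 0.11])   # per-lab standard errors, made up

w = 1.0 / se**2                           # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)      # pooled (meta-analytic) effect size
se_pooled = np.sqrt(1.0 / np.sum(w))      # standard error of the pooled estimate

z = stats.norm.ppf(0.995)                 # critical value for a two-sided 99% CI
print(f"pooled d = {d_pooled:.2f}, "
      f"99% CI = [{d_pooled - z*se_pooled:.2f}, {d_pooled + z*se_pooled:.2f}]")
```

With many labs contributing, the pooled standard error shrinks and the 99% interval becomes very narrow, which is why the replication estimates in the figure are so precise.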

Let’s first consider what we canNOT learn from these data. Of the 13 replication attempts (when the first four are taken together), 11 succeeded and 2 did not (in fact, at some point ManyLabs suggests that a third one, Imagined Contact, also doesn’t really replicate). We cannot learn from this that the vast majority of psychological findings will replicate, contrary to this Science headline, which states that these findings “offer reassurance” about the reproducibility of psychological findings. As Ed Yong (@edyong209) joked on Twitter, perhaps ManyLabs has stumbled on the only 12 or 13 psychological findings that replicate! Because the 15 experiments were not a random sample of all psychology findings and it’s a small sample anyway, the percentage is not informative, as ManyLabs duly notes.

But even if we had an accurate estimate of the percentage of findings that replicate, how useful would that be? Rather than trying to arrive at a more precise estimate, it might be more informative to follow up the ManyLabs projects with efforts that focus on a specific research area or topic, as I proposed in my first-ever post, as this might lead to theory advancement.

So what DO we learn from the ManyLabs project? We learn that for some experiments, the replications actually yield much larger effects than the original studies, a highly intriguing finding that warrants further analysis.

We also learn that the two social priming studies in the sample, dangling at the bottom of the list in the figure, resoundingly failed to replicate. One study found that exposure to the United States flag increases conservatism among Americans; the other study found that exposure to money increases endorsement of the current social system. The replications show that there is essentially no effect whatsoever for either of these exposures.

It is striking how far the effect sizes of the original studies (indicated by an x) are removed from the rest of the experiments. There they are, by their lone selves at the bottom right of the figure. Given that all of the data from the replication studies have been posted online, it would be fascinating to get the data from the original studies. Comparisons of the various data sets might shed light on why these studies are such outliers.

We also learn that the online experiments in the project yielded results that are highly similar to those produced by lab experiments. This does not mean, of course, that any experiment can be transferred to an online environment, but it certainly inspires confidence in the utility of online experiments in replication research.

Most importantly, we learn that several labs working together yield data that have enormous evidentiary power. At the same time, it is clear that such large-scale replication projects will have diminishing returns (for example, the field cannot afford to devote countless massive replication efforts to not replicating all the social priming experiments that are out there). However, rather than using the ManyLabs approach retrospectively, we can also use it prospectively: to test novel hypotheses.

Here is how this might go.

(1) A group of researchers forms a hypothesis (not by pulling it out of thin air but by deriving it from a theory, obviously).
(2) They design—perhaps via crowdsourcing—the best possible experiment.
(3) They preregister the experiment.
(4) They post the protocol online.
(5) They simultaneously carry out the experiment in multiple labs.
(6) They analyze and meta-analyze the data (a rough sketch of this step follows the list).
(7) They post the data online.
(8) They write a kick-ass paper.
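
To make steps (5) through (7) concrete, here is a small simulated sketch of what the multi-lab analysis might look like. Everything in it is assumed for illustration: the true effect size, the number of labs, and the per-condition sample sizes are invented, and the pooling is a simple inverse-variance aggregation rather than any particular consortium's analysis plan.

```python
# Hypothetical sketch of steps (5)-(7): each lab runs the same two-condition
# experiment, its data are reduced to an effect size (Cohen's d), and the
# per-lab results are meta-analyzed. All values are simulated, not real data.
import numpy as np

rng = np.random.default_rng(0)
true_effect, n_labs, n_per_cell = 0.3, 10, 80   # assumed for illustration

results = []
for lab in range(n_labs):
    treatment = rng.normal(true_effect, 1.0, n_per_cell)  # experimental condition
    control = rng.normal(0.0, 1.0, n_per_cell)            # control condition

    sp = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)  # pooled SD
    d = (treatment.mean() - control.mean()) / sp                     # Cohen's d
    var_d = 2 / n_per_cell + d**2 / (4 * n_per_cell)   # approximate variance of d
    results.append((d, var_d))

d_vals, variances = map(np.array, zip(*results))
w = 1 / variances                                  # inverse-variance weights
d_meta = np.sum(w * d_vals) / np.sum(w)
print(f"meta-analytic d across {n_labs} labs: {d_meta:.2f}")
```

The per-lab raw data (the arrays generated above) are exactly what would be posted online in step (7), so that anyone could re-run the aggregation in step (6).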

And so I agree with the ManyLabs authors when they conclude that a consortium of laboratories could support one another by conducting similar large-scale investigations of original research questions, not just replications. Among the many accomplishments of the ManyLabs project, showing us the feasibility of this approach might well be the major one.