Wednesday, July 9, 2014

Developing Good Replication Practices

In my last post, I described a (mostly) successful replication by Steegen et al. of the “crowd-within effect.” The authors felt that it would be nice to mention all the good research practices that they had implemented in their replication effort.

And indeed, positive psychologist that I am, I would be remiss if I didn’t extol the virtues of the approach in that exemplary replication paper, so here goes.

Make sure you have sufficient power.
We all know this, right?
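
For readers who want a number to go with the slogan, here is a minimal sketch of a sample-size calculation. It is not taken from the replication paper. It assumes a two-sided comparison of two independent groups and uses the normal approximation, which runs slightly below what an exact t-based calculation gives; the function name `n_per_group` and its default values are my own choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a
    standardized mean difference d (Cohen's d) between two groups.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value, two-sided test
    z_power = z.inv_cdf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# A medium effect needs far fewer participants than a small one.
print(n_per_group(0.5))
print(n_per_group(0.2))
```

Halving the expected effect size roughly quadruples the required sample, which is why an optimistic effect-size guess is the quickest route to an underpowered study.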

Preregister your hypotheses, analyses, and code.
I like how the replication authors went all out in preregistering their study. It is certainly important to have the proposed analyses and code worked out up front.

Make a clear distinction between confirmatory and exploratory analyses.
The authors did exactly as the doctor (A.D. de Groot, in this case) ordered. Exploratory analyses are very useful, but they should be clearly separated from the confirmatory ones.

Report effect sizes.

Use both estimation and testing, so your data can be evaluated more broadly, by people from different statistical persuasions.

Use both frequentist and Bayesian analyses.
Yes, why risk being pulled over by a Bayes trooper or having a run-in with the Frequentist militia? Again, using multiple analyses allows your results to be evaluated more broadly.

Adopt a co-pilot multi-software approach.
A mistake in data analysis is easily made and so it makes sense to have two or more researchers analyse the data from scratch. A co-author and I used a co-pilot approach as well in a recent paper (without knowing the cool name for this approach, otherwise we would have bragged about it in the article). We discovered that there were tiny discrepancies between our analyses with each of us making a small error here and there. The discrepancies were easily resolved but the errors probably would have gone undetected had we not used the co-pilot approach. Using a multi-software approach seems a good additional way to minimize the likelihood of errors.
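
The comparison step of a co-pilot check is easy to automate. The sketch below is only an illustration of the idea, not the procedure used in the replication paper or in our own article: it assumes each analyst stores the key statistics in a dictionary, and the function name, the tolerances, and all the numbers are invented for the example.

```python
import math


def find_discrepancies(pilot, copilot, rel_tol=1e-6, abs_tol=1e-9):
    """Return the statistics on which two independent analyses disagree.

    Each argument maps a statistic's name to its value. The result maps
    every disputed name to the pair (pilot value, co-pilot value).
    """
    problems = {}
    for name in sorted(set(pilot) | set(copilot)):
        a, b = pilot.get(name), copilot.get(name)
        if a is None or b is None:
            # One analyst did not report this statistic at all.
            problems[name] = (a, b)
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            problems[name] = (a, b)
    return problems


# Invented numbers: the two analysts agree on everything except t.
pilot = {"mean_diff": 0.412, "t": 2.87, "n": 140}
copilot = {"mean_diff": 0.412, "t": 2.78, "n": 140}
print(find_discrepancies(pilot, copilot))
```

The tolerances matter more than they look: two software packages will rarely agree to the last decimal, so the check should flag differences that change a reported value, not harmless rounding noise.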

Make the raw and processed data available.
When you ask people to share their data, they typically send you the processed data but the raw data are often more useful. The combination is even more useful as it allows other researchers to retrace the steps from raw to processed data. 

Use multiple ways to assess replication success.
This is a good idea in the current climate where the field has not settled on a single method yet. Again, it allows the results to be evaluated more broadly than with a single-method approach.

“Maybe these methodological strengths are worth mentioning too?” the first author of the replication study, Sara Steegen, suggested in an email.


I thank Sara Steegen for feedback on a previous version of this post.


  1. How many of these are not also good practices for non-replication studies?

    Joking aside: The "co-pilot multi-software" approach is a good point that should be made more widely. Interpreting results is increasingly a question of trusting that the authors have operated (and, in the case of R or Mplus, etc., programmed) the software correctly. Presumably with SPSS and SAS we have fewer arithmetic errors than in the days of log tables and slide rules, but I wonder if we don't perhaps have more methodological errors; those software packages will spit out some number, any number, more or less regardless of what data you throw at them, and whether those data are in the right columns.

    In an article which I currently have in press (an unsuccessful reproduction of results from a published dataset; the original results were caused by some spectacular errors in the authors' statistical analyses), I got two independent researchers to reproduce my numbers, starting with the original raw data, the original article, and a brief explanatory e-mail from the original author saying how the method worked. So either we're right, or we're all making the same mistakes. (That last sentence is, of course, a good general description of any snapshot of the state of science!)

    While doing that, however, we discovered some interesting discrepancies between tools. For example, if you have a missing value for one of the IVs in a regression in SPSS or R, the entire record for that subject will be ignored, whereas Matlab will, by default, silently insert a zero.

    1. Yes, leave the jokes to me, will you? ;)

      I'm glad you're endorsing the "co-pilot multi-software" approach. It is also a way to minimize confirmation bias. A mistake that produces a pattern in line with the hypotheses will be less likely to be detected than one that destroys a predicted effect.
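
The missing-value discrepancy raised in the comments above is easy to demonstrate without reference to any particular package. The sketch below makes no claim about what SPSS, R, or Matlab actually do by default. It only shows, on invented data, how much a regression slope can move depending on whether a record with a missing predictor is dropped or the gap is filled with a zero. The helper names are hypothetical.

```python
def slope(xs, ys):
    """Ordinary least-squares slope for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def listwise_delete(xs, ys):
    """Drop every record whose predictor is missing (None)."""
    kept = [(x, y) for x, y in zip(xs, ys) if x is not None]
    return [x for x, _ in kept], [y for _, y in kept]


def zero_fill(xs, ys):
    """Replace a missing predictor with 0 and keep the record."""
    return [0 if x is None else x for x in xs], list(ys)


# Invented data in which y is exactly 2 * x; one predictor is missing.
x = [1, 2, 3, None, 5]
y = [2, 4, 6, 8, 10]

slope_deleted = slope(*listwise_delete(x, y))
slope_filled = slope(*zero_fill(x, y))
print(slope_deleted, slope_filled)
```

With the record dropped, the slope is recovered exactly; with a zero inserted, it falls to less than half its true value, and nothing in the output warns you. Running the same analysis in two packages is one way to catch this.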