Wednesday, July 9, 2014

Developing Good Replication Practices

In my last post, I described a (mostly) successful replication by Steegen et al. of the “crowd-within effect.” The authors felt it would be nice to mention all the good research practices they had implemented in their replication effort.

And indeed, positive psychologist that I am, I would be remiss if I didn’t extol the virtues of the approach in that exemplary replication paper, so here goes.

Make sure you have sufficient power.
We all know this, right?
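
To make this concrete, here is a minimal sketch of such a power check in Python with statsmodels; the effect size and power level below are placeholders, not values taken from the replication.

# How many participants does a paired design need? (illustrative numbers only)
from statsmodels.stats.power import TTestPower

n_needed = TTestPower().solve_power(effect_size=0.5, alpha=0.05,
                                    power=0.90, alternative='two-sided')
print(f"Required sample size: about {n_needed:.0f} participants")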

Preregister your hypotheses, analyses, and code.
I like how the replication authors went all out in preregistering their study. It is certainly important to have the proposed analyses and code worked out up front.

Make a clear distinction between confirmatory and exploratory analyses.
The authors did here exactly as the doctor, A.D. de Groot in this case, ordered. It is very useful to perform exploratory analyses, but they should be clearly separated from the confirmatory ones.

Report effect sizes.
Yes.
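
For a within-subject comparison like the one at issue here, the effect size takes only a few lines to compute; a sketch with made-up numbers (not data from either study):

import numpy as np

# Made-up error scores for two conditions (e.g., first guess vs. average of two guesses).
error_guess1 = np.array([12.0, 9.5, 14.2, 8.8, 11.1, 10.4])
error_average = np.array([10.1, 8.0, 12.5, 8.9, 9.0, 9.2])

diff = error_guess1 - error_average
cohens_dz = diff.mean() / diff.std(ddof=1)   # effect size for paired data
print(f"Cohen's dz = {cohens_dz:.2f}")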

Use both estimation and testing, so your data can be evaluated more broadly by people of different statistical persuasions.
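
One simple way to serve both purposes is to report a confidence interval alongside the test statistic; a sketch with hypothetical paired differences:

import numpy as np
from scipy import stats

# Hypothetical paired differences in error (first guess minus average of two guesses).
diff = np.array([1.9, 1.5, 1.7, -0.1, 2.1, 1.2, 0.8, 1.6])

# Testing: one-sample t-test of the differences against zero.
t, p = stats.ttest_1samp(diff, 0.0)

# Estimation: mean difference with a 95% confidence interval.
mean, sem = diff.mean(), stats.sem(diff)
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1, loc=mean, scale=sem)

print(f"t({len(diff) - 1}) = {t:.2f}, p = {p:.3f}")
print(f"mean difference = {mean:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")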

Use both frequentist and Bayesian analyses.
Yes, why risk being pulled over by a Bayes trooper or having a run-in with the Frequentist militia? Again, using multiple analyses allows your results to be evaluated more broadly.
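
This is not the analysis the replication authors ran, but as a rough illustration a Bayes factor can be attached to the same hypothetical differences via the BIC approximation (Wagenmakers, 2007), right next to the frequentist t-test:

import numpy as np
from scipy import stats

diff = np.array([1.9, 1.5, 1.7, -0.1, 2.1, 1.2, 0.8, 1.6])  # hypothetical paired differences
n = len(diff)

# Frequentist: t-test of the mean difference against zero.
t, p = stats.ttest_1samp(diff, 0.0)

# Bayesian (approximate): BIC-based Bayes factor for H1 (free mean) over H0 (mean = 0).
rss0 = np.sum(diff ** 2)                      # residual sum of squares under H0
rss1 = np.sum((diff - diff.mean()) ** 2)      # residual sum of squares under H1
bic0 = n * np.log(rss0 / n) + 1 * np.log(n)   # one free parameter (sigma)
bic1 = n * np.log(rss1 / n) + 2 * np.log(n)   # two free parameters (mean, sigma)
bf10 = np.exp((bic0 - bic1) / 2)

print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}, BF10 = {bf10:.1f}")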

Adopt a co-pilot multi-software approach.
A mistake in data analysis is easily made, so it makes sense to have two or more researchers analyze the data from scratch. A co-author and I used a co-pilot approach as well in a recent paper (without knowing the cool name for this approach, otherwise we would have bragged about it in the article). We discovered tiny discrepancies between our analyses, with each of us making a small error here and there. The discrepancies were easily resolved, but the errors probably would have gone undetected had we not used the co-pilot approach. Using a multi-software approach seems a good additional way to minimize the likelihood of errors.
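
For what it's worth, the comparison step itself can also be automated; here is a sketch (the file names and tolerance are hypothetical) of how two independently produced results files might be cross-checked:

import pandas as pd

# Each analyst exports their key statistics to a CSV file (names are made up).
pilot = pd.read_csv("results_pilot.csv", index_col="statistic")
copilot = pd.read_csv("results_copilot.csv", index_col="statistic")

# Flag any statistic on which the two analyses disagree beyond a tiny tolerance.
disagrees = (pilot["value"] - copilot["value"]).abs() > 1e-6
if disagrees.any():
    print("Discrepancies found for:", list(pilot.index[disagrees]))
else:
    print("The two independent analyses agree.")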

Make the raw and processed data available.
When you ask people to share their data, they typically send you the processed data, but the raw data are often more useful. The combination is even more useful, as it allows other researchers to retrace the steps from raw to processed data.

Use multiple ways to assess replication success.
This is a good idea in the current climate where the field has not settled on a single method yet. Again, it allows the results to be evaluated more broadly than with a single-method approach.
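
As an illustration (with placeholder numbers, not the actual values from the studies), two of the more common criteria look like this:

# Two common replication criteria, applied to placeholder numbers.
original_effect = 0.45          # effect size reported in the original study
replication_ci = (0.10, 0.52)   # 95% CI around the replication effect size
replication_p = 0.012           # p-value of the replication effect
same_direction = True           # replication effect has the same sign as the original

criterion_1 = replication_p < .05 and same_direction
criterion_2 = replication_ci[0] <= original_effect <= replication_ci[1]

print(f"Significant in the original direction: {criterion_1}")
print(f"Original effect inside replication CI: {criterion_2}")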

“Maybe these methodological strengths are worth mentioning too?” the first author of the replication study, Sara Steegen, suggested in an email.

Check.


I thank Sara Steegen for feedback on a previous version of this post.

Thursday, July 3, 2014

Is There Really a Crowd Within?

In 1907 Francis Galton (two years prior to becoming “Sir”) published a paper in Nature titled “Vox populi” (voice of the people). With the rise of democracy in the (Western) world, he wondered how much trust people could put in public judgments. How wise is the crowd, in other words?

As luck would have it, a weight-judging competition was carried on at the annual show of the West of England Fat Stock and Poultry Exhibition (sounds like a great name for a band) in Plymouth. Visitors had to estimate the weight of a prize-winning ox when slaughtered and “dressed” (meaning that its internal organs would be removed).

Galton collected all 800 estimates. He removed thirteen (and nicely explains why) and then analyzed the remaining 787. He computed the median estimate and found that it was within 1% of the ox’s actual weight. Galton concludes: “This result is, I think, more creditable to the trust-worthiness of a democratic judgment than might have been expected.”

This may seem like a small step to Galton and a big step to the rest of us, but later research has confirmed that in making estimates the average of a group of people is more accurate than the predictions of most of the individuals. The effect hinges on the errors in the individual estimates being, at least in part, statistically independent of one another.
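
A toy simulation (made-up numbers, not Galton's data) shows why: the independent part of each judge's error largely cancels in the aggregate, while any bias shared by the whole crowd does not.

import numpy as np

rng = np.random.default_rng(0)
true_weight = 1200                       # arbitrary "true" value
n_judges = 800

# Each judge's estimate = truth + a bias shared by everyone + independent noise.
shared_bias = rng.normal(0, 20)
estimates = true_weight + shared_bias + rng.normal(0, 80, size=n_judges)

crowd_error = abs(np.median(estimates) - true_weight)
individual_errors = np.abs(estimates - true_weight)

print(f"Crowd (median) error: {crowd_error:.1f}")
print(f"Share of individuals the crowd beats: {np.mean(individual_errors > crowd_error):.0%}")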

In 2008 Edward Vul and Hal Pashler gave an interesting twist to the wisdom of the crowd idea. What would happen, they wondered, if you allow the same individual to make two independent estimates? Would the average of these estimates be more accurate than each of the individual estimates?

Vul and Pashler tested this idea by having 428 subjects guess answers to questions such as “What percentage of the world’s airports are in the United States?” Vul and Pashler further reasoned that the more the two estimates differed from each other, the more accurate their average would be. To test this idea, they manipulated the time between the first and second guess. One group second-guessed themselves immediately, whereas the other group made the second guess three weeks later.

Here is what Vul and Pashler found.   



They did indeed observe that the average of the two guesses was more accurate than each of the guesses separately (the green bars representing the mean squared error are lower than the blue and red ones). Furthermore, the effect of averaging was larger in the 3-week delay condition than in the immediate condition.
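
For readers who want to see the logic of that comparison in code, here is a toy version with simulated guesses (not Vul and Pashler's data): the averaged guess has a lower mean squared error because the independent part of the two errors partly cancels.

import numpy as np

rng = np.random.default_rng(1)
truth = 30.0                              # made-up true answer (in percent)
n_subjects = 428

# Two guesses per subject: a shared, subject-specific bias plus independent noise.
subject_bias = rng.normal(0, 8, size=n_subjects)
guess1 = truth + subject_bias + rng.normal(0, 10, size=n_subjects)
guess2 = truth + subject_bias + rng.normal(0, 10, size=n_subjects)
average = (guess1 + guess2) / 2

def mse(guesses):
    return np.mean((guesses - truth) ** 2)

print(f"MSE guess 1: {mse(guess1):.1f}")
print(f"MSE guess 2: {mse(guess2):.1f}")
print(f"MSE average: {mse(average):.1f}")   # lower: the independent noise partly averages out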

Vul and Pashler conclude that forcing a second guess leads to higher accuracy than is obtained by a first guess and that this gain is enhanced by temporally separating the two guesses. So "sleeping on it" works.

How reproducible are these findings? That is what Sara Steegen, Laura Dewitte, Francis Tuerlinckx, and Wolf Vanpaemel set out to investigate in a preregistered replication of the Vul and Pashler study in a special issue of Frontiers in Cognition that I’m editing with my colleague René Zeelenberg. 

Steegen and colleagues tested Flemish psychology students rather than a more diverse sample. They obtained the following results.


Like Vul and Pashler, they obtained a crowd-within effect: the average of the two guesses was more accurate than each of the guesses separately, both in the immediate and in the delayed condition. Unlike in Vul and Pashler (2008), the accuracy gain of averaging both guesses compared to guess 1 was not significantly larger in the delayed condition (although it was in the same direction). Instead, the accuracy gain of the average was larger in the delayed condition than in the immediate condition when it was compared to the second guess.

So this replication attempt yields two important pieces of information: (1) the crowd-within effect seems robust, and (2) the effect of delay on accuracy gain needs to be investigated more closely. It's not clear yet whether or when "sleeping on it" works.

Edward Vul, the first author of the original crowd-within paper, was a reviewer of the replication study. I like how he responded to the results in recommending acceptance of the paper:

"The authors carried out the replication as they had planned. I am delighted to see the robustness of the Crowd Within effect verified (a couple of non-preregistered and thus less-definitive replications had also found the effect within the past couple of years). Of course, I'm a bit disappointed that the results on replicating the contrast between immediate and delayed benefits are mixed, but that's what the data are.

The authors have my thanks for doing this service to the community" [quoted with permission]

Duly noted.