Thursday, July 3, 2014

Is There Really a Crowd Within?

In 1907 Francis Galton (two years prior to becoming “Sir”) published a paper in Nature titled “Vox populi” (voice of the people). With the rise of democracy in the (Western) world, he wondered how much trust people could put in public judgments. How wise is the crowd, in other words?

As luck would have it, a weight-judging competition was carried on at the annual show of the West of England Fat Stock and Poultry Exhibition (sounds like a great name for a band) in Plymouth. Visitors had to estimate the weight of a prize-winning ox when slaughtered and “dressed” (meaning that its internal organs would be removed).

Galton collected all 800 estimates. He removed thirteen (and nicely explains why) and then analyzed the remaining 787. He computed the median estimate and found that it was within 1% of the ox’s actual weight. Galton concludes: “This result is, I think, more creditable to the trust-worthiness of a democratic judgment than might have been expected.”

This may seem like a small step to Galton and a big step to the rest of us, but later research has confirmed that when people make estimates, the group average is more accurate than the estimates of most of the individual group members. The effect depends on at least some of the errors in the individual estimates being statistically independent of one another.
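To see why independent errors matter, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration rather than Galton's data: the true value of 1000, the noise level, and the function name `simulate_crowd` are made up, and it averages with the mean whereas Galton reported the median. Each simulated guess is the true value plus independent random noise.

```python
import random
import statistics


def simulate_crowd(true_value=1000.0, n_people=787, noise_sd=75.0, seed=1):
    """Simulate independent guesses and compare the crowd to its members.

    All parameter values are hypothetical; they are not Galton's data.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each person's guess = true value + that person's own independent error.
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(n_people)]

    # Absolute error of the crowd's mean guess.
    crowd_error = abs(statistics.mean(guesses) - true_value)

    # Absolute error of every individual guess.
    individual_errors = [abs(g - true_value) for g in guesses]

    # Share of individuals whose own error is larger than the crowd's error.
    share_beaten = sum(err > crowd_error for err in individual_errors) / n_people

    return crowd_error, statistics.median(individual_errors), share_beaten


crowd_error, median_individual_error, share_beaten = simulate_crowd()
print(f"error of the crowd average:      {crowd_error:.1f}")
print(f"median error of an individual:   {median_individual_error:.1f}")
print(f"share of individuals beaten:     {share_beaten:.0%}")
```

Because the simulated errors are drawn independently, they largely cancel out in the average. If every guesser shared exactly the same error, averaging would not help at all.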

In 2008 Edward Vul and Hal Pashler gave an interesting twist to the wisdom of the crowd idea. What would happen, they wondered, if you allow the same individual to make two independent estimates? Would the average of these estimates be more accurate than each of the individual estimates?

Vul and Pashler tested this idea by having 428 subjects guess answers to questions such as “What percentage of the world’s airports are in the United States?” Vul and Pashler further reasoned that the more the estimates differed from each other, the more accurate their average would be. To test this idea, they manipulated the time between the first and second guess. One group second-guessed themselves immediately whereas the other group made the second guess three weeks later.
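The reasoning that less similar guesses should yield a better average can be made concrete with a little arithmetic. The sketch below is not Vul and Pashler's analysis; it assumes a simplified model in which both guesses from one person share the same bias and the same noise variance, and the noise in the two guesses has correlation `rho`. Under those assumptions the mean squared error of a single guess is bias² + variance, and the mean squared error of the average is bias² + variance × (1 + rho) / 2. The function names and the numbers are hypothetical.

```python
def mse_single(bias, noise_var):
    """Mean squared error of one guess: squared bias plus noise variance."""
    return bias ** 2 + noise_var


def mse_average(bias, noise_var, rho):
    """Mean squared error of the average of two guesses.

    Assumes both guesses have the same bias and noise variance, and that
    their noise terms are correlated with coefficient rho.
    """
    return bias ** 2 + noise_var * (1 + rho) / 2


# Hypothetical values, chosen only to illustrate the pattern.
bias, noise_var = 5.0, 400.0

print(f"single guess: MSE = {mse_single(bias, noise_var):.0f}")
for rho in (1.0, 0.8, 0.4, 0.0):
    print(f"average, rho = {rho:.1f}: MSE = {mse_average(bias, noise_var, rho):.0f}")
```

In this model, two guesses with perfectly correlated noise (rho = 1) gain nothing from averaging, and the gain grows as the correlation drops. Averaging never removes the shared bias, which is one reason a crowd within cannot fully replace a crowd of different people.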

Here is what Vul and Pashler found.   

They did indeed observe that the average of the two guesses was more accurate than each of the guesses separately (the green bars representing the mean squared error are lower than the blue and red ones). Furthermore, the effect of averaging was larger in the 3-week delay condition than in the immediate condition.

Vul and Pashler conclude that forcing a second guess leads to higher accuracy than is obtained by a first guess and that this gain is enhanced by temporally separating the two guesses. So "sleeping on it" works.

How reproducible are these findings? That is what Sara Steegen, Laura Dewitte, Francis Tuerlinckx, and Wolf Vanpaemel set out to investigate in a preregistered replication of the Vul and Pashler study in a special issue of Frontiers in Cognition that I’m editing with my colleague René Zeelenberg. 

Steegen and colleagues tested Flemish psychology students rather than a more diverse sample. They obtained the following results.

Like Vul and Pashler, they obtained a crowd-within effect. The average of the two guesses was more accurate than each of the guesses separately both in the immediate and in the delayed condition. Unlike in Vul and Pashler (2008), the accuracy gain of averaging both guesses compared to guess 1 was not significantly larger in the delayed condition (although it was in the same direction). Instead, the accuracy gain of the average was larger in the delayed condition than in the immediate condition when it was compared to the second guess.

So this replication attempt yields two important pieces of information: (1) the crowd-within effect seems robust, (2) the effect of delay on accuracy gain needs to be investigated more closely. It's not clear yet whether or when "sleeping on it" works.

Edward Vul, the first author of the original crowd-within paper, was a reviewer of the replication study. I like how he responded to the results in recommending acceptance of the paper:

"The authors carried out the replication as they had planned. I am delighted to see the robustness of the Crowd Within effect verified (a couple of non-preregistered and thus less-definitive replications had also found the effect within the past couple of years). Of course, I'm a bit disappointed that the results on replicating the contrast between immediate and delayed benefits are mixed, but that's what the data are.

"The authors have my thanks for doing this service to the community" [quoted with permission]

 Duly noted. 


  1. "Unlike in Vul and Pashler (2008), the accuracy gain of averaging both guesses compared to guess 1 was not significantly larger in the delayed condition (although it was in the same direction)."

    The fact that the effect was significant in one study but not the other is not very informative. As the saying goes, the difference between significant and nonsignificant is not, itself, necessarily significant.

    The more interesting question is, was the effect observed in the replication significantly different from the effect observed in the original? I just blogged about this the other day:

    Uri Simonsohn has proposed a different criterion for evaluating replications: a test of whether the replication rules out an effect big enough for the original study to have detected.

  2. Just a note about how figures are drawn. If you put the floor of the second figure at 400 (like the first figure), suddenly the two sets of results look a lot more similar.
    --David Funder (not anonymous)

    1. Or more appropriately (in my view) put the floor of the first figure at 0, since it is meaningful and is the bottom of the scale.

  3. With a d=.02 for the effect that failed to replicate, the solution is simple: run it on Facebook.

  4. Nice summary and thanks for posting. One statement stands out that you did not address. Granted, it might not be relevant, but you note that "Steegen and colleagues tested Flemish psychology students rather than a more diverse sample." Given that variability of "the crowd" might contribute to the traditional "wisdom of the crowds" benefit of averaging (as obtained in the traditional studies on this effect), might greater variability in the Vul and Pashler study, as compared to Steegen et al., warrant further investigation?

    1. Thanks, Steve. The question of interest concerns multiple estimates from the same subject, so it is a within-subjects measure.