In 1907 Francis Galton (two years prior to becoming “Sir”) published
a paper
in Nature titled “Vox populi” (voice
of the people). With the rise of democracy in the (Western) world, he wondered
how much trust people could put in public judgments. How wise is the crowd, in
other words?
As luck would have it, a weight-judging competition was
carried on at the annual show of the West of England Fat Stock and Poultry Exhibition (sounds
like a great name for a band) in Plymouth. Visitors had to estimate the weight
of a prize-winning ox when slaughtered and “dressed” (meaning that its internal
organs would be removed).
Galton collected all 800 estimates. He removed thirteen (nicely explaining why) and then analyzed the remaining 787. He computed the median estimate and found that it was less than 1% off the ox's actual weight.
Galton concludes: "This result is, I think, more creditable to the trust-worthiness of a democratic judgment than might have been expected."
This may seem like a small step for Galton and a big step for the rest of us, but later research has confirmed that when people make estimates, the average of the group is more accurate than the estimates of most of the individuals. The effect hinges on the errors in the individual estimates being at least partly statistically independent of one another: independent errors tend to cancel out in the aggregate, whereas shared biases do not.
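To see why independence matters, here is a minimal simulation sketch (my own illustration, not Galton's data; the true weight and error sizes are made-up assumptions): each estimate is the truth plus a bias shared by everyone plus an independent personal error, and the crowd's median beats most individuals because the independent part washes out.

```python
# Minimal sketch of the wisdom-of-the-crowd effect (illustration only;
# all numbers are assumptions, not Galton's data).
import numpy as np

rng = np.random.default_rng(0)

true_weight = 1200                 # hypothetical "true" dressed weight (lbs)
n_people = 787                     # number of estimates, as in Galton's analysis

shared_bias = rng.normal(0, 15)                      # error common to everyone
own_error = rng.normal(0, 60, size=n_people)         # independent per-person error
estimates = true_weight + shared_bias + own_error

crowd_error = abs(np.median(estimates) - true_weight)
individual_errors = np.abs(estimates - true_weight)

print(f"error of the crowd's median: {crowd_error:.1f} lbs")
print(f"fraction of individuals the median beats: "
      f"{np.mean(individual_errors > crowd_error):.0%}")
```

In this toy model the median typically lands closer to the truth than the large majority of individual estimates; only the bias that everyone shares survives the aggregation.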
In 2008
Edward Vul and Hal Pashler gave an interesting twist to the wisdom of the
crowd idea. What would happen, they wondered, if you allow the same individual
to make two independent estimates? Would the average of these estimates be more
accurate than each of the individual estimates?
Vul and Pashler tested this idea by having 428 subjects guess answers to questions such as "What percentage of the world's airports are in the United States?" They further reasoned that the more
the estimates differed from each other, the more accurate their average would
be. To test this idea, they manipulated the time between the first and second
guess. One group second-guessed themselves immediately whereas the other group
made the second guess three weeks later.
Here is what Vul and Pashler found.
They did indeed observe that the average of the two guesses
was more accurate than each of the guesses separately (the green bars
representing the mean squared error are lower than the blue and red ones).
Furthermore, the effect of averaging was larger in the 3-week delay condition
than in the immediate condition.
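To make the logic of this comparison concrete, here is a small simulation sketch (my own illustration with made-up parameters, not Vul and Pashler's data): each guess is modeled as the true answer plus a stable personal bias plus noise, and the delayed second guess is assumed to share less of its noise with the first guess.

```python
# Sketch of the crowd-within comparison on simulated data (not Vul and
# Pashler's responses; all parameters below are assumptions).
import numpy as np

rng = np.random.default_rng(1)
n_subjects = 428
truth = 30.0        # hypothetical true answer (e.g., a percentage)
bias_sd = 8.0       # stable personal bias, shared by both guesses
noise_sd = 6.0      # independent noise in each guess

def mse(guesses):
    """Mean squared error of the guesses relative to the true value."""
    return np.mean((guesses - truth) ** 2)

bias = rng.normal(0, bias_sd, n_subjects)
noise1 = rng.normal(0, noise_sd, n_subjects)

# The immediate second guess is assumed to share most of its noise with the
# first guess (rho high); the 3-week delayed guess shares much less (rho low).
for label, rho in [("immediate", 0.8), ("3-week delay", 0.2)]:
    fresh = rng.normal(0, noise_sd, n_subjects)
    noise2 = rho * noise1 + np.sqrt(1 - rho**2) * fresh
    guess1 = truth + bias + noise1
    guess2 = truth + bias + noise2
    average = (guess1 + guess2) / 2
    print(f"{label}: MSE guess1={mse(guess1):.1f}, guess2={mse(guess2):.1f}, "
          f"average={mse(average):.1f}")
```

In this toy model the two guesses are equally accurate on their own in both conditions, but the average improves more when the errors of the two guesses are less correlated, which is exactly the intuition behind the three-week delay.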
Vul and Pashler conclude that forcing a second guess and averaging it with the first leads to higher accuracy than the first guess alone, and that this gain is enhanced by temporally separating the two guesses. So "sleeping on it" works.
How reproducible are these findings? That is what Sara
Steegen, Laura Dewitte, Francis Tuerlinckx, and Wolf
Vanpaemel set out to investigate in a preregistered replication of the Vul and
Pashler study in a special issue of Frontiers
in Cognition that I’m editing with my colleague René Zeelenberg.
Steegen and colleagues tested Flemish psychology students
rather than a more diverse sample. They obtained the following results.
Like Vul and Pashler, they obtained a crowd-within effect.
The average of the two guesses was more accurate than each of the guesses
separately both in the immediate and in the delayed condition. Unlike in Vul
and Pashler (2008), the accuracy gain of averaging both guesses compared to
guess 1 was not significantly larger in the delayed condition (although it was
in the same direction). Instead, the accuracy gain of the average was larger in
the delayed condition than in the immediate condition when it was compared to
the second guess.
So this replication attempt yields two important pieces of
information: (1) the crowd-within effect seems robust, (2) the effect of delay
on accuracy gain needs to be investigated more closely. It's not clear yet whether or when "sleeping on it" works.
Edward Vul, the first author of the original crowd-within paper, was a reviewer of the replication study. I like how he responded to the results in recommending acceptance of the paper:
"The authors carried out the replication as they had planned. I am delighted to see the robustness of the Crowd Within effect verified (a couple of non-preregistered and thus less-definitive replications had also found the effect within the past couple of years). Of course, I'm a bit disappointed that the results on replicating the contrast between immediate and delayed benefits are mixed, but that's what the data are.
The authors have my thanks for doing this service to the community" [quoted with permission]
"Unlike in Vul and Pashler (2008), the accuracy gain of averaging both guesses compared to guess 1 was not significantly larger in the delayed condition (although it was in the same direction)."
The fact that the effect was significant in one study but not the other is not very informative. As the saying goes, the difference between significant and nonsignificant is not, itself, necessarily significant.
The more interesting question is, was the effect observed in the replication significantly different from the effect observed in the original? I just blogged about this the other day: http://hardsci.wordpress.com/2014/07/01/some-thoughts-on-replication-and-falsifiability-is-this-a-chance-to-do-better/
Uri Simonsohn has proposed a different criterion for evaluating replications: a test of whether the replication rules out an effect big enough for the original study to have detected. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879
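For concreteness, one generic way to ask that question (a sketch, not the analysis used in either paper) is to treat the original and replication effect estimates as independent and test their difference with a z statistic; all numbers below are placeholders, not values from the studies.

```python
# Sketch: is the replication effect significantly different from the original?
# Generic z-test on two independent effect estimates; the numbers are placeholders.
from math import sqrt
from scipy.stats import norm

d_orig, se_orig = 0.30, 0.10   # placeholder original effect size and standard error
d_rep,  se_rep  = 0.05, 0.08   # placeholder replication effect size and standard error

z = (d_orig - d_rep) / sqrt(se_orig**2 + se_rep**2)
p = 2 * norm.sf(abs(z))        # two-sided p-value for the difference
print(f"difference between effects: z = {z:.2f}, p = {p:.3f}")
```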
Just a note about how figures are drawn. If you put the floor of the second figure at 400 (like the first figure), suddenly the two sets of results look a lot more similar.
--David Funder (not anonymous)
Or, more appropriately (in my view), put the floor of the first figure at 0, since it is meaningful and is the bottom of the scale.
With a d=.02 for the effect that failed to replicate, the solution is simple: run it on Facebook.
~ Nice summary and thanks for posting. One statement stands out that you did not address. Granted, it might not be relevant, but you note that "Steegen and colleagues tested Flemish psychology students rather than a more diverse sample." Given that variability of "the crowd" might contribute to the traditional "wisdom of the crowds" benefit of averaging (as obtained in the traditional studies on this effect), might greater variability in the Vul and Pashler study, as compared to Steegen et al., warrant further investigation?
Thanks, Steve. The question of interest concerns multiple estimates from the same subject, so it is a within-subjects measure.