In my previous post I
described how direct replications provide insight into the reliability of
findings but not so much into their validity. Dan Simons yesterday wrote an insightful post
in response to this (I love the rapid but thoughtful scientific communication
afforded by blogs). As Dan said, we are basically in agreement: it is very
important to conduct direct replications and it is also important to assess the
validity of our findings.
Dan’s post made me think about this issue a little more and
I think I can articulate it more clearly now (even though I have just been
sitting in the sun drinking a couple of beers). To clarify, let me first
quote Dan when he describes my proposal:
This approach,
allowing each replication to vary provided that they follow a more general
script, might not provide a definitive test of the reliability of a
finding. Depending on how much each study deviated from the other studies, the
studies could veer into the territory of conceptual replication rather than
direct replication. Conceptual replications are great, and they do help to
determine the generality of a finding, but they don't test the reliability of a
finding.
There is no clear dividing line between direct and conceptual replications, but what I am advocating is not a set of conceptual replications. Here is why.
Most of the studies that discussions of reproducibility have focused on are what you might call one-shot between-subjects studies. Examples are the typical social priming studies, such as the by now notorious professor prime and bingo walking speed experiments. Another example is the free will study by Vohs
and Schooler that I discussed in an earlier
post. The verbal overshadowing experiment that was the topic of my previous post
is yet another example and the one I want to focus on here.
In these one-shot experiments the manipulation is between subjects and there is essentially only one prime-target pair. People see one video; then they either describe the bank robber or name capitals of U.S. states, and then both groups perform the same line-up test.
It is instructive
to contrast this type of design with that of a typical cognitive psychology
experiment. Let’s take a Stroop experiment. In a one-shot variant of this type of study, one group of subjects sees the word red printed in green ink (the incongruent, experimental condition) and the other group sees the word red in red ink (the congruent, control condition). We compute the mean naming time (across subjects) for each condition, compare the two, and voilà: Bob's our uncle.
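Just to make this concrete: in code, the one-shot analysis boils down to a simple between-groups comparison of mean naming times. Here is a minimal sketch with simulated data (all numbers are invented, purely for illustration):

```python
# Hypothetical one-shot Stroop analysis: two independent groups,
# one naming time per subject, compared with an independent-samples
# t-test. All numbers are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
incongruent = rng.normal(650, 60, 40)  # "red" in green ink (ms)
congruent = rng.normal(600, 60, 40)    # "red" in red ink (ms)

t, p = stats.ttest_ind(incongruent, congruent)
print(f"mean difference = {incongruent.mean() - congruent.mean():.1f} ms, "
      f"t = {t:.2f}, p = {p:.4f}")
```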
However, people would justifiably complain about this
experiment. Is red really a
representative color word? How about blue, yellow, green, purple, orange,
magenta, turquoise, and so on? And how do we know our one group of subjects is
comparable to the other?
This is why many cognitive experiments have a within-subjects repeated-measures design. In such a design each subject would see not only the word red in green ink but also red in red ink, as well as green in red ink and green in green ink, and so on, with order of presentation counterbalanced across subjects. This design allows us to assess, for each word, whether and by how much it is named faster in the congruent than in the incongruent condition.
The benefit of this design is that we will be able to assess whether our finding generalizes across items. In
the typical experiment not all items will show the effect, just like not all
subjects will show the effect. But thanks to early work by Herb
Clark and recent work by others,
we have methods at our disposal to assess the generalizability of our findings
across items.
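To give a flavor of what such an analysis looks like in practice, here is a minimal sketch of a mixed-effects model with crossed random intercepts for subjects and items, fit on simulated Stroop-like data. This is my illustration of the general technique, not a reconstruction of any specific published analysis, and all numbers are made up.

```python
# Sketch: crossed random effects for subjects and items, so that the
# congruency effect is tested against variability among items as well
# as among subjects. Simulated data for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects, n_items = 30, 8
subj_offsets = rng.normal(0, 30, n_subjects)  # per-subject speed (ms)
item_offsets = rng.normal(0, 20, n_items)     # per-item difficulty (ms)

rows = []
for s in range(n_subjects):
    for i in range(n_items):
        for congruent in (0, 1):
            rt = (600 - 50 * congruent + subj_offsets[s]
                  + item_offsets[i] + rng.normal(0, 40))
            rows.append({"subject": s, "item": i,
                         "congruent": congruent, "rt": rt})
df = pd.DataFrame(rows)
df["one_group"] = 1  # single dummy group; crossing is handled below

# Crossed random intercepts in statsmodels: variance components for
# subjects and items within one all-encompassing group.
model = smf.mixedlm(
    "rt ~ congruent", df, groups="one_group",
    vc_formula={"subject": "0 + C(subject)", "item": "0 + C(item)"},
)
print(model.fit().summary())
```

The point of the crossed random effects is that the congruency effect is now evaluated against item variability as well as subject variability, which is exactly the kind of generalization a one-shot study cannot address.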
It is logically impossible to analyze the generalizability across items for one-shot studies. I suspect this is what makes me uncomfortable about them and what prompted the proposal I made in my previous post. According to this proposal, a direct replication of the original verbal-overshadowing study would involve a single item pair (red-red, in the Stroop analogy). Another study would include a video and line-up that meet pre-specified constraints; this would constitute a second item pair (green-green), and so on. The next step would be to take a meta-perspective: a composite effect size across all the experiments (or something like it) would be the decisive test of the effect.
This is the analogy I am thinking of. Dan’s post made me realize that, in a way, you could call green-green a conceptual replication of red-red. However, I prefer to think of it as assessing the validity of a finding across a set of items drawn from a pool of possible items. Such a concerted replication effort, combined with a meta-analytic approach (sketched below), would have higher validity than a set of direct replications, and the loss in reliability would be relatively small.
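What might that meta-analytic step look like? Here is a minimal sketch of one common option, a fixed-effect (inverse-variance weighted) composite of per-study effect sizes; a random-effects model could serve equally well, and all numbers below are invented.

```python
# Fixed-effect meta-analysis sketch: each replication contributes an
# effect size weighted by the inverse of its variance; the composite
# and its standard error summarize the whole set of item pairs.
import numpy as np

effects = np.array([0.35, 0.12, 0.28, 0.05, 0.22])      # hypothetical d's
variances = np.array([0.020, 0.030, 0.025, 0.040, 0.020])

weights = 1.0 / variances
composite = np.sum(weights * effects) / np.sum(weights)
se = np.sqrt(1.0 / np.sum(weights))
z = composite / se
print(f"composite d = {composite:.3f} (SE = {se:.3f}, z = {z:.2f})")
```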
To be sure, I am not advocating following this approach
instead of performing direct replications. Rather, I propose that we do this in addition to direct replications,
possibly as a next step for findings that have proven directly replicable. Obviously, this approach is especially relevant to one-shot studies, but it might be applied more broadly. Reliability is important, but so is validity. Ultimately, we want to know how strong our theories are.