In my previous post I described how direct replications provide insight into the reliability of findings but not so much their validity. Dan Simons yesterday wrote an insightful post in response to this (I love the rapid but thoughtful scientific communication afforded by blogs). As Dan said, we are basically in agreement: it is very important to conduct direct replications and it is also important to assess the validity of our findings.
Dan’s post made me think about this issue a little more and I think I can articulate it more clearly now (even though I have just been sitting in the sun drinking a couple of beers). To clarify, let me first quote Dan when he describes my proposal:
"This approach, allowing each replication to vary provided that they follow a more general script, might not provide a definitive test of the reliability of a finding. Depending on how much each study deviated from the other studies, the studies could veer into the territory of conceptual replication rather than direct replication. Conceptual replications are great, and they do help to determine the generality of a finding, but they don't test the reliability of a finding."
There is no clear dividing line between direct and conceptual replications but what I am advocating are not conceptual replications. Here is why.
Most of the studies that have been focused on in discussions on reproducibility are what you might call one-shot between-subjects studies. Examples are the typical social priming studies such as the by now notorious professor prime and bingo walking speed experiments. Another example is the free will study by Vohs and Schooler that I discussed in an earlier post. The verbal overshadowing experiment that was the topic of my previous post is yet another example and the one I want to focus on here.
In these one-shot experiments the manipulation is between subjects and there essentially is only one prime-target pair. People see one video. Then they either describe the bank robber or they name capitals of U.S. states and then both groups perform the same line-up test.
It is instructive to contrast this type of design with that of a typical cognitive psychology experiment. Let’s take a Stroop experiment. In a one-shot variant of this type of study one group of subjects sees the word red in green (experimental condition) and the other group sees the word red in red (control condition). We compute the mean naming time (across subjects) for each condition, compare them and voilà: Bob's our uncle.
However, people would justifiably complain about this experiment. Is red really a representative color word? How about blue, yellow, green, purple, orange, magenta, turquoise, and so on? And how do we know our one group of subjects is comparable to the other?
This is why many cognitive experiments have a within-subjects repeated-measures design. In such a design each subject would see not only red printed in red but also red printed in green, as well as green printed in green and green printed in red, and so on, with order of presentation counterbalanced across subjects. This design allows us to assess for each word whether, and by how much, it is read faster in the congruent than in the incongruent condition.
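To make the item-level comparison concrete, here is a minimal sketch in Python. The naming times, subject labels, and the function name item_effects are all made up for illustration; they are not data from any actual Stroop experiment. The sketch only shows the idea of computing a congruency effect separately for each word.

```python
from statistics import mean

# Hypothetical naming times in ms: (subject, word, condition, rt).
# These numbers are invented purely to illustrate the design.
trials = [
    ("s1", "red", "congruent", 520), ("s1", "red", "incongruent", 610),
    ("s1", "green", "congruent", 540), ("s1", "green", "incongruent", 600),
    ("s2", "red", "congruent", 500), ("s2", "red", "incongruent", 590),
    ("s2", "green", "congruent", 560), ("s2", "green", "incongruent", 580),
]

def item_effects(trials):
    """Return {word: mean incongruent RT minus mean congruent RT}."""
    by_cell = {}
    for _subject, word, condition, rt in trials:
        by_cell.setdefault((word, condition), []).append(rt)
    words = sorted({word for word, _condition in by_cell})
    return {
        w: mean(by_cell[(w, "incongruent")]) - mean(by_cell[(w, "congruent")])
        for w in words
    }

effects = item_effects(trials)
for word, effect in effects.items():
    print(f"{word}: congruency effect of {effect} ms")
```

In this toy data set both words show an effect, but of different sizes, which is exactly the kind of variation across items that a one-shot design cannot reveal.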
The benefit of this design is that we will be able to assess whether our finding generalizes across items. In the typical experiment not all items will show the effect, just like not all subjects will show the effect. But thanks to early work by Herb Clark and recent work by others, we have methods at our disposal to assess the generalizability of our findings across items.
It is logically impossible to analyze the generalizability across items for one-shot studies. And I guess that this is what makes me uncomfortable about them and what has prompted the proposal I made in my previous post. According to this proposal, in the verbal-overshadowing study a direct replication of the original study would be one item pair, for example red-red. Another study would include a video and line-up that meet pre-specified constraints; this would constitute the second item pair (green-green) and so on. The next step would involve taking a meta-perspective. A composite effect size of all the experiments (or something like it) would be the decisive test of the effect.
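One way such a composite effect size could be computed is an inverse-variance weighted mean of the individual studies' effect sizes, the standard fixed-effect estimate in meta-analysis. The sketch below is only an illustration under that assumption: the effect sizes and variances are invented, and the function name composite_effect is hypothetical. Because the studies deliberately differ in their materials, a random-effects model might well be the more natural choice in practice.

```python
import math

def composite_effect(effects, variances):
    """Fixed-effect (inverse-variance weighted) pooled effect and its standard error."""
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect size, and at least one study")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Invented effect sizes (e.g., standardized mean differences) and their
# sampling variances for three studies that each use a different item pair.
d = [0.40, 0.10, 0.25]
var = [0.04, 0.05, 0.04]

pooled, se = composite_effect(d, var)
print(f"composite effect = {pooled:.3f}, standard error = {se:.3f}")
```

More precise studies (those with smaller variances) get more weight, so no single video and line-up combination decides the outcome on its own.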
This is the analogy I am thinking of. Dan’s post made me realize that in a way you could call green-green a conceptual replication of red-red. However, I prefer to think of it as assessing the validity of a finding across a set of items that are pulled from a pool of possible items. Such a concerted replication effort plus meta-analytic approach would have higher validity than a set of direct replications and the loss in reliability would be relatively small.
To be sure, I am not advocating following this approach instead of performing direct replications. Rather, I propose that we do this in addition to direct replications, possibly as a next step for findings that have proven directly replicable. Obviously, this approach is especially relevant with regard to one-shot studies but it might be applied more broadly.
Reliability is important but so is validity. We ultimately want to know what the strength of our theories is.