
p=.20, what now? Adventures of the Good Ship DataPoint

You’ve dutifully conducted a power analysis, determined your sample size, and run your experiment. Alas, p=.20. What now? Let’s find out.

The Good Ship DataPoint*
Perspectives on Psychological Science’s first registered replication project, RRR1, targeted verbal overshadowing: the phenomenon that describing a visual stimulus, in this case a human face, is detrimental to later recognition of that face compared to not describing the stimulus. A meta-analysis of 31 direct replications of the original finding provided evidence of verbal overshadowing. Subjects who described the suspect were 16% less likely to make a correct identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually replicate) the verbal overshadowing effect for her master’s thesis by using different stimuli and a different distractor task. I’m not going to talk about the contents of the research here. I simply want to address the question that’s posed in the title of this post: p=.20, what now? Because p=.20 is what we found after having run 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1's 16%.** 

Option 1. The effect is not significant, so this conceptual replication “did not work,” let’s file drawer the sucker. This response is probably still very common but it contributes to publication bias.

Option 2. We consider this a pilot study and now perform a power analysis based on it and run a new (and much larger) batch of subjects. The old data are now meaningless for hypothesis testing. This is better than option 1 but is rather wasteful. Why throw away a perfectly good data set?

Option 3. Our method wasn’t sensitive enough. Let’s improve it and then run a new study. Probably a very common response. But it may be premature and is not guaranteed to lead to a more decisive result. And you’re still throwing away the old data (see option 1).

[Image: Liverpool FC, victorious in the 2005 Champions League final in Istanbul after overcoming a 3-0 deficit against AC Milan]
Option 4. The effect is not significant, but if we also report the Bayes factor, we can at least say something meaningful about the null hypothesis and maybe get it published. This seems to be becoming more common. It is not a bad idea as such, but it is likely to get misinterpreted as: H0 is true (even by the researchers themselves). The Bayes factor tells us something about the support for a hypothesis relative to some other hypothesis given the data such as they are. And what the data are here is: too few. We found BF10=.21, which translates to about 5 times more evidence for H0 than for H1, but this is about as meaningful as the score in a soccer match after 30 minutes of play. Sure, H0 is ahead, but H1 might well score a come-from-behind victory. There are, after all, 60 more minutes to play!

Option 5. The effect is not significant, but we’ll keep on testing until it is. Simmons et al. have provided a memorable illustration of how problematic optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate the false-positive rate from 5% to 20%. Still, the intuition that it would be useful to test more subjects is a good one. And that leads us to…

Option 6. The result is ambiguous, so let’s continue testing—in a way that does not inflate the Type I error rate—until we have decisive information or we've run out of resources. Researchers have proposed several sequential-testing procedures that preserve the nominal error rate. Eric-Jan Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian framework, and Daniël Lakens has described sequential testing as it is performed in the medical sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998), which to date has been cited only 17 times on Google Scholar. I will report Bayes factors as well. The method described by Lakens could not be used in this case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly simple: if your p-value is >.01 and <.36, keep on testing until it crosses one of these limits.*** Frick’s simulations show that this procedure keeps the overall alpha level under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We decided to add subjects in batches of 60 (barring exclusions) so as not to overshoot and yet make our additions substantive. If DataPoint failed to reach shore before we'd reached 500 subjects, we would abandon ship.
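For readers who want to try this themselves, the stopping rule is easy to simulate. Below is a minimal Python sketch (our own analyses were done in R; the two-proportion z-test, the 50% baseline identification rate, and the even split over conditions are simplifying assumptions of mine) estimating the false-positive rate of COAST under H0, with a first batch of 148, additions of 60, and a cap of roughly 500 subjects:

```python
import numpy as np
from scipy import stats

def coast_run(rng, p0=0.5, first=148, batch=60, max_n=500, lo=0.01, hi=0.36):
    """One COAST sequence under H0: both conditions share the same
    identification rate p0. Returns the two-sided p-value at stopping."""
    k = np.zeros(2)  # correct identifications per condition
    m = np.zeros(2)  # subjects per condition
    n = 0
    while True:
        size = first if n == 0 else batch
        half = size // 2
        k += rng.binomial([half, size - half], p0)
        m += [half, size - half]
        n += size
        # two-proportion z-test on all data collected so far
        pooled = k.sum() / m.sum()
        se = np.sqrt(pooled * (1 - pooled) * (1 / m[0] + 1 / m[1]))
        p = 2 * stats.norm.sf(abs((k[0] / m[0] - k[1] / m[1]) / se))
        if p < lo or p > hi or n >= max_n:  # reached the coast, or out of supplies
            return p

rng = np.random.default_rng(2015)
pvals = [coast_run(rng) for _ in range(4000)]
print("false-positive rate:", np.mean([p < 0.01 for p in pvals]))
```

If COAST controls alpha as Frick's simulations indicate, the false-positive rate printed at the end should stay below .05.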

[Image: Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability]

Batch 2: Ntotal=202, p=.047. People who use optional stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already peeked.) We’re using COAST, however, and although the good ship DataPoint is in the shallows of the Rectangular Sea of Probability, it has not reached the coast. And BF10=0.6, still leaning toward H0.

Batch 3: Ntotal=258, p=.013, BF10=1.95. We’re getting encouraging reports from the crow’s nest. The DataPoint crew will likely not succumb to scurvy after all! And the BF10 now favors H1.

Batch 4: Ntotal=306, p=.058, BF10=.40. What’s this??? The wind has taken a treacherous turn and we’ve drifted away from shore. Rations are getting low--mutiny looms. And if that wasn’t bad enough, BF10 is <1 again. Discouraged but not defeated, DataPoint sails on.

Batch 5: Ntotal=359, p=.016, BF10=1.10. Heading back in the right direction again.

Batch 6: Ntotal=421, p=.015, BF10=1.17. Barely closer. Will we reach shore before we all die? We have to ration the food.

Batch 7: Ntotal=479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain would have been keelhauled. The taverns will be busy tonight.

Some lessons from this nautical exercise:

(1) More data=better.

(2) We have now successfully extended the verbal overshadowing effect, although we found a smaller effect than RRR1's 16%: 9% after 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is encouraging but not very strong. And who knows if it will hold up? Up to this point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk. Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature but that so few people are aware of them.


In this post, I have focused on the application of COAST and, for didactic purposes, largely ignored the fact that this study was a conceptual replication. More about this in the next post.


Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.
*Starring in the role of DataPoint is the Batavia, a replica of a 17th century Dutch East Indies ship, well worth a visit.
** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects, and the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we thought these were two different kinds of responses; separating false alarms from misses in this way precluded us from using one-sided tests. I computed Bayes factors using the BayesFactor package in R, using the contingencyTableBF function with sampleType = "indepMulti", fixedMargin = "rows", and priorConcentration = 1.
*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.


  1. I'm curious whether the COAST approach would be invalidated by stopping with N=500 if you hadn't surpassed the thresholds. That is, do the thresholds depend on continuing to test until you pass either the upper or lower one? If so, then this wouldn't be a particularly useful approach -- you'd have no way to know how many participants you might need to test, and you would be obligated to continue testing indefinitely. Presumably simulations could show whether or not stopping with a fixed N changes the overall false positive rate. For the sake of argument, let's assume that stopping at some fixed N doesn't invalidate the approach. Had you reached N=500 without surpassing one of the p cutoffs, what could you conclude?

    (Note: I tried posting a variant of this comment a few minutes ago, but the site seemed to lose it. Sorry if it shows up twice.)

    1. I agree that this is a potential problem. I believe the same is true about the Bayesian sequential analysis, though. The Bayes factor could in principle remain between 1/3 and 3 indefinitely (I think). So you need to also have a practical exit strategy (a maximum N). So if we'd reached N=500, we'd be forced to conclude that we failed to reject H0 and provide an effect size estimate, stating that this was accomplished with a sample about seven times the size of those in RRR1. We'd also report the Bayes factor, which would have probably suggested that there was more support for H1 than for H0 but not convincingly so.

    2. "The Bayes factor could in principle remain between 1/3 and 3 indefinitely (I think)"
      Nope: The BF converges to 0 or infinity with increasing n. So it is guaranteed to leave the inconclusive area. It can take long, but not indefinitely. In a forthcoming paper, we simulate such sequential BFs. The longest journeys happen at d=0, and there 95% of all studies reach a boundary of 3 (resp. 1/3) with n<120 (in a two-group t-test setting).
      But: We do not recommend a BF boundary of 3; rather at least 5.

      COAST sounds interesting, but users should be aware of two properties:

      - At the end you report p<.01; but you should emphasize very clearly that the actual level is p<.05. Furthermore, it is hard to compute the actual (exact) p-value, but it is definitely *not* .003. So the sequential p can be seductive for the casual reader.

      - Effect size estimates can be very biased in sequential designs, conditional on whether you have an early stop or a late stop. If you stop early, the ES is overestimated; if you stop late, it can be underestimated. How much bias occurs depends on the properties of the COAST procedure.
      Frick does not discuss this issue in his paper, but in the literature of clinical sequential designs this conditional bias is well known. These biased effect sizes do not invalidate the hypothesis test; but the empirical (sample) ES estimates should be treated with caution.

    3. (1) You're right. Let's correct this to "the Bayes factor could in principle remain between 1/5 and 5 until the researcher runs out of resources and/or patience." It is also true of course that p will eventually leave the inconclusive area. (2) You're also right about the actual level of p. This is not reported in the post because I'm just concerned with where we stop. (3) Yes, that's why it is recommended that for meta-analyses, the effect size from the initial sample is used.

    4. BTW: I directly dived into technical issues, but most important: Great post!

      ad 3) Thanks for the link. It seems to me that there's no real consensus yet about how to handle sequential designs in meta-analyses. These authors conclude that 'early stopping of clinical trials for apparent benefit is not a substantive source of bias in meta-analyses whereas exclusion of truncated studies from meta-analyses would introduce bias. Evidence synthesis should be based on results from all studies, both truncated and non-truncated' (p. 4873).

    5. Thanks! And interesting reference. This is clearly an issue that is going to deserve greater attention if sequential analyses become more common.

    6. P-value corrections and adjustments for effect sizes have been developed (see Lakens, 2014, or Proschan et al., 2006). Alternatively, meta-analytic adjustments for publication bias might work here (PET-PEESE). So sure, something to keep in mind, but not too problematic.

  2. "in a way that does not inflate the Type I error rate"

    It has to inflate the Type I error rate. The Type I error rate is 0.05 once you have finished your first batch. If you got significant results after the first batch, you would not use COAST. To that 0.05 you have to add the Type I error rate introduced by using COAST. It may be small, but it is necessarily bigger than 0. You cannot hold the Type I error rate at 0.05 using sequential testing unless you decide to use it in advance and adjust alpha for the first batch accordingly.

    1. "You cannot hold Type I at 0.05 using sequential testing unless you decide to use it in advance and adjust alpha for the first batch accordingly." This is true.

    2. Yes, so Frick adjusts the criterion value so that p < .05 remains valid.

    3. Sure, but you have to adjust the criterion value even for the first batch. That is, if you decide to use COAST only after getting nonsignificant results, the error rate will be inflated.

    4. Yes, you cannot decide beforehand that you're going to use COAST and then drop that plan when your initial sample has .01 < p < .05 and call it a day.

    5. Well, I am interested in the possibility that you seem to recommend in the post – if you don't get significant results, use COAST. That is what increases the Type I error rate. I mean, it may be better than the other options, but it should not be described as holding the Type I error rate at 0.05.

      The way to imagine this: Assume that the null is true. You run the first batch. You find a significant result with probability 0.05 and if you do, you publish the result as significant. If you do not find a significant result (with probability 0.95), you use COAST. COAST can then give you, with a certain probability (say 0.04 for the sake of the example), a significant result. If it does, you describe it as a significant result. The total Type I error rate is then ~0.09 (i.e., 0.05 + 0.95*0.04) and not 0.05.
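      The arithmetic above can be checked directly; a trivial sketch (the 0.04 is the commenter's illustrative figure, not an estimated quantity):

```python
alpha_first = 0.05  # chance of a significant first batch under H0
p_coast = 0.04      # illustrative chance that COAST later reaches p < .01
total = alpha_first + (1 - alpha_first) * p_coast
print(round(total, 3))  # 0.088, i.e., roughly 0.09
```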

    6. I see what you're saying. But it is not what I recommend. Rather, it is my attempt to engage the reader by appealing to a situation that is probably familiar to them. I agree that it would have been better to say under option 6: "You could have chosen to use COAST" or something like that.

    7. I have added footnote *** to address your point. Thanks.

  3. Great post. And Frick deserves more attention and citations.

    It is important to stress that Type I error probabilities increase with the number of tests that are being conducted. Moreover, biases in effect sizes will be greater if testing stops after each new data point is collected and p reaches the criterion value. To avoid these problems it makes sense to increase N in meaningful batches. A fixed N = 60 may seem reasonable, but sampling error decreases in a non-linear fashion, so it would be more efficient to increase sample sizes exponentially (50, 100, 200, 400). The limit of data collection can be determined with an a priori power analysis. Would you give up when the effect size is small (d = .2) and requires N = 800 participants (between-subjects design)?

    1. Interesting, intuitively I'd think that exponential increases would be less efficient. The idea to use the sample size based on a power analysis for a small effect as the final exit strategy makes a lot of sense.

  4. Hi Rolf,

    I liked this post. I'm curious after having thought about it a little bit.

    You say that the Bayes factors aren't really interesting to you or meaningful after batch 1 because surely things might change with the next additions of new data. 60 more minutes of play and all that. And on Twitter you said the Bayes factor would be meaningful once it stabilizes. But had the results come in more conveniently for you (read: evidence for the alternative in batch 1), wouldn't you have stopped the experiment and simply taken the Bayes factor as is? There would not have been 60 more minutes of play in that case, and you would not have thought your data were too few. So are they only uninteresting or not meaningful when they are not convenient for you?

    Perhaps I missed something subtle but that is my perception of your comment in the post and on twitter. Of course interest level is wholly subjective and personal, so I'm interested to hear your thoughts.

    -- Clearly this assumes you would have stopped the experiment after batch 1 if the results were different. But as you say, you did initially plan to stop after batch 1, so I don't think it is too strong of an assumption on my part.

    1. Glad you liked the post. I haven't seen many sequential Bayesian analyses yet but the ones I've seen, mostly from E.J., show a lot of fluctuation at the beginning (as you might expect) and then some form of stabilization. So this is what I was also expecting for our data. One of E.J.'s stopping criteria is when BF <.10 or >10. This makes a lot of sense to me. So if I'd had a BF of 13 after 148 subjects, yes I probably would have stopped. I get the sense that people are viewing my post as critical of BF but that is not at all how it was intended. I'm merely being critical of people who run an underpowered experiment (e.g., in a replication), fail to reject H0 and then simply tack on a BF claiming to have found evidence for H0 and leave it at that. E.J.'s analyses clearly show that this is a mistake and our data do as well.

    2. I didn't think it was overly critical of BFs, just an interesting thought process. I see what you mean now, and it makes sense.

      I agree with you about tacking on BFs to null results thoughtlessly. Model comparison only makes sense if the models are interesting!

      I think it is hard for many who were trained in traditional stats to get the idea of absolute evidence out of their head. There is no such thing as "evidence against H0, full stop". It's always relative!

    3. This is exactly what I wanted to convey.

  5. I'm glad Frick's work is getting the attention it deserves. I covered it in my book and have mentioned it in reviews of a couple of papers that cover similar ground but didn't seem aware of COAST and CLAST.

    Frick presents several simulations of the COAST strategy and it is surprisingly hard to break: it is possible to get Type I error rates a little above .05 using COAST, but under reasonable conditions it seems hard to get it to misbehave badly. It will also tend to be more efficient than using a fixed stopping rule.

    1. Thanks, Thom. I tweaked Ryne Sherman's phack program to simulate COAST. It kept alpha <.05 (<.04, actually), so I also find it quite robust.

  6. It seems to me that your question remains unanswered, it is just rephrased as: p=.003, what now? You could also rephrase it as: BF10=4.11, what now?

    If there is a consequence to your investigation, then it should be clear what to do next. (e.g., you run the follow-up study, you decide the effect is too weak to be the basis of a master's thesis, you change your theory, or you advise the justice system to discount eye-witness testimony.) You could have answered any of those questions after batch 1 with the data that was available then. If there was no answer after batch 1, then I cannot see how there can be an answer after batch 7.

    If there is no consequence to your investigation, then you simply report what you found and move on; perhaps with the hope that the data will help someone else in the future. Those future researchers might thank you for running the additional subjects, or they might feel it was overkill.

    1. These are fair points. I'm using the data to illustrate some points. As I say in the post, I haven't talked (yet) about the contents of the study. This is the level at which consequences come in. These consequences are mostly theoretical in nature. There are aspects to the conceptual replication that have likely led to a reduction of the effect. As I said above, this is the topic of a different post.

    2. I think these are really the same issues. If the experiment provides data to help you understand a theoretical issue, then I think you have to accept the answer that is given by the data you have. If the answer is insufficiently definitive, you should feel free to gather more data until the answer becomes sufficiently definitive (whatever your criterion). So, the answer after batch 1 might tentatively be that a model assuming no effect better accounts for the data than a model assuming an effect. It's a tentative conclusion because you realize that more data might change it and you might develop a superior model (e.g., some interaction effect) in the future.

      More or less the same attitude applies after batch 7. You should tentatively conclude that a model assuming an effect better accounts for the data than a model assuming no effect. It's a tentative conclusion because it is (still) the case that future data might change it and you might come up with a better model.

      Various criteria such as p<.05 or p<.01--for COAST, or BF10<1/3 or BF10>3, really do not enter into these conclusions at all. They are just rules of thumb that seem to be useful for some common situations. At the end of each day, you are still just drawing tentative conclusions with an appreciation that they might (hopefully will, as new models are developed) change in the future.

      When does it end? When practical constraints get in the way. If you can easily get large data sets (e.g., using MTurk or lots of funding), then maybe Ntotal=500 (or 5000) is doable. If you are constrained to Ntotal=50, then you give up on this line of research because you do not have the resources to investigate it (or you do a small study and convince others to do the same and then pool the data together).

      I think it is great that you discuss these issues on your blog.

    3. I think we are in agreement. The assessment after batch 1 was that we had inconclusive evidence. Our tentative conclusion after batch 7 is that we have evidence for VO. In a future post I want to address the conceptual replication aspect of this study. Bayesian statistics will be helpful here.

      I'm glad that you like the discussion. I'm learning a lot from this comments section.

  7. There is a crucial difference between p-values and Bayes factors in this context: if you keep running more subjects with a BF, you will eventually converge toward 'relative evidence for H0' or 'relative evidence for H1' (see Felix Schönbrodt's reply). However, if you keep running more subjects with NHST, at some point the result will be significant, unless your subjects are collectively fooling you by using a very good random number generator to generate their responses. So while the BF gives you a clear guideline on what to do, the p-value leaves a large part of the decision on whether to stop and claim some effect (or the lack thereof) to you. I admit that I am not familiar with COAST, but my instinctive questions are: a) does the COAST method solve this paradox? and b) how does this method justify claiming support for H0 on the basis of a p-value?

    1. I had the same instinctive questions you have. My answers would be: a) If your p-value is >.36, you're out (you're TOAST;)); this could happen early on in the proceedings. b) This method does not allow you to claim support for H0, just that you've failed to reject it.

  8. This is a nice post. I just ran another simulation of this method. The simulation is of a basic between-subjects t-test with a d of 0 (no effect). I generated 2 groups of 20 subjects then ran a t-test. If the p value was below .01 or above .36 the simulation would end, else, I'd get another 40 subjects (20 per cell) and run another t-test. This continued until one of the 2 thresholds was achieved (<.01 or >.36). I repeated this process 100,000 times.

    The mean n (per cell) was just under 49, which seems logistically reasonable. The max n (per cell) was 48,240, which means that this procedure would sometimes require us to run close to 100,000 participants. Logistically, that is not so reasonable.
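    This procedure can be sketched in Python as follows (the commenter used R; the batch size of 20 per cell and the .01/.36 thresholds come from the comment, while the cap of 10,000 per cell is my choice to keep the toy version fast):

```python
import numpy as np
from scipy import stats

def coast_ttest(rng, step=20, lo=0.01, hi=0.36, cap=10_000):
    """One COAST run for a two-group t-test with d = 0 (H0 true).
    Returns (n per cell at stopping, p-value at stopping)."""
    a = rng.standard_normal(step)
    b = rng.standard_normal(step)
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < lo or p > hi or len(a) >= cap:
            return len(a), p
        a = np.concatenate([a, rng.standard_normal(step)])
        b = np.concatenate([b, rng.standard_normal(step)])

rng = np.random.default_rng(1990)
runs = [coast_ttest(rng) for _ in range(2000)]
ns = [n for n, _ in runs]
print(f"mean n/cell: {np.mean(ns):.1f}, max n/cell: {max(ns)}")
```

    With enough repetitions, the mean n per cell should land in the neighborhood of the commenter's figure of just under 49, with a long right tail.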

    1. Interesting results. I agree that running 100,000 subjects is usually not feasible. What do you think of the solution proposed by replicationindex above?

    2. Well, it looks like replicationindex's concerns about alpha inflation turn out not to be problematic (in the situations that I simulated: a single between-subjects t-test). As for the biased effect size concern, I modified my script so that I could vary the actual difference between the groups I sampled from (i.e., so I could change the population effect size). I then ran your algorithm using a max N of 100 per condition. Here's a graph of the results:

      The black line is the observed d (the mean difference between groups for all 100,000 simulations) for each population d. The red dashed line is what the observed d would look like if it perfectly mirrored the actual d. So, it looks like there's some bias (underestimating with a small d and overestimating with a large d), but it's not dramatic (~.05). Here are the actual values:

      population_d = .3, observed_d = .251
      population_d = .4, observed_d = .374
      population_d = .5, observed_d = .502
      population_d = .6, observed_d = .630
      population_d = .7, observed_d = .747
      population_d = .8, observed_d = .857
      population_d = .9, observed_d = .959
      population_d = 1,  observed_d = 1.05

  9. Interesting. I just ran a few more simulations. It was the same procedure as in my earlier post, with one more stopping rule: if n (per cell) exceeded some maximum x, data collection would cease. Here are the FA rates for various values of x:

    max n = 50 per cell: FA rate = .0475
    max n = 100 per cell: FA rate = .0391
    max n = 200 per cell: FA rate = .0369
    max n = 500 per cell: FA rate = .0352

    This is surprisingly good news. We're appropriately controlling our FA rate even with only 50 people per cell, which is logistically doable.

    Here's a link to my R code:

  10. Great post! I wanted to use this but I wasn't sure what I needed as a lower bound to achieve enough power, so I ran a simulation. For a t-test for independent means, it seems that about 55% of the N needed for fixed-sample stopping is enough to achieve .80 power.
    My code is here:
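    As a point of reference for the 55% figure, the fixed-sample requirement itself can be estimated by simulation. A sketch for an illustrative d = 0.5 with alpha = .05 two-sided (both assumptions of mine, not necessarily the commenter's settings):

```python
import numpy as np
from scipy import stats

def fixed_power(n_per_cell, d, reps=4000, seed=11):
    """Simulated power of a fixed-N two-group t-test at alpha = .05."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        a = rng.standard_normal(n_per_cell) + d  # group with true effect d
        b = rng.standard_normal(n_per_cell)
        hits += stats.ttest_ind(a, b).pvalue < 0.05
    return hits / reps

print(fixed_power(64, 0.5))  # close to .80, the textbook fixed-sample requirement
print("55% of 64 is about", round(0.55 * 64), "per cell")
```

    If the 55% figure holds, a sequential design at d = 0.5 would on average need only about 35 subjects per cell to reach the same .80 power.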

