p=.20, what now? Adventures of the Good Ship DataPoint

This blog, Drang naar Samenhang, will feature posts in Dutch from now on, but no worries, English speakers, I’ve got you covered too. I have launched a Substack newsletter called Craving Coherence: https://rolfzwaan.substack.com.

You don’t need to subscribe to read the posts—just hit “No thanks” if prompted. Of course, I’d really appreciate it if you do sign up. It’s completely free!

So what is the newsletter about?

Why do we search for patterns, craft narratives, and cling to meaning? Craving Coherence explores the psychology of understanding—the mental shortcuts, biases, and frameworks that shape how we interpret reality. From cognitive science to philosophy, this newsletter examines how our minds construct coherence in an often chaotic world—and what happens when they fail.

I hope to see you there!

Back to the original post:

You’ve dutifully conducted a power analysis, defined your sample size, and conducted your experiment. Alas, p=.20. What now? Let’s find out.

The Good Ship DataPoint*

Perspectives on Psychological Science’s first registered replication project, RRR1, was targeted at verbal overshadowing, the phenomenon that describing a visual stimulus, in this case a human face, is detrimental to later recognition of this face compared to not describing the stimulus. A meta-analysis of 31 direct replications of the original finding provided evidence of verbal overshadowing. Subjects who described the suspect were 16% less likely to make a correct identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually replicate) the verbal overshadowing effect for her master’s thesis by using different stimuli and a different distractor task. I’m not going to talk about the contents of the research here. I simply want to address the question that’s posed in the title of this post: p=.20, what now? Because p=.20 is what we found after having run 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1's 16%.**

Option 1. The effect is not significant, so this conceptual replication “did not work,” let’s file drawer the sucker. This response is probably still very common but it contributes to publication bias.

Option 2. We consider this a pilot study and now perform a power analysis based on it and run a new (and much larger) batch of subjects. The old data are now meaningless for hypothesis testing. This is better than option 1 but is rather wasteful. Why throw away a perfectly good data set?

Option 3. Our method wasn’t sensitive enough. Let’s improve it and then run a new study. Probably a very common response. But it may be premature and is not guaranteed to lead to a more decisive result. And you’re still throwing away the old data (see option 1).

[Image: Liverpool FC, victorious in the 2005 Champions League final in Istanbul after overcoming a 3-0 deficit against AC Milan]

Option 4. The effect is not significant, but if we also report the Bayes factor, we can at least say something meaningful about the null hypothesis and maybe get it published. This seems to be becoming more common nowadays. It is not a bad idea as such, but it is likely to get misinterpreted as: H0 is true (even by the researchers themselves). The Bayes factor tells us something about the support for a hypothesis relative to some other hypothesis given the data such as they are. And what the data are here is: too few. We found BF10=.21, which translates to about 5 times more evidence for H0 than for H1 (1/.21 ≈ 4.8), but this is about as meaningful as the score in a soccer match after 30 minutes of play. Sure, H0 is ahead but H1 might well score a come-from-behind victory. There are after all 60 more minutes to play!

Option 5. The effect is not significant but we’ll keep on testing until it is. Simmons et al. have provided a memorable illustration of how problematic optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate the false positive rate from 5% to 20%. Still, the intuition that it would be useful to test more subjects is a good one. And that leads us to…
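To get a feel for why peeking is a problem, here is a minimal Python sketch of optional stopping when H0 is true. It is not Sherman's simulation: the two-group z-test with a known SD of 1, the batch size of 20 per group, and the cap of ten looks are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def optional_stopping_fpr(n_sims=2000, batch=20, max_looks=10, alpha=0.05, seed=1):
    """Estimate how often 'test after every batch, stop as soon as p < alpha'
    rejects H0 when H0 is true (both groups are drawn from N(0, 1))."""
    rng = random.Random(seed)
    false_positives = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0  # subjects per group so far
        for _look in range(max_looks):
            sum_a += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_b += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            # z-test for a difference in means, assuming a known SD of 1
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            if two_sided_p(z) < alpha:
                false_positives += 1
                break
    return false_positives / n_sims


print("one look: ", optional_stopping_fpr(max_looks=1))
print("ten looks:", optional_stopping_fpr(max_looks=10))
```

With a single look the estimated rate should sit near the nominal 5%; with repeated looks it climbs well above that. The exact figures depend on the number of looks and on the assumptions above.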

Option 6. The result is ambiguous, so let’s continue testing (in a way that does not inflate the Type I error rate) until we have decisive information or we've run out of resources. Researchers have proposed several methods of sequential testing that do preserve the nominal error rate. Eric-Jan Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian framework and Daniël Lakens has described sequential testing as it is performed in the medical sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998), which to date has been cited only 17 times in Google Scholar. I will report Bayes factors as well. The method described by Lakens could not be used in this case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly simple: if your p-value is >.01 and <.36, keep on testing until the p-value crosses one of these limits.*** If p drops below .01, you stop and reject H0; if it rises above .36, you stop without rejecting H0. Frick’s simulations show that this procedure keeps the overall alpha level under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We decided to add subjects in batches of 60 (barring exclusions) so as to not overshoot and yet make our additions substantive. If DataPoint failed to reach shore before we'd reached 500 subjects, we would abandon ship.
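The COAST stopping rule can be sketched in a few lines of Python, together with a rough check of its false positive rate when H0 is true. This is not Frick's simulation, nor our analysis: the z-test with a known SD of 1, the batches of 20 per group, and the cap of 500 per group (after which the run is abandoned without rejecting H0) are assumptions made for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def coast_decision(p, lower=0.01, upper=0.36):
    """COAST rule: reject below the lower limit, give up above the upper one."""
    if p < lower:
        return "reject H0"
    if p > upper:
        return "stop, retain H0"
    return "continue"


def coast_false_positive_rate(n_sims=2000, batch=20, max_n=500, seed=1):
    """Share of null-true experiments that end in 'reject H0' under COAST.
    Runs still undecided at max_n per group are abandoned (no rejection)."""
    rng = random.Random(seed)
    rejections = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0  # subjects per group so far
        while n < max_n:
            sum_a += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_b += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            # z-test for a difference in means, assuming a known SD of 1
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            decision = coast_decision(two_sided_p(z))
            if decision == "reject H0":
                rejections += 1
            if decision != "continue":
                break
    return rejections / n_sims


print("estimated false positive rate:", coast_false_positive_rate())
```

Most simulated experiments stop at the first look because p is already above .36, which is why the procedure stays cheap on average even though the cap is high.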

Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability

Batch 2: N_total=202, p=.047. People who use optional stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already peeked.) We’re using COAST, however, and although the good ship DataPoint is in the shallows of the Rectangular Sea of Probability, it has not reached the coast. And BF10=0.6, still leaning toward H0.

Batch 3: N_total = 258, p=.013, BF10=1.95. We’re getting encouraging reports from the crow’s nest. The DataPoint crew will likely not succumb to scurvy after all! And the BF10 now favors H1.

Batch 4: N_total=306, p=.058, BF10=.40. What’s this??? The wind has taken a treacherous turn and we’ve drifted away from shore. Rations are getting low and mutiny looms. And if that wasn’t bad enough, BF10 is <1 again. Discouraged but not defeated, DataPoint sails on.

Batch 5: N_total =359, p=.016, BF10=1.10. Heading back in the right direction again.

Batch 6: N_total=421, p=.015, BF10=1.17. Barely closer. Will we reach shore before we all die? We have to ration the food.

Batch 7: N_total =479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain would have been keelhauled. The taverns will be busy tonight.
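For readers who want to retrace the voyage, here is a small Python sketch that applies the COAST limits (.01 and .36) to the p-values reported in this post and restates each BF10 as "how many times more evidence, and for which hypothesis". The numbers are the ones given above; the function names are made up for this illustration.

```python
def coast_decision(p, lower=0.01, upper=0.36):
    """COAST rule: reject below the lower limit, give up above the upper one."""
    if p < lower:
        return "reject H0"
    if p > upper:
        return "stop, retain H0"
    return "continue"


def evidence_summary(bf10):
    """Return the favored hypothesis and the evidence ratio in its favor."""
    if bf10 >= 1:
        return ("H1", bf10)
    return ("H0", 1 / bf10)


# (N_total, p, BF10) at each look, as reported in the post
voyage = [
    (148, 0.20, 0.21),
    (202, 0.047, 0.6),
    (258, 0.013, 1.95),
    (306, 0.058, 0.40),
    (359, 0.016, 1.10),
    (421, 0.015, 1.17),
    (479, 0.003, 4.11),
]

decisions = [coast_decision(p) for _n, p, _bf in voyage]

for (n_total, p, bf10), decision in zip(voyage, decisions):
    favored, ratio = evidence_summary(bf10)
    print(f"N={n_total}: p={p:.3f} -> {decision}; BF favors {favored} by {ratio:.2f}x")
```

Note that a plain p<.05 criterion would have stopped the ship at the second look, whereas COAST keeps it at sea until the seventh.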

Some lessons from this nautical exercise:

(1) More data=better.

(2) We now have successfully extended the verbal overshadowing effect, although we found a smaller effect, 9% after 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is encouraging but not very strong. And who knows if it will hold up? Up to this point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk. Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature but that so few people are aware of them.

Coda

In this post, I have focused on the application of COAST and largely ignored, for didactical purposes, that this study was a conceptual replication. More about this in the next post.

Footnotes

Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.
*Starring in the role of DataPoint is the Batavia, a replica of a 17th century Dutch East Indies ship, well worth a visit.
** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects and the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we thought these were two different kinds of responses. Separating false alarms from misses in this way precluded us, however, from using one-sided tests. I computed Bayes factors using the BayesFactor package in R, with the contingencyTableBF function and sampleType = "indepMulti", fixedMargin = "rows", priorConcentration = 1.
*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.
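As an illustration of the chi-square test mentioned in the second footnote, here is a minimal Python sketch for a table with two conditions and three response types. The 2x3 layout is my reading of the footnote, and the counts are HYPOTHETICAL, invented only to make the example run; they are not the study's data. With 2 degrees of freedom the p-value has the closed form exp(-chi2/2), so no statistics library is needed.

```python
import math


def chi_square_2x3(table):
    """Pearson chi-square statistic and p-value for a 2x3 table of counts.
    With (2-1)*(3-1) = 2 degrees of freedom, the upper tail is exp(-chi2/2).
    Assumes every row and column total is greater than zero."""
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2x3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.exp(-chi2 / 2)


# HYPOTHETICAL counts, not the study's data.
# Rows: description condition, control condition.
# Columns: correct identification, false identification, "not in lineup".
hypothetical = [[30, 20, 24], [40, 15, 19]]

chi2, p = chi_square_2x3(hypothetical)
print(f"chi2(2) = {chi2:.2f}, p = {p:.3f}")
```

Because the test looks at all three response types at once, it has no direction, which is one way to see why a one-sided test is not available here.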

Comments

Daniel J. Simons, 7 May 2015 at 16:46
I'm curious whether the COAST approach would be invalidated by stopping with N=500 if you hadn't surpassed the thresholds. That is, do the thresholds depend on continuing to test until you pass either the upper or lower one. If so, then this wouldn't be a particularly useful approach -- you'd have no way to know how many participants you might need to test, and you would be obligated to continue testing indefinitely. Presumably simulations could show whether or not stopping with a fixed N changes the overall false positive rate. For the sake of argument, let's assume that stopping at some fixed N doesn't invalidate the approach. Had you reached N=500 without surpassing one of the p cutoffs, what could you conclude?

(Note: I tried posting a variant of this comment a few minutes ago, but the site seemed to lose it. Sorry if it shows up twice.)
Unknown, 7 May 2015 at 18:08
"in a way that does not inflate the Type I error rate"

It has to inflate the Type I error rate. The Type I error rate is 0.05 once you finished your first batch. If you got significant results after the first batch you would not use COAST. Above the 0.05, you have to add the Type I error rate added by using COAST. It may be small, but it is necessarily bigger than 0. You cannot hold Type I at 0.05 using sequential testing unless you decide to use it in advance and adjust alpha for the first batch accordingly.
Anonymous, 7 May 2015 at 19:49
Great post. And Frick deserves more attention and citations.

It is important to stress that type-I error probabilities (p-values) increase with the number of tests that are being conducted. Moreover, biases in effect sizes will be greater if testing stops after each new data point is collected and p reaches criterion value. To avoid these problems it makes sense to increase the sample in meaningful batches. A fixed N = 60 may seem reasonable, but sampling error decreases in a non-linear fashion. So it would be more efficient to increase sample sizes in an exponential function (50, 100, 200, 400). The limit of data collection can be determined with a priori power analysis. Would you give up when the effect size is small (d = .2) and needs N = 800 participants (between-subject design)?
Anonymous, 8 May 2015 at 09:27
Hi Rolf,

I liked this post. I'm curious after having thought about it a little bit.

You say that the Bayes factors aren't really interesting to you or meaningful after batch1 because surely things might change with the next additions of new data. 60 more minutes of play and all that. And on twitter you said the bayes factor would be meaningful once it stabilizes. But had the results come in more conveniently for you (read: ev+ for the alternative in batch1), wouldn't you have stopped the experiment and simply taken the bayes factor as is? There would not have been 60 more minutes of play in that case, and you would not have thought your data were too few. So are they only uninteresting or not meaningful when they are not convenient for you?

Perhaps I missed something subtle but that is my perception of your comment in the post and on twitter. Of course interest level is wholly subjective and personal, so I'm interested to hear your thoughts.

-- Clearly this assumes you would have stopped the experiment after batch1 if the results were different. But as you say, you did plan to initially stop after batch1 so I don't think it is too strong of an assumption on my part.
thom, 8 May 2015 at 09:48
I'm glad Frick's work is getting the attention it deserves. I covered it in my book and have mentioned it in reviews of a couple of papers that cover similar ground but didn't seem aware of COAST and CLAST.

Frick presents several simulations of the COAST strategy and it is surprisingly hard to break - so it is easy to get Type I error rates a little above .05 using COAST, but under reasonable conditions it seems hard to get COAST to misbehave badly. It will also tend to be more efficient than using a fixed stopping rule.
Greg Francis, 8 May 2015 at 11:27
It seems to me that your question remains unanswered, it is just rephrased as: p=.003, what now? You could also rephrase it as: BF10=4.11, what now?

If there is a consequence to your investigation, then it should be clear what to do next. (e.g., you run the follow-up study, you decide the effect is too weak to be the basis of a master's thesis, you change your theory, or you advise the justice system to discount eye-witness testimony.) You could have answered any of those questions after batch 1 with the data that was available then. If there was no answer after batch 1, then I cannot see how there can be an answer after batch 7.

If there is no consequence to your investigation, then you simply report what you found and move on; perhaps with the hope that the data will help someone else in the future. Those future researchers might thank you for running the additional subjects, or they might feel it was overkill.
Unknown, 8 May 2015 at 15:41
There is a crucial difference between p-values and Bayes Factors in this context: if you keep running more subjects with BF, you are either going to converge towards 'relative evidence for H0' or to 'relative evidence for H1' (see Felix Schönbrodt's reply). However, if you keep running more subjects with NHST, at some point it will be significant, unless your subjects are collectively fooling you by using a very good random generator to generate their responses. So while the BF gives you a clear guideline on what to do, the p-value leaves a large part of the decision on whether to stop and claim some effect (or the lack thereof) to you. I admit that I am not familiar with COAST, but my instinctive questions are: a) does the COAST method solve this paradox? and b) how does this method justify claiming support for H0 on the basis of a p-value?
Unknown, 8 May 2015 at 18:11
This is a nice post. I just ran another simulation of this method. The simulation is of a basic between-subjects t-test with a d of 0 (no effect). I generated 2 groups of 20 subjects then ran a t-test. If the p value was below .01 or above .36 the simulation would end, else, I'd get another 40 subjects (20 per cell) and run another t-test. This continued until one of the 2 thresholds was achieved (<.01 or >.36). I repeated this process 100,000 times.

The mean n (per cell) was just under 49, which seems logistically reasonable. The max n (per cell) was 48,240. Which means that this procedure would require us, sometimes, to run 100,000 participants. Logistically, this is not so reasonable.
Unknown, 8 May 2015 at 18:36
Interesting. I just ran a few more simulations. It was the same procedure as in my earlier post, with one more stopping rule: if max n (per cell) was greater than x, the data collection would cease. Here are the FA rates for various values of X:

max n = 50 per cell: FA rate = .0475
max n = 100 per cell: FA rate = .0391
max n = 200 per cell: FA rate = .0369
max n = 500 per cell: FA rate = .0352

This is surprisingly good news. We're appropriately controlling our FA rate even with only 50 people per cell, which is logistically doable.

Here's a link to my R code: https://www.dropbox.com/s/u4b63mrqnuqj7i4/Frick.R?dl=0

Sam, 26 December 2016 at 19:01
Great post! I wanted to use this but I wasn't sure what I needed as lower bound to achieve enough power, so I did a simulation. For a t-test for independent means, it seems that about 55% of the N needed for fixed sample stopping is enough to achieve .80 power.
My code is here: https://github.com/samuelfranssens/frick1998/blob/master/frick.R