How we did not fool ourselves: Reflections on adopting a flexible sequential testing method

Five years ago, I wrote a post about COAST, a method of sequential testing proposed by Frick (1998). Time for a follow-up. In this guest post, my student Yiyun Liao, with the help of my colleague Steven Verheyen, writes about her experience using this method. She concludes with some tips about using COAST and about the importance of performing simulations and preregistering your research.

Yiyun Liao
Department of Psychology, Education and Child Studies
Erasmus University Rotterdam, Netherlands
liao@essb.eur.nl

Recently I, together with Drs. Katinka Dijkstra and Rolf Zwaan, conducted a study on two English prepositions: to and towards. Instead of following a conventional fixed-sample testing method, we adopted a flexible sequential testing method based on Frick’s COAST method (Frick, 1998). A very interesting case occurred after we finished our data analysis.

The Study

The study was intended to replicate what we had found in a previous study on two Dutch prepositions: naar (‘to’) and richting (‘direction/towards’). In that study, we found that both the actor’s goal (Intentionality) and the social status of the interlocutor (Context) affect the use of naar and richting in an event-description task.

Specifically, when there was a clear inference that the actor’s goal was going to the reference object in the described situation (e.g., a person carrying a trash bag and a trash bin being in the near distance), naar was used more often, compared to when there was no such inference (e.g., a person walking with nothing in hand and a trash bin being in the near distance). Moreover, richting was used more often when participants were told the interlocutor was a police officer, rather than a friend of the speaker.

We aimed to replicate the above two patterns in English by doing the same study on the two English directional prepositions to and towards. We predicted the same main effects of Intentionality and Context on the use of the two English directional prepositions.

Data collection

This study adopted Frick’s COAST method to conduct sequential analyses, as the Dutch study had done.

“In the sequential stopping rule I am proposing, the researcher can perform a statistical test at any time. If the outcome of this statistical test is p < .01, the researcher stops testing subjects and rejects the null hypothesis; if p > .36, the researcher stops testing subjects and does not reject the null hypothesis; and if .01<p<.36, more subjects are tested.”

Frick (1998, p. 691)

According to Monte Carlo simulations performed by Frick (1998), it is possible to preserve an overall alpha level in a sequential analysis provided one is committed to the above two termination criteria. There is no strict rule about the minimum number of participants a researcher should test based on this method. However, after having determined a minimum sample size, the researcher should be willing to stop testing more participants when a p value above .36 is found.

As in the Dutch study, we decided to test 160 participants as our first data batch (the minimum number of participants we planned to test). If p > .36 or p < .01 for each main effect we were testing (Intentionality and Context), we would stop testing. If p was within these boundaries for either of the two predicted main effects, we would test another 160 participants. Considering the experimental costs (i.e., money and time), we decided to stop at N = 480 regardless of what the p values were (Lakens, 2014).
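The plan above can be sketched as a small decision procedure. This is only an illustration in Python, not the analysis code used in the study: the function names `coast_decision` and `next_step` are hypothetical, the p values in the usage lines are invented, and treating a p value that falls exactly on a boundary as "continue" is an assumption.

```python
LOWER, UPPER = 0.01, 0.36      # COAST boundaries (Frick, 1998)
BATCH_SIZE, MAX_N = 160, 480   # batch size and cap from the plan above


def coast_decision(p):
    """Classify a single p value under the COAST stopping rule."""
    if p < LOWER:
        return "reject H0"
    if p > UPPER:
        return "do not reject H0"
    return "continue"  # assumption: boundary values count as undecided


def next_step(p_values, n):
    """Decide whether to stop or to test another batch.

    p_values maps each predicted effect to its current p value;
    n is the number of participants tested so far.
    """
    decisions = {effect: coast_decision(p) for effect, p in p_values.items()}
    undecided = [e for e, d in decisions.items() if d == "continue"]
    if n >= MAX_N or not undecided:
        return "stop", decisions
    return f"test {BATCH_SIZE} more participants", decisions


# Hypothetical p values, for illustration only.
print(next_step({"Intentionality": 0.20, "Context": 0.005}, n=160))
print(next_step({"Intentionality": 0.50, "Context": 0.005}, n=160))
print(next_step({"Intentionality": 0.20, "Context": 0.005}, n=480))
```

The first call keeps collecting because one effect is still undecided, the second stops because both p values are outside the boundaries, and the third stops only because the cap of 480 participants has been reached.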

It is important to note that we had pre-registered this data collection plan, together with our materials, design, hypotheses, exclusion criteria, and analyses on the Open Science Framework (see details at: https://osf.io/7c5zh/?view_only=54cdbbb89cfb4f58a952edf8bd7331ab).

Data analysis

This is where the interesting case was discovered!

Based on the stopping rule and our pre-registration, we collected data in three rounds and thus obtained three data batches. Figure 1 presents the obtained p values for each factor (Intentionality and Context) at each data batch.

First data batch. As in the Dutch study, we performed a logistic regression analysis on our first data batch. We found a highly significant effect of Intentionality (estimate = -0.995, SE = 0.34, z = -2.919, p = .004), whereas the p value found for Context was within the boundary of .01 to .36 (estimate = 0.676, SE = 0.34, z = 1.989, p = .047). Under regular circumstances, we would have claimed that we found evidence for both factors, given that the p values for both factors were found to be lower than .05! Arguably, this would have made our study easier to publish.

However, based on our stopping rule (p<.01) and pre-registration, we could not do this. Therefore, we collected data from another 160 participants. Together with the previous 160 participants, this resulted in a second data batch that consisted of 320 participants.

Second data batch. We performed the same analysis on the second data batch. This time, the p values for both factors were within the set boundary (Intentionality estimate = -0.482, SE = 0.23, z = -2.071, p = .038; Context estimate = 0.534, SE = 0.23, z = 2.296, p = .022). We noticed that the effect of Intentionality started to wane (from p = .004 to p = .038).

Although the p values for both factors were below .05, we still could not stop and claim evidence for both factors at this point. A second chance of claiming significant effects slipped away. We then collected another 160 participants and reached the maximum number of 480 participants we intended to include.

Third data batch. The same analysis was conducted on the third data batch (480 participants). The effect of Context was significant by our criterion of p < .01 (estimate = 0.673, SE = 0.19, z = 3.538, p < .001). This corresponds to what was predicted based on the Dutch study. However, the predicted effect of Intentionality was no longer significant (estimate = -0.323, SE = 0.19, z = -1.700, p = .089).
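The p values reported for the three data batches can be recovered from the z statistics, assuming these are two-sided Wald tests against the standard normal distribution (the usual output of a logistic regression, but an assumption here). The helper name `two_sided_p` is hypothetical. Note that dividing the rounded estimates by the rounded standard errors gives slightly different z values than the reported ones, so the sketch starts from the reported z.

```python
import math


def two_sided_p(z):
    """Two-sided p value for a standard-normal (Wald) z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# z values as reported for the three data batches.
reported_z = {
    ("batch 1", "Intentionality"): -2.919,
    ("batch 1", "Context"): 1.989,
    ("batch 2", "Intentionality"): -2.071,
    ("batch 2", "Context"): 2.296,
    ("batch 3", "Intentionality"): -1.700,
    ("batch 3", "Context"): 3.538,
}

for (batch, factor), z in reported_z.items():
    p = two_sided_p(z)
    print(f"{batch}  {factor:<15} z = {z:+.3f}  p = {p:.4f}")
```

Only the Intentionality effect in batch 1 and the Context effect in batch 3 fall below the .01 boundary; every other p value lies inside the .01 to .36 region.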

To conclude, we replicated the effect of Context in the English study. However, we could not replicate the effect of Intentionality. What appeared to be an easy sell on the first data batch (p = .004) turned out to be an unexpected disappointment. In order to find out why, we decided to dig deeper to explore what might be going on. Had the COAST method led us astray, robbing us of our predicted findings? We conducted a simulation study based on the collected data to find out.

A simulation study

We permuted the participants within conditions, simulating what would have happened had we recruited the participants in a different order and applied the COAST method. We repeated this procedure 1000 times. The aim was to make sure that we had not missed a chance of finding an effect of Intentionality (i.e., that the COAST method did not “rob” us of an effect). Specifically, we wanted to know in what percentage of the 1000 simulations the outcome mimicked our final results (i.e., a nonsignificant Intentionality effect and a significant Context effect), and in what percentage we would have reached a different decision (i.e., two significant effects, no significant effects, or a significant Intentionality effect but a nonsignificant Context effect).
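The permutation procedure can be sketched roughly as follows. This is a simplified Python illustration under several loud assumptions, not the simulation code from the study: the data set is synthetic with invented response probabilities, a two-proportion z-test stands in for the logistic regression, the design is assumed to be a balanced 2 x 2 with 120 participants per cell, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

LOWER, UPPER = 0.01, 0.36  # COAST boundaries (Frick, 1998)
FACTORS = ("intentionality", "context")


def factor_p(rows, factor):
    """Two-sided two-proportion z-test of the response across one 0/1 factor.

    A simple stand-in for the logistic regression used in the study.
    """
    a = [r["response"] for r in rows if r[factor] == 0]
    b = [r["response"] for r in rows if r[factor] == 1]
    pooled = (sum(a) + sum(b)) / (len(a) + len(b))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(a) + 1 / len(b)))
    if se == 0:
        return 1.0  # no variation in the responses
    z = (sum(b) / len(b) - sum(a) / len(a)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def label(p):
    return "sig" if p < LOWER else "null" if p > UPPER else "undecided"


def simulate_once(cells, rng, n_batches=3):
    """Reorder participants within each condition, then apply COAST per batch."""
    shuffled = {cond: rng.sample(rows, len(rows)) for cond, rows in cells.items()}
    for k in range(1, n_batches + 1):
        batch = [r for rows in shuffled.values()
                 for r in rows[: len(rows) * k // n_batches]]
        labels = tuple(label(factor_p(batch, f)) for f in FACTORS)
        if "undecided" not in labels or k == n_batches:
            return k, labels


def make_fake_data(rng, per_cell=120):
    """Synthetic 2 x 2 data set (480 rows); the probabilities are invented."""
    cells = {}
    for i in (0, 1):
        for c in (0, 1):
            prob = 0.5 - 0.05 * i + 0.15 * c
            cells[(i, c)] = [
                {"intentionality": i, "context": c,
                 "response": int(rng.random() < prob)}
                for _ in range(per_cell)
            ]
    return cells


rng = random.Random(2024)
cells = make_fake_data(rng)
results = [simulate_once(cells, rng) for _ in range(1000)]
tally = Counter(labels for _, labels in results)
for labels, count in sorted(tally.items()):
    print(labels, f"{count / 10:.1f}%")
```

Each simulated run returns the batch at which data collection stopped and a label per factor, so the tally has the same shape as a table of outcome percentages. Because the data are synthetic, the printed percentages say nothing about the actual study.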

Table 1. Percentage of each possible result among the 1000 simulations

Result (p values for Intentionality and Context)	Percentage	Mimicking the final results?
Double significance (Intentionality < .01 & Context < .01)	1.8%	no
Zero significance (Intentionality > .36 & Context > .36)	2.9%	no
One effect (Intentionality < .01 & Context > .36)	0.1%	no
One effect (Intentionality > .36 & Context < .01)	21.5%	yes
One effect (.01 < Intentionality < .36 & Context < .01)	73.7%	yes

We found that 95.2% of the 1000 simulations mimicked our final results, that is, a nonsignificant Intentionality effect and a significant Context effect (Intentionality > .36 & Context < .01, or .01 < Intentionality < .36 & Context < .01). In the remaining 4.8% of cases, we would have reached a different decision. Table 1 shows the percentage of each possible result among the 1000 simulations. Note that for the first four results in Table 1 (26.3% of the simulations), data collection would have stopped at data batch 1 or data batch 2, because the p values for both factors fell outside the continuation region (.01 < p < .36). In the other 73.7% of the simulations, data collection would have continued until the final data batch (N = 480).

Reflections

Had we collected data in the traditional fixed-sample way and adopted the conventional criterion of p < .05, we could have claimed that we had found significant main effects of both Intentionality and Context after our first data batch (N = 160), and we would have reached the conclusion that the effect of Intentionality is larger than that of Context.

If we look at our simulations, however, 98.1% of the time, the effect of Intentionality was nonsignificant, while 97% of the time, the effect of Context was significant. What we found in our first data batch is actually very unusual.
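The aggregate percentages quoted for the simulations follow directly from Table 1, as the short Python check below shows. The counts are those implied by the table's percentages out of 1000 simulations, and the labels "sig", "null", and "undecided" are shorthand introduced here for p < .01, p > .36, and .01 < p < .36; an undecided effect at the final batch is counted as nonsignificant.

```python
# Simulation counts out of 1000, implied by the percentages in Table 1.
# Keys: (Intentionality outcome, Context outcome).
counts = {
    ("sig", "sig"): 18,          # 1.8%
    ("null", "null"): 29,        # 2.9%
    ("sig", "null"): 1,          # 0.1%
    ("null", "sig"): 215,        # 21.5%
    ("undecided", "sig"): 737,   # 73.7%
}
total = sum(counts.values())


def pct(n):
    """Convert a count into a percentage of all simulations."""
    return 100 * n / total


intent_not_sig = sum(n for (i, _), n in counts.items() if i != "sig")
context_sig = sum(n for (_, c), n in counts.items() if c == "sig")
matches_final = counts[("null", "sig")] + counts[("undecided", "sig")]
stopped_early = total - counts[("undecided", "sig")]

print(f"Intentionality nonsignificant: {pct(intent_not_sig):.1f}%")
print(f"Context significant:           {pct(context_sig):.1f}%")
print(f"Mimicking the final results:   {pct(matches_final):.1f}%")
print(f"Stopped at batch 1 or 2:       {pct(stopped_early):.1f}%")
```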

Had we not pre-registered our data collection method, it would have been tempting to stop at the first data batch. We could have easily fooled ourselves and others by claiming to have found the same effects as in the Dutch study. We will conduct further experiments to find out why the effect of Intentionality did not appear in the English study.

Tips

1. Our study suggests that Frick’s COAST method is a solid approach to sequential analysis. Researchers should consider using it, particularly when no educated estimate of the effect size is available for determining the sample size with a power analysis.

2. Simulation is a very useful method if you do not understand your results or the results were not expected.

3. Most importantly, pre-register your research design, data collection method, and data analysis, and stick to the plan. By doing so, we can largely avoid questionable research practices, such as post-hoc analyses and p-hacking.

Acknowledgement

I would like to thank my colleague, Dr. Steven Verheyen, for his help with the simulation study and the revision of the draft of this blog post.

References

Frick, R. W. (1998). A better stopping rule for conventional statistical tests. Behavior Research Methods, Instruments, & Computers, 30(4), 690–697.

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710.
