Monday, July 20, 2020

How we did not fool ourselves: Reflections on adopting a flexible sequential testing method



Five years ago, I wrote a post about COAST, a method of sequential testing proposed by Frick (1998). Time for a follow-up. In this guest post, my student Yiyun Liao, with the help of my colleague Steven Verheyen, writes about her experience using this method. She concludes with some tips about using COAST and about the importance of performing simulations and preregistering your research.






    Yiyun Liao

    Department of Psychology, Education and Child Studies

    Erasmus University Rotterdam, Netherlands

    liao@essb.eur.nl

 


Recently I, together with Drs. Katinka Dijkstra and Rolf Zwaan, conducted a study on two English prepositions: to and towards. Instead of following a conventional fixed-sample testing method, we adopted a flexible sequential testing method based on Frick’s COAST method (Frick, 1998). A very interesting case occurred after we finished our data analysis.

 

The Study

The study was intended to replicate what we had found in a previous study on two Dutch prepositions: naar (‘to’) and richting (‘direction/towards’). We found that both the actor’s goal (Intentionality) and the social status of the interlocutor (Context) affect the use of naar and richting in an event-description task.

 

Specifically, when there was a clear inference that the actor’s goal was going to the reference object in the described situation (e.g., a person carrying a trash bag and a trash bin being in the near distance), naar was used more often, compared to when there was no such inference (e.g., a person walking with nothing in hand and a trash bin being in the near distance). Moreover, richting was used more often when participants were told the interlocutor was a police officer, rather than a friend of the speaker.

 

We aimed to replicate the above two patterns in English by doing the same study on the two English directional prepositions to and towards. We predicted the same main effects of Intentionality and Context on the use of the two English directional prepositions.

 

Data collection

This study adopted Frick’s COAST method to conduct sequential analyses, as the Dutch study had done.

 

“In the sequential stopping rule I am proposing, the researcher can perform a statistical test at any time. If the outcome of this statistical test is p < .01, the researcher stops testing subjects and rejects the null hypothesis; if p > .36, the researcher stops testing subjects and does not reject the null hypothesis; and if .01<p<.36, more subjects are tested.”

Frick (1998, p. 691)

 

According to Monte Carlo simulations performed by Frick (1998), it is possible to preserve the overall alpha level in a sequential analysis, provided one commits to the above two termination criteria. There is no strict rule about the minimum number of participants a researcher should test with this method. However, after having determined a minimum sample size, the researcher should be willing to stop testing as soon as a p value above .36 is found.
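Frick’s claim can be checked with a small simulation of one’s own. Below is a minimal sketch (our illustration, not Frick’s code) that runs many experiments under the null hypothesis with a simple two-group t test, applies the .01/.36 stopping rule in batches, and reports the empirical Type I error rate. The batch size, cap, and choice of test are assumptions for illustration; hitting the cap without p < .01 is counted here, conservatively, as retaining the null.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)

def coast_under_null(batch=160, max_n=480):
    """One simulated experiment under H0 (no group difference).
    Add `batch` participants per round; stop and reject if p < .01,
    stop and retain H0 if p > .36, otherwise keep collecting.
    Hitting the cap without p < .01 counts as retaining H0 here."""
    a = rng.normal(size=batch // 2)
    b = rng.normal(size=batch // 2)
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < .01:
            return True                      # false positive under H0
        if p > .36 or a.size * 2 >= max_n:
            return False
        a = np.concatenate([a, rng.normal(size=batch // 2)])
        b = np.concatenate([b, rng.normal(size=batch // 2)])

trials = [coast_under_null() for _ in range(2000)]
rate = np.mean(trials)
print(f"empirical Type I error rate: {rate:.3f}")  # should stay near .05 or below
```

With these boundaries the procedure peeks at the data up to three times, yet the long-run rejection rate under the null stays controlled, which is exactly what makes the rule safe to use.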

 

As in the Dutch study, we determined to test 160 participants as our first data batch (the minimum number of participants we planned to test). If p > .36 or p < .01 for both main effects we were testing (Intentionality and Context), we would stop testing. If p fell within these boundaries for either of the two predicted main effects, we would collect another 160 participants. Considering the experimental costs (i.e., the money and time), we decided to stop at N = 480 regardless of what the p values were (Lakens, 2014).
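This batch plan amounts to a simple decision rule over the two p values. The sketch below (our own illustration; the function name is hypothetical) encodes it: stop only when both factors have left the gray zone (.01, .36), or when the 480-participant cap is reached.

```python
def coast_decision(p_intentionality, p_context, n, max_n=480):
    """Pre-registered stopping rule: continue while either factor's
    p value sits in the continuation region .01 < p < .36, unless
    the participant cap has been reached."""
    def resolved(p):
        return p < .01 or p > .36
    if n >= max_n or (resolved(p_intentionality) and resolved(p_context)):
        return "stop"
    return "continue"

# Batch 1 of the study: Intentionality (p = .004) resolves, but
# Context (p = .047) is still in the gray zone, so testing continues.
print(coast_decision(.004, .047, 160))   # -> continue
print(coast_decision(.089, .0004, 480))  # -> stop (cap reached)
```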

 

It is important to note that we had pre-registered this data collection plan, together with our materials, design, hypotheses, exclusion criteria, and analyses on the Open Science Framework (see details at: https://osf.io/7c5zh/?view_only=54cdbbb89cfb4f58a952edf8bd7331ab).

 

Data analysis

This is where the interesting case was discovered!

 


Based on the stopping rule and our pre-registration, we collected data in three rounds and thus obtained three data batches. Figure 1 presents the p values obtained for each factor (Intentionality and Context) at each data batch.

 

First data batch. As in the Dutch study, we performed a logistic regression analysis on our first data batch. We found a highly significant effect of Intentionality (estimate = -0.995, SE = 0.34, z = -2.919, p = .004), whereas the p value found for Context was within the boundary of .01 to .36 (estimate = 0.676, SE = 0.34, z = 1.989, p = .047). Under regular circumstances, we would have claimed that we found evidence for both factors, given that the p values for both factors were found to be lower than .05! Arguably, this would have made our study easier to publish.

 

However, based on our stopping rule (p<.01) and pre-registration, we could not do this. Therefore, we collected data from another 160 participants. Together with the previous 160 participants, this resulted in a second data batch that consisted of 320 participants.

 

Second data batch. We performed the same analysis on the second data batch. This time, the p values for both factors were within the set boundary (Intentionality estimate = -0.482, SE = 0.23, z = -2.071, p = .038; Context estimate = 0.534, SE = 0.23, z = 2.296, p = .022). We noticed that the effect of Intentionality started to wane (from p = .004 to p = .038).

 

Although the p values for both factors were below .05, we still could not stop and claim evidence for both factors at this point. A second chance of claiming significant effects slipped away. We then collected another 160 participants and reached the maximum number of 480 participants we intended to include.

 

Third data batch. The same analysis was conducted on the third data batch (480 participants). The effect of Context was found to be significant (p value was below 0.01: estimate = 0.673, SE = 0.19, z = 3.538, p < .001). This corresponds to what was predicted based on the Dutch study. However, the predicted effect of Intentionality totally disappeared (estimate = -0.323, SE = 0.19, z = -1.700, p =.089).

 

To conclude, we replicated the effect of Context in the English study. However, we could not replicate the effect of Intentionality. What appeared to be an easy sell in the first data batch (p = .004) turned out to be an unexpected disappointment. To find out why, we decided to dig deeper into what might be going on. Had the COAST method led us astray, robbing us of our predicted findings? We conducted a simulation study based on the collected data to find out.

 

A simulation study

We permuted the participants within conditions, simulating what would have happened had we recruited the participants in a different order and applied the COAST method. We repeated this procedure 1000 times. The purpose of this simulation was to make sure that we had not missed a chance of finding an effect of Intentionality (i.e., that the COAST method had not “robbed” us of an effect). Specifically, we wanted to know the percentage of simulations that mimicked our final results (i.e., a nonsignificant Intentionality effect and a significant Context effect) and the percentage in which we would have made a different decision (i.e., two effects, zero effects, or a significant Intentionality effect but a nonsignificant Context effect).
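The permutation logic can be sketched as follows. Since the real data and the logistic regression model are not reproduced here, this sketch uses mock binary responses and a chi-square test as a stand-in; the response probabilities, sample sizes, and batch size are all assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Mock stand-in for the real data: one binary response (e.g., a given
# preposition chosen or not) per participant in each of two conditions.
cond_a = rng.binomial(1, 0.55, size=240)
cond_b = rng.binomial(1, 0.40, size=240)

def replay_coast(order_a, order_b, batch=80, low=.01, high=.36):
    """Replay the COAST procedure on one permuted recruitment order,
    returning the p value at the point where testing would have stopped."""
    for n in range(batch, len(order_a) + 1, batch):
        ya, yb = order_a[:n].sum(), order_b[:n].sum()
        p = stats.chi2_contingency([[ya, n - ya], [yb, n - yb]])[1]
        if p < low or p > high:
            break
    return p

# Permute the recruitment order within conditions many times and record
# the stopping-point p value of each replayed experiment.
final_ps = [replay_coast(rng.permutation(cond_a), rng.permutation(cond_b))
            for _ in range(1000)]
sig = np.mean([p < .01 for p in final_ps])
print(f"proportion of replays ending with p < .01: {sig:.2f}")
```

Shuffling within a condition leaves the full data set unchanged; only the interim batches differ, which is exactly what lets the replay ask how often a different recruitment order would have led to a different stopping decision.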

 

Table 1. The percentage of each possible result among the 1000 simulations

    Result                                                       Percentage   Mimics final results?
    Double significance (Intentionality < .01 & Context < .01)       1.8%     no
    Zero significance (Intentionality > .36 & Context > .36)         2.9%     no
    One effect (Intentionality < .01 & Context > .36)                0.1%     no
    One effect (Intentionality > .36 & Context < .01)               21.5%     yes
    One effect (.01 < Intentionality < .36 & Context < .01)         73.7%     yes

 


We found that among the 1000 simulations, 95.2% of cases mimicked our final results, that is, a nonsignificant Intentionality effect and a significant Context effect (Intentionality > .36 & Context < .01, or .01 < Intentionality < .36 & Context < .01). In the remaining 4.8% of cases, we would have made a different decision. Table 1 shows the percentage of each possible result among the 1000 simulations. It should be noted that for the first four results shown in Table 1 (26.3% of simulations), data collection would have stopped at either data batch 1 or data batch 2, given that the p values for both factors fell outside the continuation boundary (.01 < p < .36). In the remaining 73.7% of simulations, we would have continued data collection until we reached the final data batch (N = 480).

 

Reflections

Had we collected data the traditional way and adopted the common standard of alpha < .05, we could have claimed significant main effects of both Intentionality and Context after our first data batch (N = 160), and we would have concluded that the effect of Intentionality was larger than that of Context.

 

If we look at our simulations, however, the effect of Intentionality was nonsignificant 98.1% of the time, while the effect of Context was significant 97% of the time. What we found in our first data batch was actually very unusual.

 

Had we not pre-registered our data collection method, it would have been tempting to stop at the first data batch. We could easily have fooled ourselves and others by claiming to have found the same effects as in the Dutch study. We will conduct further experiments to find out why the effect of Intentionality did not appear in the English study.

 

Tips

1.     Our study indicates that Frick’s COAST method for conducting sequential analyses is a solid one. Researchers should consider using it, particularly when no educated estimate of the effect size is available to determine the sample size through a power analysis.

2.     Simulation is a very useful method when your results are unexpected or difficult to interpret.

3.     Most importantly, pre-register your research design, data collection method, and data analysis, and stick to them. By doing so, we can largely avoid questionable research practices, such as post-hoc analyses and p-hacking.

 

 

Acknowledgement

I would like to thank my colleague, Dr. Steven Verheyen, for his help with the simulation study and the revision of the draft of this blog post.

 

 

References

 

Frick, R. W. (1998). A better stopping rule for conventional statistical tests. Behavior Research Methods, Instruments, & Computers, 30(4), 690–697.

 

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710.

Monday, June 22, 2020

My Memories of Anders Ericsson

On June 17, Anders Ericsson, a giant in the field of psychology, passed away. Neil Charness, who knew Anders Ericsson much better than I did, has written a heartfelt and beautiful in memoriam. Here, I am merely describing some memories of the 13 years that Anders was my colleague.

At the 1993 Annual Meeting of the Psychonomic Society in Washington DC, I was on my way to a poster session when I was approached by a bearded and somewhat burly gentleman in a blue blazer. He was extremely polite, introducing himself by making a slight bow, which I thought was both quaint and endearing. It was Anders Ericsson. I told him I knew his work on protocol analysis and, in fact, owned his 1984 book with Herbert Simon, which I’d bought as an undergraduate student. He told me there was an assistant professor position in his department and asked if I would consider applying.

Several months later I had accepted the position. And a few months after that, my small family and I moved to Tallahassee in June 1994. Anders had very generously offered to pay my summer salary out of his endowment, which allowed for a very smooth transition and made me feel at home in the department right away. One of my earliest memories from that period is of Anders’ wife, Natalie Sachs-Ericsson, very kindly taking me on a shopping trip to buy carpet for my new office in the old Psychology Building. Another early memory is of my daughter, Isabel, who was 2 at the time, being presented with a toy animal, a lamb, from Natalie and Anders when they came to visit. That lamb is still in my house.

Anders’ office was in the Kellogg Research Building next to the Psychology building. The two buildings were connected by a bridge and I have fond memories of standing on that bridge discussing science with Anders, while he smoked a cigarette and we looked at the Spanish-moss covered live oak and the trucks arriving to deliver test animals for our neuroscience colleagues. Although Anders spoke near-perfect English, he maintained a slight Swedish lilt and his sentences were liberally sprinkled with the adverbs essentially and basically, which I suspected were strategic devices deployed to give him more time to think about what to say next.

Years later, we would still be standing on that bridge. Anders had quit smoking at this point, but as soon as we approached the bridge, his hand would still go to his breast pocket, reaching for cigarettes that no longer were there. But his zest for discussion had not left him along with his smoking habit, so we still spent much time debating science topics. On one occasion, I remember being so engrossed in the conversation that I forgot I had to teach a class. I had no time to go back to my office because the students were already waiting and so had to go in empty-handed, much to Anders’ amusement. This was fun! Anders yelled after me, as I rushed off to my students.


We would often leave the bridge to go into the Psychology Building to get coffee. At the door of the Psychology Building, a strange ritual would invariably unfold, in which Anders and I tried to out-polite one another. After you, one of us would say. No, after you, the other would say, a back-and-forth that often lasted for half a minute or so. In the end, I think we were both polite enough to play it to a draw. I probably “won” about as many times by letting him go first as I “lost” by letting him let me go first. We both enjoyed this game, maybe because it reminded us of our common European roots.

To state that Anders loved to discuss is to understate things. He typically went on the offence and questioned the theoretical justification for your hypothesis or your use of a particular method. Every cognitive psychology colloquium speaker would be subjected to an interrogation by Anders on why they were using their method of choice and not verbal protocols. Wouldn’t you want to know what the people in your experiments are actually thinking? I remember him asking on more than one occasion. It usually left the guest speaker struggling for an answer.

Anders had an interesting style of mentoring a junior colleague. One year into my tenure track he said that it was all fine and good to have empirical papers, but if one wanted to get tenure, one needed a paper in Psychological Review or Psychological Bulletin. I could see his point on some abstract level, but as it pertained to me, I thought it would be a big risk. Writing such a paper would take a lot of time, time that could be spent on more empirical papers, and what if neither of these journals would accept my manuscript? What would be my alternative outlets? I couldn't think of any. 

For a moment it felt as if, on my ladder toward tenure, someone just had taken out a few rungs above me. On the other hand, it was very motivating that someone like Anders would think I was up to the challenge. What particularly convinced me was Anders’ point that you should want to do work that is cited 50 years from now. I set to work and in 1998 my Psych Bulletin paper with Gabe Radvansky appeared. It still is (by far) my most-cited paper and continues to be well cited to this day. We’re not even at the halfway mark of the 50 years Anders had in mind but I will forever remain grateful for the challenge he put in front of me on that bridge.

In our article, Radvansky and I were making use of the notion of long-term working memory, which Anders had developed with Walter Kintsch. I had hoped that this would form the basis for a collaboration with Anders but by then he was well into his expertise research and his focus seemed to have shifted away a little from long-term working memory. Without long-term working memory, finding a connection between research on language comprehension and expertise proved more difficult than I had imagined. 

At some point, Anders and I, still standing on that bridge, had come up with the topic of interpreting, a speaker translating someone else’s speech on the fly. The issue that interested us was how much comprehension would go on in such a task. We had devised an experiment, of the type we both liked: simple, clear, and clever (I think, retrospectively). It involved people translating French into English. The target phrase could only be translated in one of two ways, one of which would indicate comprehension (cross-sentential integration), whereas the other would indicate word-level translation. This would allow us to examine the effect of expertise. An expert interpreter would be able to integrate information across sentences, and thus comprehend, whereas a novice would have to resort to word-level translation. A graduate student in the French department ran the study. I’m not sure what happened to that study. My best recollection is that when the graduate student moved on, neither Anders nor I felt the study to be sufficiently close to our own interests to further pursue it. It turned out that it was a lot easier for us to stand on a bridge and discuss research than to build a bridge between our interests. 

Anders was a voracious reader, which is part of why it was so much fun to talk to him. He was a true intellectual and a deep thinker and you could talk meaningfully about an astonishing variety of topics with him. As behooves a true intellectual, he would be reading many different things simultaneously. So it was not uncommon to see a book open on his desk on Wolfgang Amadeus Mozart’s family history, an edited volume on sport psychology, a book on Elo-ratings in chess, a book on management, as well as various issues of Psych Review (his journal of choice), and photocopies of countless more articles. The picture of Anders’ office is not an exaggeration. I have seen worse. I remember once coming into his office and thinking he wasn’t there when he suddenly emerged from behind the stacks of books on his desk. He was probably writing another Psych Review article. I joked that if he went on like that we’d have to call a rescue team to excavate him from his office. His response was that, indeed, maybe he’d gotten carried away a little.

As much as Anders enjoyed discussions, so little did he care for small talk. In fact, he tried to turn small talk into a more meaningful discussion at the first opportunity he saw. I remember one such case. By then I was head of the CBS (Cognitive, Behavioral, and Social Psychology) area and we were assembling before a meeting. Anders and I had already arrived. Momentarily forgetting who my conversation partner was, I mentioned, just to shoot the breeze, a newspaper article I’d read that morning about intelligent behavior by an octopus. Before I realized it, Anders saw an opening and said something to the effect of In what way do you consider this behavior to be intelligent? Explain yourself, Sir! The response I had been looking for was more along the lines of Golly, how ‘bout them octopuses! What will they be up to next? I was relieved to see the other members of the area file in, so that we could start the meeting instead of me having to jump into the breach for the intellectual prowess of cephalopods.

At the same time, Anders’ desire for conversations to be meaningful was what I, and many others, liked so much about him. It made every conversation with him more draining and formal than you might like at times, yes, but he always spoke with enthusiasm, humor, and a twinkle in his bright blue eyes, as Neil put it in his in memoriam (he forgot to mention the wiggling eyebrows). Just like his bow, Anders’ conversational style was both quaint and endearing. He told me that someone once had tried to call him “Andy.” We had a good laugh about this. Anders was definitely not an Andy! After all, an Andy would have said How ‘bout them octopuses!

Although Anders was not an Andy, he didn’t seem to want to move back to Sweden either. I once asked him directly about this. His response was that his focus on his work was so strong that he could live anywhere, as long as he was able to do his work. I believed him. I was less single-minded about my work and did feel the pull of my home country. I moved back to the Netherlands in 2007, but I have very warm memories of my 13 years in Tallahassee as Anders’ colleague.

I was very sad to learn about his passing last week. Only a few weeks ago, I had recommended his book with Herbert Simon on protocol analysis to one of my graduate students. It is now cited in a manuscript that we are about to submit. I like to think that Anders would have liked some of the experiments in it, as they are somewhat similar to the experiment on interpreting he and I designed all these years ago.

I conclude with some characteristics of Anders as I observed them that are worth emulating. I’m not calling them things I learned from Anders, as I cannot claim to have mastered any of them. 

Read and think broadly and deeply. In Anders’ way of thinking, relevant information about a topic can come from disparate sources but to see and articulate their relevance, you had to think deeply. Your work would be so much the better for it. He did not have much time for researchers who only use a single method and can see the world only through the lens of that method.

Only conduct an experiment when you have thought everything through and you are convinced that: (1) the question is worth asking and (2) the experiment is the best way to answer that question. I heard that Anders used to tell his graduate students that they had to convince him of the experiment’s worth before they were allowed into the lab. In a publish-or-perish culture, this is bound to be frustrating, but imagine the state of the field if every advisor had taken Anders’ approach!

Go for the biggest effect! This was something Anders had learned from his time with Herb Simon, as he always pointed out. Again, imagine the state of the field if every advisor had followed Anders’ approach!

Be curious about what the participants in your experiments are actually thinking! You can have them press buttons, measure their eye movements, or record their brain activation but in many cases it might be more informative to get at their thought processes more directly. I am becoming more and more convinced of the wisdom of this myself.

Do work that will be cited 50 years from now. Don’t waste your time on smaller projects or administrative duties. And hide behind a wall of books if they come looking for you.

Always raise the bar, not just for others but also for yourself. Anders was relentless on this score, no doubt inspired by his research on expertise. His usual approach was to ask: What would be the best way to accomplish this or that? He would then go on to ask Who are the best people you can think of in this area? Next, he would ask What are they doing that you are not doing?, followed by How can you start doing what they are doing? Most researchers lack Anders’ fortitude of mind and will to do this relentlessly, but it is a mindset worth aspiring to.


References

Ericsson, K. A., & Kintsch, W. (1995). Long-term working memory. Psychological Review, 102, 211-245.

Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87, 215-251.

Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Zwaan, R. A., & Radvansky, G. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162-185.