Monday, July 20, 2020

How We Did Not Fool Ourselves: Reflections on Adopting a Flexible Sequential Testing Method



Five years ago, I wrote a post about COAST, a method of sequential testing proposed by Frick (1998). Time for a follow-up. In this guest post, my student Yiyun Liao, with the help of my colleague Steven Verheyen, writes about her experience using this method. She concludes with some tips about using COAST and about the importance of performing simulations and preregistering your research.






    Yiyun Liao

    Department of Psychology, Education and Child Studies

    Erasmus University Rotterdam, Netherlands

    liao@essb.eur.nl

 


Recently, together with Drs. Katinka Dijkstra and Rolf Zwaan, I conducted a study on two English prepositions: to and towards. Instead of following a conventional fixed-sample testing method, we adopted a flexible sequential testing method based on the COAST method (Frick, 1998). A very interesting case presented itself after we finished our data analysis.

 

The Study

The study was intended to replicate what we had found in a previous study on two Dutch prepositions: naar (‘to’) and richting (‘direction/towards’). We found that both the actor’s goal (Intentionality) and the social status of the interlocutor (Context) affect the use of naar and richting in an event-description task.

 

Specifically, when there was a clear inference that the actor’s goal was going to the reference object in the described situation (e.g., a person carrying a trash bag and a trash bin being in the near distance), naar was used more often, compared to when there was no such inference (e.g., a person walking with nothing in hand and a trash bin being in the near distance). Moreover, richting was used more often when participants were told the interlocutor was a police officer, rather than a friend of the speaker.

 

We aimed to replicate the above two patterns in English by doing the same study on the two English directional prepositions to and towards. We predicted the same main effects of Intentionality and Context on the use of the two English directional prepositions.

 

Data collection

This study adopted Frick’s COAST method to conduct sequential analyses, as this method had also been used in the Dutch study.

 

“In the sequential stopping rule I am proposing, the researcher can perform a statistical test at any time. If the outcome of this statistical test is p < .01, the researcher stops testing subjects and rejects the null hypothesis; if p > .36, the researcher stops testing subjects and does not reject the null hypothesis; and if .01<p<.36, more subjects are tested.”

Frick (1998, p. 691)

 

According to Monte Carlo simulations performed by Frick (1998), it is possible to preserve the overall alpha level in a sequential analysis, provided one is committed to the above two termination criteria. There is no strict rule about the minimum number of participants a researcher should test with this method. However, after having determined a minimum sample size, the researcher should be willing to stop testing more participants when a p value above .36 is found.
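To get a feel for why this rule keeps the overall alpha level in check, here is a toy simulation of our own, using the batch sizes we adopted for this study (described below). A two-group t test stands in for the real analysis, and all numbers are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def coast_under_null(batch=160, max_n=480):
    """One simulated experiment with NO true effect, tested after each
    batch of participants using Frick's termination criteria."""
    a = np.empty(0)
    b = np.empty(0)
    while len(a) + len(b) < max_n:
        a = np.append(a, rng.normal(size=batch // 2))
        b = np.append(b, rng.normal(size=batch // 2))
        p = stats.ttest_ind(a, b).pvalue
        if p < .01:
            return True    # false positive: H0 wrongly rejected
        if p > .36:
            return False   # stop and (correctly) retain H0
    return False           # budget exhausted while .01 < p < .36

false_positives = np.mean([coast_under_null() for _ in range(5000)])
print(f"overall Type I error: {false_positives:.3f}")  # well below .05
```

With batches of 160 and a hard stop at 480 there are at most three looks at the data, and with a rejection threshold of .01 per look the overall false-positive rate cannot exceed .03 (three looks at .01 each), comfortably below the conventional .05 despite the repeated testing.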

 

As in the Dutch study, we decided to test 160 participants as our first data batch (the minimum number of participants we planned to test). If p > .36 or p < .01 for each main effect we were testing (Intentionality and Context), we would stop testing. If p fell within these boundaries for either of the two predicted main effects, we would collect another 160 participants. Considering the experimental costs (i.e., money and time), we decided to stop at N = 480 regardless of what the p values were (Lakens, 2014).
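In code, the decision we faced after each round of data collection looks roughly like this (a minimal sketch of our pre-registered rule; the example p values are invented):

```python
LOWER, UPPER, MAX_N = .01, .36, 480

def decision(p_values, n):
    """Apply the pre-registered stopping rule after testing n participants."""
    resolved = all(p < LOWER or p > UPPER for p in p_values)
    if resolved or n >= MAX_N:
        return "stop"
    return "collect another 160 participants"

print(decision([.005, .50], n=160))  # both effects resolved -> 'stop'
print(decision([.005, .05], n=160))  # one p in the gray zone -> keep testing
```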

 

It is important to note that we had pre-registered this data collection plan, together with our materials, design, hypotheses, exclusion criteria, and analyses, on the Open Science Framework (see details at: https://osf.io/7c5zh/?view_only=54cdbbb89cfb4f58a952edf8bd7331ab).

 

Data analysis

This is where the interesting case was discovered!

 


Based on the stopping rule and our pre-registration, we collected data in three rounds and thus obtained three data batches. Figure 1 presents the obtained p values for each factor (Intentionality and Context) at each data batch.

 

First data batch. As in the Dutch study, we performed a logistic regression analysis on our first data batch. We found a highly significant effect of Intentionality (estimate = -0.995, SE = 0.34, z = -2.919, p = .004), whereas the p value for Context was within the boundary of .01 to .36 (estimate = 0.676, SE = 0.34, z = 1.989, p = .047). Under a conventional fixed-sample approach, we would have claimed that we had found evidence for both factors, given that the p values for both were lower than .05! Arguably, this would have made our study easier to publish.
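For readers who want to see the shape of this analysis, here is a minimal sketch in Python. The file and column names are hypothetical (one row per response, the outcome coded 1 for towards, and both factors dummy-coded 0/1); the actual analysis may differ in its details:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per response; 'towards' is 1 if the participant
# used "towards" rather than "to"; intentionality and context are coded 0/1.
df = pd.read_csv("preposition_data.csv")

fit = smf.logit("towards ~ intentionality + context", data=df).fit()
print(fit.summary())                               # estimates, SEs, z and p values
print(fit.pvalues[["intentionality", "context"]])  # the two p values we monitor
```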

 

However, based on our stopping rule (p < .01) and pre-registration, we could not do this. Therefore, we collected data from another 160 participants. Together with the previous 160 participants, this resulted in a second data batch that consisted of 320 participants.

 

Second data batch. We performed the same analysis on the second data batch. This time, the p values for both factors fell within the .01–.36 continuation region (Intentionality: estimate = -0.482, SE = 0.23, z = -2.071, p = .038; Context: estimate = 0.534, SE = 0.23, z = 2.296, p = .022). We noticed that the effect of Intentionality had started to wane (from p = .004 to p = .038).

 

Although the p values for both factors were below .05, we still could not stop and claim evidence for both factors at this point. A second chance to claim significant effects slipped away. We then collected another 160 participants and reached the maximum number of 480 participants we intended to include.

 

Third data batch. The same analysis was conducted on the third data batch (480 participants). The effect of Context was now significant by our criterion (p below .01: estimate = 0.673, SE = 0.19, z = 3.538, p < .001). This corresponds to what was predicted based on the Dutch study. However, the predicted effect of Intentionality had disappeared entirely (estimate = -0.323, SE = 0.19, z = -1.700, p = .089).

 

To conclude, we replicated the effect of Context (the social status of the interlocutor) in the English study. However, we could not replicate the effect of Intentionality. What appeared to be an easy sell on the first data batch (p = .004) turned out to be an unexpected disappointment. In order to find out why, we decided to dig deeper and explore what might be going on. Had the COAST method led us astray, robbing us of our predicted findings? We conducted a simulation study based on the collected data to find out.

 

A simulation study

We permuted the participants within conditions, simulating what would have happened had we recruited the participants in a different order and applied the COAST method. We repeated this procedure 1000 times. This simulation study was meant to make sure that we had not missed any chance of finding an effect of Intentionality (i.e., that the COAST method had not “robbed” us of an effect). Specifically, we wanted to know in what percentage of the 1000 simulations the results mimicked our final results (i.e., a nonsignificant Intentionality effect and a significant Context effect), and in what percentage we would have reached a different decision (i.e., two significant effects, no significant effects, or a significant Intentionality effect with a nonsignificant Context effect).
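Here is a sketch of this permutation procedure, reusing the hypothetical data frame and model from the earlier sketch (the column names and helper structure are ours; the actual simulation code may differ):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2020)
df = pd.read_csv("preposition_data.csv")  # hypothetical file, as above

def permuted_order(df):
    """Shuffle the participant order separately within each design cell, then
    interleave the cells so every batch of 160 stays balanced across conditions."""
    cells = [g.iloc[rng.permutation(len(g))].reset_index(drop=True)
             for _, g in df.groupby(["intentionality", "context"])]
    return pd.concat(cells).sort_index(kind="stable").reset_index(drop=True)

def coast_run(order):
    """Re-apply the stopping rule to one permuted recruitment order."""
    for n in (160, 320, 480):
        fit = smf.logit("towards ~ intentionality + context",
                        data=order.head(n)).fit(disp=0)
        p_int, p_ctx = fit.pvalues["intentionality"], fit.pvalues["context"]
        if n == 480 or all(p < .01 or p > .36 for p in (p_int, p_ctx)):
            return p_int, p_ctx

results = [coast_run(permuted_order(df)) for _ in range(1000)]
mimics = np.mean([(p_int >= .01) and (p_ctx < .01) for p_int, p_ctx in results])
print(f"proportion of runs mimicking our final result: {mimics:.3f}")
```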

 

Table 1. The percentages of each possible result among the 1000 simulations

Result                                                       Percentage   Mimics the final results?
Double significance (Intentionality < .01 & Context < .01)       1.8%     no
Zero significance (Intentionality > .36 & Context > .36)         2.9%     no
One effect (Intentionality < .01 & Context > .36)                0.1%     no
One effect (Intentionality > .36 & Context < .01)               21.5%     yes
One effect (.01 < Intentionality < .36 & Context < .01)         73.7%     yes

 


We found that among the 1000 simulations, 95.2% of cases mimicked our final results, that is, a nonsignificant Intentionality effect and a significant Context effect (Intentionality > .36 & Context < .01, or .01 < Intentionality < .36 & Context < .01). In the other 4.8% of cases, we would have reached a different decision. Table 1 shows the percentages of each possible result among the 1000 simulations. It should be noted that for the first four results shown in Table 1 (jointly 26.3% of the simulations), data collection would have stopped at either data batch 1 or data batch 2, given that the p values for both factors fell outside of the boundary we set (.01 < p < .36). In the remaining 73.7% of the simulations, we would have continued data collection until we reached the final data batch (N = 480).

 

Reflections

Had we collected data in the traditional way and adopted the conventional alpha level of .05, we could have claimed that we had found significant main effects of both Intentionality and Context after our first data batch (N = 160), and we would have reached the conclusion that the effect of Intentionality was larger than that of Context.

 

If we look at our simulations, however, the effect of Intentionality was nonsignificant 98.1% of the time (the 2.9% + 21.5% + 73.7% of cases in Table 1 where its p value exceeded .01), while the effect of Context was significant 97% of the time (1.8% + 21.5% + 73.7%). What we found in our first data batch is actually very unusual.

 

Had we not pre-registered our data collection method, it would have been tempting to stop at the first data batch. We could easily have fooled ourselves and others by claiming to have found the same effects as in the Dutch study. We will conduct further experiments to find out why the effect of Intentionality did not appear in the English study.

 

Tips

1.     Our study indicates that Frick’s COAST method for conducting sequential analyses is a solid one. Researchers should consider using it, particularly when no educated estimate of the effect size is available to establish the sample size through a power analysis.

2.     Simulation is a very useful tool when you do not understand your results or when the results were unexpected.

3.     Most importantly, pre-register your research design, data collection method, and data analysis, and stick to them. By doing so, we can largely avoid questionable research practices, such as undisclosed post-hoc analyses and p-hacking.

 

 

Acknowledgement

I would like to thank my colleague, Dr. Steven Verheyen, for his help with the simulation study and with the revision of the draft of this blog post.

 

 

References

 

Frick, R. W. (1998). A better stopping rule for conventional statistical tests. Behavior Research Methods, Instruments, & Computers, 30(4), 690–697.

 

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710.

Monday, June 22, 2020

My Memories of Anders Ericsson

On June 17, Anders Ericsson, a giant in the field of psychology, passed away. Neil Charness, who knew Anders Ericsson much better than I did, has written a heartfelt and beautiful in memoriam. Here, I am merely describing some memories of the 13 years that Anders was my colleague.

At the 1993 Annual Meeting of the Psychonomic Society in Washington DC, I was on my way to a poster session when I was approached by a bearded and somewhat burly gentleman in a blue blazer. He was extremely polite, introducing himself by making a slight bow, which I thought was both quaint and endearing. It was Anders Ericsson. I told him I knew his work on protocol analysis and, in fact, owned his 1984 book with Herbert Simon, which I’d bought as an undergraduate student. He told me there was an assistant professor position in his department and asked if I would consider applying.

Several months later I had accepted the position. And a few months after that, my small family and I moved to Tallahassee in June 1994. Anders had very generously offered to pay my summer salary out of his endowment, allowing for a very smooth transition and making me feel at home in the department right away. One of my earliest memories from that period is of Anders’ wife, Natalie Sachs-Ericsson, very kindly taking me on a shopping trip to buy carpet for my new office in the old Psychology Building. Another early memory is of my daughter, Isabel, who was 2 at the time, being presented with a toy animal, a lamb, from Natalie and Anders when they came to visit. That lamb is still in my house.

Anders’ office was in the Kellogg Research Building next to the Psychology building. The two buildings were connected by a bridge and I have fond memories of standing on that bridge discussing science with Anders, while he smoked a cigarette and we looked at the Spanish-moss covered live oak and the trucks arriving to deliver test animals for our neuroscience colleagues. Although Anders spoke near-perfect English, he maintained a slight Swedish lilt and his sentences were liberally sprinkled with the adverbs essentially and basically, which I suspected were strategic devices deployed to give him more time to think about what to say next.

Years later, we would still be standing on that bridge. Anders had quit smoking at this point, but as soon as we approached the bridge, his hand would still go to his breast pocket, reaching for cigarettes that no longer were there. But his zest for discussion had not left him along with his smoking habit, so we still spent much time debating science topics. On one occasion, I remember being so engrossed in the conversation that I forgot I had to teach a class. I had no time to go back to my office because the students were already waiting and so had to go in empty-handed, much to Anders’ amusement. “This was fun!” Anders yelled after me as I rushed off to my students.


We would often leave the bridge to go into the Psychology Building to get coffee. At the door of the psychology building, a strange ritual would invariably unfold, in which Anders and I tried to out-polite one another. “After you,” one of us would say. “No, after you,” the other would say, a back-and-forth that often lasted for half a minute or so. In the end, I think we were both polite enough to play it to a draw. I probably “won” about as many times by letting him go first as I “lost” by letting him let me go first. We both enjoyed this game, maybe because it reminded us of our common European roots.

To state that Anders loved to discuss is to understate things. He typically went on the offense, questioning the theoretical justification for your hypothesis or your use of a particular method. Every cognitive psychology colloquium speaker would be subjected to an interrogation by Anders about why they were using their method of choice and not verbal protocols. “Wouldn’t you want to know what the people in your experiments are actually thinking?” I remember him asking on more than one occasion. It usually left the guest speaker struggling for an answer.

Anders had an interesting style of mentoring a junior colleague. One year into my tenure track, he said that it was all fine and good to have empirical papers, but if one wanted to get tenure, one needed a paper in Psychological Review or Psychological Bulletin. I could see his point on some abstract level, but as it pertained to me, I thought it would be a big risk. Writing such a paper would take a lot of time, time that could be spent on more empirical papers, and what if neither of these journals accepted my manuscript? What would be my alternative outlets? I couldn't think of any.

For a moment it felt as if, on my ladder toward tenure, someone had just taken out a few rungs above me. On the other hand, it was very motivating that someone like Anders would think I was up to the challenge. What particularly convinced me was Anders’ point that you should want to do work that is cited 50 years from now. I set to work and in 1998 my Psych Bulletin paper with Gabe Radvansky appeared. It still is (by far) my most-cited paper and continues to be well cited to this day. We’re not even at the halfway mark of the 50 years Anders had in mind, but I will forever remain grateful for the challenge he put in front of me on that bridge.

In our article, Radvansky and I were making use of the notion of long-term working memory, which Anders had developed with Walter Kintsch. I had hoped that this would form the basis for a collaboration with Anders but by then he was well into his expertise research and his focus seemed to have shifted away a little from long-term working memory. Without long-term working memory, finding a connection between research on language comprehension and expertise proved more difficult than I had imagined. 

At some point, Anders and I, still standing on that bridge, had come up with the topic of interpreting, a speaker translating someone else’s speech on the fly. The issue that interested us was how much comprehension would go on in such a task. We had devised an experiment, of the type we both liked: simple, clear, and clever (I think, retrospectively). It involved people translating French into English. The target phrase could only be translated in one of two ways, one of which would indicate comprehension (cross-sentential integration), whereas the other would indicate word-level translation. This would allow us to examine the effect of expertise. An expert interpreter would be able to integrate information across sentences, and thus comprehend, whereas a novice would have to resort to word-level translation. A graduate student in the French department ran the study. I’m not sure what happened to that study. My best recollection is that when the graduate student moved on, neither Anders nor I felt the study to be sufficiently close to our own interests to further pursue it. It turned out that it was a lot easier for us to stand on a bridge and discuss research than to build a bridge between our interests. 

Anders was a voracious reader, which is part of why it was so much fun to talk to him. He was a true intellectual and a deep thinker and you could talk meaningfully about an astonishing variety of topics with him. As behooves a true intellectual, he would be reading many different things simultaneously. So it was not uncommon to see a book open on his desk on Wolfgang Amadeus Mozart’s family history, an edited volume on sport psychology, a book on Elo-ratings in chess, a book on management, as well as various issues of Psych Review (his journal of choice), and photocopies of countless more articles. The picture of Anders’ office is not an exaggeration. I have seen worse. I remember once coming into his office and thinking he wasn’t there when he suddenly emerged from behind the stacks of books on his desk. He was probably writing another Psych Review article. I joked that if he went on like that we’d have to call a rescue team to excavate him from his office. His response was that, indeed, maybe he’d gotten carried away a little.

As much as Anders enjoyed discussions, so little did he care for small talk. In fact, he tried to turn small talk into a more meaningful discussion at the first opportunity he saw. I remember one such case. By then I was head of the CBS (Cognitive, Behavioral, and Social Psychology) area and we were assembling before a meeting. Anders and I had already arrived. Momentarily forgetting who my conversation partner was, I mentioned, just to shoot the breeze, a newspaper article I’d read that morning about intelligent behavior by an octopus. Before I realized it, Anders saw an opening and said something to the effect of “In what way do you consider this behavior to be intelligent? Explain yourself, Sir!” The response I had been looking for was more along the lines of “Golly, how ‘bout them octopuses! What will they be up to next?” I was relieved to see the other members of the area file in, so that we could start the meeting instead of me having to jump into the breach for the intellectual prowess of cephalopods.

At the same time, Anders’ desire for conversations to be meaningful was what I, and many others, liked so much about him. It made every conversation with him more draining and formal than you might like at times, yes, but he always spoke with enthusiasm, humor, and “a twinkle in his bright blue eyes,” as Neil put it in his in memoriam (he forgot to mention the wiggling eyebrows). Just like his bow, Anders’ conversational style was both quaint and endearing. He told me that someone once had tried to call him “Andy.” We had a good laugh about this. Anders was definitely not an Andy! After all, an Andy would have said “How ‘bout them octopuses!”

Although Anders was Anders, not an Andy, he didn't seem to want to move back to Sweden either. I once asked him directly about this. His response was that his focus on his work was so strong that he could live anywhere, as long as he was able to do his work. I believed him. I was less single-minded about my work and did feel the pull of my home country. I moved back to the Netherlands in 2007, but I have very warm memories of my 13 years in Tallahassee as Anders' colleague.

I was very sad to learn about his passing last week. Only a few weeks ago, I had recommended his book with Herbert Simon on protocol analysis to one of my graduate students. It is now cited in a manuscript that we are about to submit. I like to think that Anders would have liked some of the experiments in it, as they are somewhat similar to the experiment on interpreting he and I designed all these years ago.

I conclude with some characteristics of Anders as I observed them that are worth emulating. I’m not calling them things I learned from Anders, as I cannot claim to have mastered any of them. 

Read and think broadly and deeply. In Anders’ way of thinking, relevant information about a topic can come from disparate sources but to see and articulate their relevance, you had to think deeply. Your work would be so much the better for it. He did not have much time for researchers who only use a single method and can see the world only through the lens of that method.

Only conduct an experiment when you have thought everything through and you are convinced that: (1) the question is worth asking and (2) the experiment is the best way to answer that question. I heard that Anders used to tell his graduate students that they had to convince him of the experiment’s worth before they were allowed into the lab. In a publish-or-perish culture, this is bound to be frustrating, but imagine the state of the field if every advisor had taken Anders’ approach!

Go for the biggest effect! This was something Anders had learned from his time with Herb Simon, as he always pointed out. Again, imagine the state of the field if every advisor had followed Anders’ approach!

Be curious about what the participants in your experiments are actually thinking! You can have them press buttons, measure their eye movements, or record their brain activation but in many cases it might be more informative to get at their thought processes more directly. I am becoming more and more convinced of the wisdom of this myself.

Do work that will be cited 50 years from now. Don’t waste your time on smaller projects or administrative duties. And hide behind a wall of books if they come looking for you.

Always raise the bar, not just for others but also for yourself. Anders was relentless on this score, no doubt inspired by his research on expertise. His usual approach was to ask: “What would be the best way to accomplish this or that?” He would then go on to ask: “Who are the best people you can think of in this area?” Next, he would ask: “What are they doing that you are not doing?”, followed by: “How can you start doing what they are doing?” Most researchers lack Anders’ fortitude of mind and will to do this relentlessly, but it is a mindset worth aspiring to.


References

Ericsson, K. A., & Kintsch, W. (1995). Long-term working memory. Psychological Review, 102, 211–245.

Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87, 215–251.

Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.

Zwaan, R. A., & Radvansky, G. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162–185.


Friday, May 18, 2018

A Career Niche for Replicators?

My former colleague Roy Baumeister famously said that replication is a “career niche for bad experimenters.”* I like to use this quote in my talks. Roy is wrong, of course. As anyone who has tried to conduct a replication study knows, it requires a great deal of skill to perform replications. This leads to the question: is there a career niche for replicators?

I was asked this question yesterday when I gave a talk on Making Replication Mainstream at the marvellous Donders Institute for Cognition, Brain, and Behaviour in Nijmegen. I get asked this question regularly. My standard answer is that it is not a good career choice. Implicit in this answer is the idea that in order to become a tenured faculty member, one has to make a unique contribution to the literature. Promotion-and-tenure letter writers are always asked to comment on the uniqueness of a candidate’s work. Someone who only conducts replication studies would run the risk of not meeting the current requirements to become and remain a faculty member.

During lunch, a group of us got to talking some more about this issue, to which I hadn't given sufficient thought, as it soon turned out.

It was pointed out that there is a sizeable group of researchers who would like to remain in science and have excellent methodological skills, but don’t necessarily have the ambition/creativity/chutzpah/temerity to pursue a career as a faculty member.

These researchers, was the thinking at our lunch table, are perfectly suited to conduct replication research. The field would benefit greatly from their work. If we truly want to make replication mainstream, there ought to be a career niche for them.

If a faculty position is not a viable option, then what would be a good career niche for replicators? It was suggested at our table that replicators could become staff members, much like lab managers. They would not be evaluated on the originality or uniqueness of their publications. In fact, maybe they would not even be on the publications, just as lab managers often are not. Faculty members would select studies for replication, and replicators would conduct them, thereby making a valuable contribution to our science.

I think this is a fair summary of our discussion. I have no strong opinions on this career niche for replicators yet, but I wonder what y’all’s thoughts on this are.

----
* The link is to a paywalled article but I'm sure you can scihub your way to it.

Friday, February 2, 2018

How to Avoid More Train Wrecks

Update February 3: I added a Twitter response made by the first author. In the comments section there is a comment by the second author.

I just submitted my review of the manuscript Experimental Design and the Reliability of Priming Effects: Reconsidering the "Train Wreck" by Rivers and Sherman. Here it is.

The authors start with two important observations. First, semantic priming experiments yield robust effects, whereas “social priming” (I’m following the authors’ convention of using quotation marks here) experiments do not. Second, semantic priming experiments use within-subjects designs, whereas “social priming” experiments use between-subjects designs. The authors are right in pointing out that this latter fact has not received sufficient attention.

The authors’ goal is to demonstrate that the second fact is the cause of the first. Here is how they summarize their results in the abstract: “These results indicate that the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect, rather than the content domain in which the effect has been demonstrated.”

This is not what the results are telling us. What the authors have done is take existing well-designed experiments (not all of which are priming experiments, by the way, as has already been pointed out on social media) and demolish them to create, I’m sorry to say, more train wrecks of experiments, in which only a single trial per subject is retained. By thus getting rid of the vast majority of trials, the authors end up with an “experiment” that no one in their right mind would design. Unsurprisingly, they find that in each of the cases the effect is no longer significant.

Does this show that “the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect”? Of course not. The authors imply that having a within-subjects design is sufficient for finding robust priming effects, of whatever kind. But they have not even demonstrated that a within-subjects design is necessary for priming effects to occur. For example, based on the data in this manuscript, it cannot be ruled out that a sufficiently powered between-subjects semantic priming experiment would, in fact, yield a significant result. We already know from replication studies that between-subjects “social priming” experiments do not yield significant effects, even with large power.
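To make the power cost of this "demolition" concrete, consider a toy simulation (my own illustration; the numbers are invented and have nothing to do with the manuscript's data). A modest 30 ms priming effect is detected virtually every time when each subject contributes 100 trials per condition, and almost never when each subject contributes a single trial and the comparison is made between subjects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_experiment(n_subj=40, n_trials=100, effect=30, sd_subj=100, sd_trial=150):
    """Simulated RTs: every subject responds in both prime conditions."""
    subj = rng.normal(600, sd_subj, n_subj)[:, None]   # per-subject baseline speed
    primed = subj + rng.normal(0, sd_trial, (n_subj, n_trials))
    unprimed = subj + effect + rng.normal(0, sd_trial, (n_subj, n_trials))
    # Within-subjects analysis on per-subject means:
    p_within = stats.ttest_rel(unprimed.mean(axis=1), primed.mean(axis=1)).pvalue
    # "Demolished" version: one trial per subject, analyzed between subjects:
    p_single = stats.ttest_ind(unprimed[: n_subj // 2, 0],
                               primed[n_subj // 2 :, 0]).pvalue
    return p_within, p_single

pvals = np.array([one_experiment() for _ in range(1000)])
print("power, within-subjects:     ", (pvals[:, 0] < .05).mean())
print("power, single-trial between:", (pvals[:, 1] < .05).mean())
```

Discarding all but one trial per subject removes exactly the averaging-over-noise that makes within-subjects designs sensitive, so a null result from such an “experiment” tells us next to nothing about the underlying effect.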

More importantly, the crucial experiment, one demonstrating that a within-subjects design is sufficient to yield “social priming” effects, is absent from the paper. Without such an experiment, any claims about the design being the key difference between semantic and “social priming” are unsupported.

So where does this leave us? The authors have made an important initial step in identifying differences between semantic and “social priming” studies. However, to draw causal conclusions of the type the authors want to draw in this paper, two experiments are needed.

First, an appropriately powered single-trial between-subjects semantic priming experiment. To support the authors’ view, this experiment should yield a null result. This should of course be tested using the appropriate statistics. Rather than using response times the authors might consider using a word-stem completion task. Contrary to what the authors would have to predict, I predict a significant effect here. If I’m correct, it would invalidate the authors’ claim about a causal relation between design and effect robustness.

Second, the authors should conduct a within-subjects “social priming” experiment (one that is close to the studies they describe in the introduction). Whether or not this is possible, I cannot determine.

If the authors are willing to conduct these experiments--and omit the uninformative ones they report in the current manuscript--then they would make a truly major contribution to the literature. As it stands, they merely add more train wrecks to the literature. I therefore sincerely hope they are willing to undertake the necessary work.

Smaller points

p. 8. “In this approach, each participant is randomized to one level of the experimental design based on the first experimental trial to which they are exposed. The effect of priming is then analyzed using fully between-subjects tests.” But the order in which the stimuli were presented was randomized, right? So this means that this analysis actually compares different items. Given that there typically is variability in response times across items (see Herb Clark’s 1973 paper on the “language-as-fixed-effect fallacy”), this unnecessarily introduces noise into the analysis. Because there usually also is a serial position effect, this problem cannot be solved by taking the same item. One would have to take the same item in the same position. Therefore, it is impossible to take a single trial without losing experimental control over item and order effects. This is another reason why the “experiments” reported in this paper are uninformative.

p. 9. The Stroop task is not really a priming task, as the authors point out in a footnote. Why not use a real priming task?

p. 15. “It is not our intention to suggest that failures to replicate priming effects can be solely attributed to research design.” Maybe not, but by stating that design is “the key difference,” the authors are claiming it has a causal role.

p. 16. “We anticipate that some critics will not be satisfied that we have examined ‘social priming’.” I’m with the critics on this one.

p. 17. “We would note that there is nothing inherently “social” about either of these features of priming tasks. For example, it is not clear what is particularly “social” about walking down a hallway.” Agreed. Maybe call it behavioral priming then?

p. 18. “Unfortunately, it is not possible to ask subjects to walk down the same hallway 300 times after exposure to different primes.” Sure, but with a little flair, it should be possible to come up with a dependent measure that would allow for a within-subjects design.

p. 19. “We also hope that this research, for once and for all, eliminates content area as an explanation for the robustness of priming effects.” Without experiments such as the ones proposed in this review, this hope is futile.




Wednesday, January 31, 2018

A Replication with a Wrinkle

A number of years ago, my colleagues Peter Verkoeijen, Katinka Dijkstra, several undergraduate students, and I conducted a replication of Experiment 5 of Kidd & Castano (2013). In that study, published in Science, participants were exposed to an excerpt from either literary fiction or from non-literary fiction.

Kidd and Castano hypothesized that brief exposure to literary fiction as opposed to non-literary fiction would enhance empathy in people because of the greater psychological finesse in literary novels than in non-literary novels. Anyone who has read, say, Proust as well as Michael Crichton will probably intuit what Kidd and Castano were getting at.

Their results showed indeed that people who had been briefly exposed to the literary excerpt showed more empathy in Theory of Mind (ToM) tests than participants who had been briefly exposed to the non-literary excerpt.

Because the study touches on some of our own interests (text comprehension, literature, empathy), and for a number of reasons detailed in the article, we decided to replicate one of Kidd and Castano’s experiments, namely their Experiment 5. Unlike Kidd and Castano, we found no significant effect of text condition on ToM. We wrote that study up for publication in the Dutch journal De Psycholoog, a journal targeted at a broad audience of practitioners and scientists.

Because researchers from other countries kept asking us about the results of our replication attempt, we decided to make them more broadly available by writing an English version of the article with a more detailed methods and results section than was possible in the Dutch journal. This work was spearheaded by first author Iris van Kuijk, who was an undergraduate student when the study was conducted. A preprint of the article can be found here. An attentive reader who is familiar with the Dutch version and now reads the English version will be surprised. In the Dutch version the effect was not replicated but in the English version it was. What gives?

And this brings us to the wrinkle mentioned in the title. The experiment relies on subjects having read the excerpt. However, as any psychologist knows, there are always people who don’t follow instructions. To pinpoint such subjects and later exclude their data, it is useful to know whether they’ve actually read the texts. In both experiments, reading times per excerpt were collected.

We originally reasoned that it would be impossible for someone to read and understand a page in under 30 seconds. So we excluded subjects who had one or more reading times < 30 seconds per page. This ensured that our sample included only subjects who had spent at least a reasonable amount of time on each page. This would give the manipulation (reading a literary vs. a non-literary excerpt) an optimal chance to work.

Upon reanalyzing the data for the English version, my co-authors noticed that Kidd and Castano had used a different criterion for excluding outliers. They had used a criterion that was less stringent than ours. They had excluded subjects whose average reading times were < 30 seconds. This potentially includes subjects who may have had long reading times for one page but may have skimmed another.
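The difference between the two criteria is easy to gloss over, so here is a sketch of both in Python (hypothetical file and column names; one row per subject per page):

```python
import pandas as pd

# Hypothetical long-format data: columns subject, page, seconds.
rt = pd.read_csv("reading_times.csv")
per_subject = rt.groupby("subject")["seconds"]

# Our original criterion: exclude a subject if ANY page took < 30 s.
excluded_ours = per_subject.min().lt(30)

# Kidd and Castano's criterion: exclude if the AVERAGE over pages is < 30 s.
excluded_kc = per_subject.mean().lt(30)

print(excluded_ours.sum(), "excluded by the per-page criterion")
print(excluded_kc.sum(), "excluded by the average criterion")
```

Note that anyone excluded by the average criterion is necessarily excluded by the per-page criterion as well (a mean below 30 s implies at least one page below 30 s), which is why our original criterion is the more stringent of the two.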

Our original approach ensured that people had at least spent a sufficient amount of time on each page. This still does not guarantee that they actually comprehended the excerpts, of course. For this, it would have been better to include comprehension questions, such that subjects with low comprehension scores could have been excluded, as is common in text comprehension research. 

Because we intended to conduct a direct replication, we decided to adopt the exclusion criterion used by Kidd and Castano, even though we thought our own was better. And then something surprising happened: the effect appeared!

What to make of this? On the one hand, you could say that our direct replication reproduced the original effect (very closely indeed). On the other hand, we cannot come up with a theoretically sound reason why the effect would appear with a less stringent exclusion criterion, which gives the manipulation less chance to impact ToM responses, and disappear with a more stringent criterion.

Nevertheless, if we want to be true to the doctrine of direct replication, which we do, then we should count this as a replication of the original effect but with a qualification. As we say in the paper:
“Taken together, it seems that replicating the results of Kidd and Castano (2013) hinges on choosing a particular set of exclusion criteria that a priori seem not better than alternatives. In fact, […] one could argue that a more stringent criterion regarding reading times (i.e., smaller than 30s per page rather than smaller than 30s per page on average) is to be preferred because participants who spent less than 30 seconds on a page did not adhere to the task instruction of reading the entire text carefully.”
The article also includes a mini meta-analysis of four studies, including the original study and our replication. The meta-analytic effect is not significant but there is significant heterogeneity among the studies.

In other words, there still are some wrinkles to be ironed out.


Tuesday, December 19, 2017

My Cattle

A while back, Lorne Campbell wrote a blog post listing the preregistered publications from his lab. This is a great idea. It is easy to talk the talk, but it’s harder to walk the walk.

So under the notion that we don't want to be all hat and no cattle, I rounded up some replications and preregistered original papers that I co-authored.

First the replications.

I find performing replications very insightful. My role in two of the RRRs listed below (verbal overshadowing and facial feedback) was rather minor, but the 2016 RRR and the issues surrounding it, on which I've blogged before, felt like an onslaught. The 2012 replication study was used to iron out an inconsistency in the literature. An additional replication study is close to getting accepted and will be added to the list in an update.

These days I use direct replications primarily when I want to build on work by others. As per Richard Feynman, before we move on we first need to attempt a direct replication of the effect we want to build on. We first need to know if we can reproduce it in our own lab.

Zwaan, R.A., Pecher, D. (2012). Revisiting Mental Simulation in Language Comprehension: Six Replication Attempts. PLoS ONE 7(12): e51382.

Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., ... Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.

Eerland, A., Sherrill, A.M., Magliano, J.P., Zwaan, R.A., Arnal, J.D., Aucoin, P., … Prenoveau, J.M. (2016). Registered replication report: Hart & AlbarracĂ­n (2011). Perspectives on Psychological Science, 11, 158-171. 

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., . . . Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.

Zwaan, R. A., Pecher, D., Paolacci, G., Bouwmeester, S., Verkoeijen, P., Dijkstra, K., & Zeelenberg, R. (in press). Participant nonnaiveté and the reproducibility of cognitive psychology. Psychonomic Bulletin & Review.

Next the original preregistered studies.

I started preregistering experiments several years ago. All in all, I find it an extremely important practice, quite possibly the most important thing we can do to improve the field. After a while preregistration becomes second nature and it becomes odd not to do it.

I have no experience yet with reviewed preregistrations (other than the three RRRs that I’ve participated in). My co-authors and I submitted one over three months ago but we haven’t gotten the reviews yet.

I should add that I've co-authored quite a few additional empirical papers during this period that were not preregistered. This is mainly because the experiments in those papers were conducted years ago, before preregistration was a thing.

Eerland, A., Sherrill, A.M., Magliano, J.P., Zwaan, R.A. (2017). The Blame Game: An investigation of Grammatical Aspect and Blame Judgments. Collabra: Psychology, 3(1): 29, 1–12.
         Only Experiments 3-5 were preregistered. Experiments 1&2 were conducted in 2012.

Eerland, A., Engelen, J.A.A., Zwaan, R.A. (2013). The influence of direct and indirect speech on mental representations. PloS ONE 8(6):  e65480.

Hoeben-Mannaert, L., Dijkstra, K., & Zwaan, R.A. (2017). Is color an integral part of a rich mental simulation? Memory & Cognition, 45, 974–982.

Pouw, W.J.T.L., van Gog, T., Zwaan, R.A., Agostinho. S., & Paas, F. (in press). Co-thought gestures in children’s mental problem solving: Prevalence and effects on subsequent performance. Applied Cognitive Psychology.

Sherrill A.M., Eerland A., Zwaan R.A., & Magliano J.P. (2015). Understanding how grammatical aspect influences legal judgment. PLoS ONE 10(10): e0141181.


And finally, to show that of course I also wear a stetson, here is a theoretical paper on replication. Yeehaw!

Zwaan, R.A., Etz, A., Lucas, R.E., & Donnellan, M.B. (in press). Making replication mainstream. Behavioral and Brain Sciences.


Thursday, December 7, 2017

The Long and Winding Road of our Latest Grammatical Aspect Article

A short blog post that strings together 8 tweets that I sent out today about our new paper.

Today our latest paper on grammatical aspect appeared in Collabra: Psychology. The article reflects the times we psychologists are living in. It does so not from the lofty perspective of the methodologist or statistician, but from the work floor on which the actual scientist (**ducks**) operates.

Our first two experiments were inspired by Hart & Albarracín (2011). This research was itself inspired by some of our own work but took it from cognition into the realm of social psychology, as I described in this blog post.

As the paper explains, these experiments were run in 2012, which is why they were not preregistered. Nobody was doing preregistration at the time. We were looking to build on Hart and Albarracín (H&A) with what some would call a conceptual replication but which is better thought of as an extension.

For the life of us, we couldn’t get an effect like that of H&A. Then we got down to business and started a registered replication project in which we performed a direct replication of H&A. Along with 11 other labs, we found no effect.

We were sidetracked by the replication project, especially because there were some troubling issues with the initial response to our RRR, as I describe here. We were sidetracked to the point that I’d completely forgotten about our 2012 experiments.

Luckily my co-authors had not and we decided to pick up the pieces of our study. It was clear that our research could no longer be driven by our H&A-inspired hypothesis, so we took a slightly different tack.

We conducted three more experiments, now all pre-registered, which yielded some interesting new findings that you can read about in our paper. As usual with Collabra, the data are available and the reviews are open.