Wednesday, July 26, 2017

Defending .05: It’s Not Enough to be Suggestive

Today another guest post. In this post, Fernanda Ferreira and John Henderson respond to the recent and instantly (in)famous multi-authored proposal to lower the level of statistical significance to .005. If you want to discuss this post, Twitter is the medium for you. The authors' handles are @fernandaedi and @JhendersonIMB.


Fernanda Ferreira
John M. Henderson

Department of Psychology and Center for Mind and Brain
University of California, Davis


The paper “Redefine Statistical Significance” (henceforth, the “.005 paper”), written by a consortium of 72 authors, has already made quite a splash even though it has yet to appear in Nature Human Behaviour. The proposed redefinition of the threshold for statistical significance from .05 to .005 would have profound consequences across psychology, and it is not clear to us that the broad implications across the field have been thoroughly considered. As cognitive psychologists, we have major concerns about the advice and the rationale for this severe prescription.

In cognitive psychology we test theories motivated by a body of established findings, and the hypotheses we test are derived from those theories. It is therefore rarely true that any experimental outcome will be treated as equally likely. Our field is not effects-driven—we’re in the business of building and testing functional theories of the mind and brain, and effects are always connected back to those theories.

Standard practice in our subfield of psychology has always been based on replication. This has been extensively discussed in the literature and in social media, but it seems helpful to repeat the point: All of us were trained to design and conduct a theoretically motivated experiment, then design and conduct follow-ups that replicate and extend the theoretically important findings, often using converging operations to show that the patterns are robust across measures. This is why the stereotype emerged that cognitive psychology papers were typically three experiments and a model, where “model” is the subpart of the theory tested and elaborated in this piece of research.

Standard practice is also to motivate new research projects from theory and existing literature; the idea for a study doesn’t come out of the blue. And the first step when starting a new project is to make sure the finding or phenomenon to be built upon replicates. Then the investigator goes on to tweak it, play with it, push it, etc., all in response to refined hypotheses and predictions that fall out of the theory under investigation.*

Now, at this point, even if you agree with us, you might be thinking, “Well, what would be the harm in going to a more conservative statistical criterion? Requiring .005 would only have benefits, because then we guard against Type I error and we avoid cluttering up the literature with non-results.” Unfortunately, as many have pointed out in informal discussions concerning the .005 paper, and as the .005 paper acknowledges as well, there are tradeoffs.

First, if you do research on captive undergraduates or you use M-Turk samples, then Ns in the hundreds might be no big deal. In the article, the authors estimate that a shift to .005 will necessitate at least a 70% increase in sample sizes, and they suggest this is not too high a price to pay. But setting aside the issue of non-convenience samples, this estimate is for simple effects, and we’re rarely looking for simple effects. In our business it’s all about statistical interactions, and for those, this recommendation can lead to much larger increases in sample size. And if your field requires you to test non-convenience samples such as heritage language learners, or people with any type of neuropsychological condition such as aphasia, or people with autism, dyslexia, or ADHD, or even just typically developing children, then these Ns might be unattainable. Testing such participants also requires trained, expensive staff. And yet the research might be theoretically and practically important. So if you work with these non-convenience samples, subject testing is costly. It probably requires real money to pay those subjects and the research assistants doing the testing, and the money is almost always going to come from research grants. And we all know what the situation is with respect to research funding—there’s very little of it. But even if you had the money, and you didn’t care that it came at the expense of the funding of maybe some other scientist’s project, where would you find the large numbers of people that this shift in alpha level would require? What this means in practice is that some potentially important research will not get done.
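For readers who want to see where a figure of roughly 70% can come from, here is a minimal sketch. It is our illustration, not the calculation from the .005 paper, and it rests on assumptions we chose: a two-sided test of a single simple effect, 80% power, a fixed true effect size, and a normal approximation in which the required sample size is proportional to the square of the sum of the two critical z values. The function name `relative_n` is hypothetical.

```python
from statistics import NormalDist


def relative_n(alpha_new, alpha_old=0.05, power=0.80):
    """Ratio of required sample sizes when alpha changes.

    Assumes a two-sided z-test of one simple effect with the same
    true effect size under both alpha levels (normal approximation).
    """
    z = NormalDist().inv_cdf  # quantile function of the standard normal

    def required(alpha):
        # For a fixed effect size, n is proportional to this quantity.
        return (z(1 - alpha / 2) + z(power)) ** 2

    return required(alpha_new) / required(alpha_old)


ratio = relative_n(0.005)
print(f"Moving from .05 to .005 needs about {(ratio - 1) * 100:.0f}% more participants")
```

Under these assumptions the ratio comes out near 1.7, in line with the estimate quoted above. The sketch covers only the simple-effect case; it says nothing about interactions, where the cost depends on the design and on the size of the interaction relative to the simple effects.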

Let’s turn now to Type II error. The authors of the .005 piece, to their credit, discuss the tradeoff between settling for Type I versus Type II error, and they come down on the side that Type I is costlier. But this can’t be true as a blanket statement. Missing a potential effect because you’ve set the false positive rate so conservatively could have major implications for theory development as well as for practical interventions. A false positive is a thing that a researcher might follow up and discover to be illusory; but a false negative is not a thing and therefore is likely to be ignored and never followed up, which means that a potentially important discovery will be missed.

Some have noted that the negative reaction to the .005 article has been surprisingly strong. A response we’ve heard to the kinds of concerns we’ve expressed is that the advocates of the .005 paper are not urging .005 as a publication standard, but merely as the alpha level that permits the use of the word “significant” to describe results. However, it is easy to foresee a world in which (if these recommendations are adopted) editors and reviewers start demanding .005 for significance and use it as a publication standard. After all, the goal of the piece presumably isn’t just to fiddle with terminology.

We think the strong reaction against .005 is also in part because common practice in different areas of psychology is not well represented by those advocating major changes to research practice like the .005 proposal. Relatedly, we think it’s unfortunate that, today, in the popular media, one frequently sees references to “the crisis in psychology”, when those of us inside psychology know that the entire field is not in crisis. The response from these advocates might be to say that we’re in denial, but we’re not – as we outlined earlier, the approach to theory building, testing, replication, and cumulative evidence that’s standard in cognitive psychology (and other subareas of psychology) makes it unlikely that a cute but illusory effect will survive.

So our frustration is real. We would like to see the conversation in psychology about scientific integrity broadened to include other subfields such as ours, and many others.

-----
*When we say these are standard practices in cognitive psychology, we don’t intend to imply that these practices are not standard in other areas; we’re simply talking about cognitive psychology because it’s the area with which we’re most familiar.