Tuesday, January 1, 2013

Coming out of the file drawer

My previous post was about the why of replication studies. This one is about my first foray into the replication business. That is, my first venture outside the file drawer (where several nonreplications of other people’s work reside, as well as nonreplications of studies of my own that were never submitted because we were unable to replicate the initial finding). I’m coming out of the file drawer, so to speak.

I’m not going to discuss the contents of the study here. I’m just going to talk about a couple of things my co-author, Diane Pecher, and I learned from our replication efforts.

I’ve got the power!
Psychology experiments are chronically underpowered. Simmons, Nelson, and Simonsohn suggest you need at least 20 subjects per condition, which is more than many psychology experiments have. At a recent symposium, a statistician even said that to be informative, experiments should have at least 100 subjects; otherwise they are merely exploratory (I’m paraphrasing). I have heard people scoff at these suggestions (they may not be feasible for studies using special populations and not necessary for psychophysics experiments) but whatever the right number is, it is true that Ns are too small in the vast majority of psychology experiments, including my own.
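To get a feel for what "underpowered" means at these sample sizes, here is a rough sketch of a power calculation. It rests on assumptions that are not from the studies discussed here: a two-group between-subjects design, a two-sided test at alpha = .05, a normal approximation to the t-test, and an illustrative effect size of d = 0.5. The function name `approx_power` is made up for this example.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test.

    Uses the normal approximation to the t-test, so the result is
    slightly optimistic for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)       # critical value, about 1.96 for alpha = .05
    shift = d * (n_per_group / 2) ** 0.5    # expected test statistic under the alternative
    # Probability of landing beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative effect size d = 0.5 (an assumption, not a value from this post)
for n in (20, 50, 100):
    print(n, "per group -> power of about", round(approx_power(0.5, n), 2))
```

Under these assumptions, power climbs steeply between 20 and 100 subjects per group, which is the gist of the recommendations above. An exact calculation would use the noncentral t distribution instead of the normal approximation.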

Running 100+ subjects is difficult to accomplish in many labs given the size of the subject pools and the availability of lab space. My guess is that it would take the better part of a year to run a study of that size in our lab. But no need to worry; there are alternatives. We ran our experiments on Mechanical Turk, a crowdsourcing marketplace run by Amazon. Turkers participate for small amounts of money in HITs (Human Intelligence Tasks). Thousands of people in the United States and India are registered in the system. We limited our samples to people living in the United States, primarily because two of the experiments we were trying to replicate were run in the United States (the other two were run in England).

Keeping false positives at Bayes
With a large N, classic null-hypothesis significance tests will flag even trivial effects: an inconsequential difference might show up as significant. An alternative is to compute the Bayes factor, a likelihood ratio that lets you assess the strength of the evidence for the alternative hypothesis relative to the null hypothesis, or the other way around. To be conclusive, the Bayes factor requires more evidence for the alternative hypothesis with larger samples than, for example, a t-test does, but it also allows you to determine whether a small effect is consequential. Bayes factors can easily be computed using Jeffrey Rouder’s web site at the University of Missouri. You just put in a t-value and the sample size and it will return the Bayes factor (actually three of them; we used the JZS Bayes factor).
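For readers who want to see what goes into such a number, here is a minimal sketch of the JZS Bayes factor computed from a t-value and a sample size. It follows the integral formula as I understand it from Rouder and colleagues' 2009 paper on Bayesian t-tests, with the prior scale fixed at 1. It is not the code behind the web calculator, and the function name `jzs_bf01`, the integration limits, and the step count are choices made for this example.

```python
import math


def jzs_bf01(t, n1, n2=None, lo=-10.0, hi=40.0, steps=4000):
    """Approximate JZS Bayes factor BF01 (null over alternative), prior scale 1.

    Pass only n1 for a one-sample or paired t-test; pass n1 and n2 for a
    two-sample t-test. The integral over g is evaluated with Simpson's
    rule after the substitution g = exp(u).
    """
    if n2 is None:
        n_eff, df = n1, n1 - 1                          # one-sample / paired
    else:
        n_eff, df = n1 * n2 / (n1 + n2), n1 + n2 - 2    # two-sample

    # Marginal likelihood of t under the null (up to a shared constant)
    null_like = (1 + t * t / df) ** (-(df + 1) / 2)

    def integrand(u):
        g = math.exp(u)
        a = 1 + n_eff * g
        # Inverse chi-square(1) prior on g, which yields a Cauchy prior on effect size
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        like = a ** -0.5 * (1 + t * t / (a * df)) ** (-(df + 1) / 2)
        return like * prior * g                         # trailing g is dg/du

    # Simpson's rule on [lo, hi]; steps must be even
    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    alt_like = total * h / 3

    return null_like / alt_like


bf01 = jzs_bf01(2.0, 50)    # hypothetical values: t = 2.0 with 50 subjects
print("BF01 =", bf01, " BF10 =", 1 / bf01)
```

Values of BF01 above 1 favor the null hypothesis and values below 1 favor the alternative; BF10 is simply the reciprocal. The t-value and sample size in the usage line are invented for illustration.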

Unlike standard hypothesis-testing statistics, Bayesian statistics don’t force you to define your sampling plan ahead of time. According to a very insightful paper by Wagenmakers and colleagues, in a must-read special issue of Perspectives on Psychological Science, you can continue collecting data until the Bayes factor seems to stabilize (I must admit the article is a bit hazy on this part, or maybe I am). In our case it meant that we could compute a combined Bayes factor over two experiments that were essentially identical, which gave us even more power. This move was suggested to us by Eric-Jan Wagenmakers, an expert in Bayesian statistics (which I am most definitely not).

Two heads are better than one
Armed with our large samples and Bayes factors, we were ready to analyze the data. And here we did something that I think is highly unusual in psychological research. We each performed our own analysis of the data and then compared our results. We were humbled to see that on several occasions we didn’t get the same outcome. True, we weren’t far apart and the differences were inconsequential and easy to resolve, but it taught us a good lesson. It is important to have multiple people analyze the data—an error is easily made (my bet is that the literature is replete with them). The files that I created to analyze the data (which include the raw data) can be found here.

Taking the experimenter and the lab out of the loop
One big advantage of on-line experiments is that there is no experimenter involved, so there cannot be any experimenter effects. Whatever results you obtain, they cannot be caused by the professional demeanor, friendly attitude, white lab coat, or short skirt of your research assistant.

There is another advantage. Turkers don’t go to the lab to participate in experiments. They might be at home on the couch, in the office pretending to do their regular job, on the train, in the airport, or in a coffee shop (though preferably not in what the Dutch call a coffee shop). We ask subjects about their environment and the noise level in it, and they generally tell us that they work in quiet environments. We tend to believe them because they are highly conscientious subjects. They often provide thoughtful feedback on our experiments.

But how can lack of control be an advantage? It is an advantage in terms of reproducibility. Evidently, results like ours were not caused by the academic setting of the experiment, the color of the walls in the experiment room, the close confines of a cubicle (though some Turkers probably operate from cubicles), the red light on the door of the experiment room, and so on.

This means that replication attempts of on-line studies are relatively straightforward. For example, if you wanted to replicate our study, you could get our data-collection programs (contact me, as I still need to post them online), create a link to them on Mechanical Turk, and with a couple hundred dollars, you’re in business. You would have your data within a day or so.

But what did you find and what does it mean?

More about this in the next blog.


  1. Great start to the blog -- the metaphors are great!

  2. Thanks for the link to the Bayesian calculator. Are there guidelines yet for what Bayesian ratios we should expect in most psychology experiments?

    I'm running a study now involving motor priming and the initial results look good from the old school perspective: p = 0.05, Cohen's d = .55, N=50, between subjects design. But I plug the numbers into the Bayesian calculator and the ratio is only 0.28, far less than the ratio you cited in your Turk replication.

    The 95% CIs for each condition still overlap, to the extent that I imagine I'd need at least 50 more subjects to show a clear separation, even though the p-value could be driven well below .05 much sooner than the CIs diverge.

    These are the cleanest results I've ever obtained in a first attempt at a new concept. I'm not sure my study could be run through Turk at all, and I'm pretty sure I couldn't access participants with the required demographic backgrounds. I have a large subject pool, so I can add 50 more in February/March without difficulty, but as you indicate, I think most researchers face severe limitations with studies that can't be executed online. Effect sizes in social/cognitive psychology aren't usually high enough to be captured reliably with small samples (not that ES depends on N mathematically, but that there's a relationship between p values, sample sizes, and effect sizes in conducting studies).

    And through all of this, I'm not sure I even understand how p values and Bayesian ratios systematically relate to each other, if they do. My impression is that Bayesian analysis tests the specific alternative that's being used in the study, as opposed to merely indicating the probability of getting these results when the conditional populations are theoretically identical. It shifts our statistical attention toward the likelihood of the alternative hypothesis rather than the probability of our data existing in a null world? This distinction seems so subtle if you're used to (mis-)using p values as statements about your hypothesis.

    (Why the hell is my ID here listed as Dr M.?? This is Mark, btw).

  3. Let me preface this by repeating I'm not at all an expert in Bayesian statistics. However, I believe this article, http://pps.sagepub.com/content/6/3/291.full.pdf, might be helpful to you. It addresses a lot of your questions, including the relation between p-values and Bayes factors and how to interpret the latter. I believe yours is already nothing to sneeze at.

    Thanks for reading the blog!

  4. Ah but what I wanted to ask is whether the new standards could put a lot of social/cog questions out of reach for many people in these fields using p-values now as sufficient evidence. How many people are getting effects that would stand up to Bayesian analysis? Or non-overlapping CIs? The standards for publication in regard to effects must, in a way, come down if the standards for analysis go up. A theoretically sound design should be publishable regardless of the results as long as the results are informative.

    It's remarkable how often I've found people talking about non-replications in their labs (usually of other people's work) as if those efforts provided valuable information yet journals look with deep suspicion on null results. I asked one editor about this and he replied that there could be many reasons for not seeing a difference between conditions! Annnnnnnnd?

    I think we must also abandon the Platonic mode of writing and publishing in which studies are reported in ways that fit schematic views of scientific research. At SPSP a couple years ago the editor of one of the top social journals was asked whether he liked to see a chronology of the work with various hypotheses considered or a clean story, and he dismissed chronologies as mere historical records. I kind of snorted and looked around, expecting similar reactions, but if anyone else had a problem with his response, I didn't see it. If we constantly reframe the research process in publications with a fictional narrative that obscures the exploratory nature of much psychological research, the clean story that makes for easy bedtime reading only fuels the conceptual and statistical misunderstandings that have made so many "findings" in print highly doubtful now in retrospect. Not every notion and every lab debate needs to be reported but somehow our publications should reflect the actual process more than they obscure it. This would reflect the humility of science so much discussed in introductory textbooks.

  5. I don't think it's an either-or-situation. If the standards for analysis go up, the standards for publication in regard to effects don't necessarily have to go down. Maybe the publication pressure should go down instead so that researchers can take more time to run studies. However, I'm the first one to admit that I'm a very impatient guy, so I'm glad that I can run online experiments.

    Everybody always says it's hard to get non-replications published but I wonder how much of this is self-handicapping. In my three years as Editor-in-Chief, I've handled well over 1,000 manuscripts. Practically none of them featured non-replications.

    There should be a place for both chronologically accurate and plot-based accounts. Aristotle (if I may throw a Greek philosopher back at you) already argued that historians should use the former and dramatists and epicists the latter. Maybe the chronological accounts should go into archival journals and maybe the plot-based ones in blogs. The former would be part of the scientific record and the latter would be ways to inform a broader audience.

    In my next post, I'll describe a chronologically-based 14-experiment Behemoth we are writing at the moment.