Tuesday, February 21, 2017

Replicating Effects by Duplicating Data

RetractionWatch recently reported on the retraction of a paper by William Hart. Richard Morey blogged in more detail about this case. According to the RetractionWatch report:

From this description I can only conclude that I am that “scientist outside the lab.” 

I’m writing this post to provide some context for the Hart retraction. First, “inconsistent” is rather a euphemism for what transpired in the events I’m about to describe. Second, this case did indeed involve a graduate student, whom I shall refer to as "Fox."

Back to the beginning. I was a co-author on a registered replication report (RRR) involving one of Hart’s experiments. I described this project in a previous post. The bottom line is that none of the experiments replicated the original finding and that there was no meta-analytic effect. 

Part of the RRR procedure is that original authors are invited to write a commentary on the replication report. The original commentary that was shared with the replicators had three authors: the two original authors (Hart and Albarracín) and Fox, who was the first author. A noteworthy aspect of the commentary was that it contained experiments. This was surprising (to put it mildly), given that one does not expect experiments in a commentary on a registered replication report, especially when these experiments themselves are not preregistered, as was the case here. Moreover, these experiments deviated from the protocol that we had established with the original authors. A clear case of double standards, in other words.

Also noteworthy was that the authors were able to replicate their own effect. And not surprising was that the commentary painted us as replication bullies. But with fake data, as it turns out.

The authors were made to upload their data to the Open Science Framework. I decided to take a look to see if I could explain the discrepancies between the successful replications in the commentary and all the unsuccessful ones in the RRR. I first tried to reproduce the descriptive and inferential statistics.  

Immediately I discovered some discrepancies between what was reported in the commentary and what was in the data file, both in condition means and in p-values. What could explain these discrepancies?

I decided to delve deeper and suddenly noticed a sequence of numbers, representing a subject’s responses, that was identical to a sequence several rows below. A coincidence, perhaps? I scrolled to the right where there was a column with verbal responses provided by the subjects, describing their thoughts about the purpose of the experiment. Like the number sequences, the two verbal responses were identical.

I then sorted the file by verbal responses. Lots of duplications started popping up. Here is a sample.

In all, there were 73 duplicates in the set of 194 subjects. This seemed quite alarming. After all, the experiment was run in the lab, so how does one come to think they ran 73 more subjects than they actually ran? In the lab no less. It's a bit like running a 25K and then saying afterwards "How 'bout them apples, I actually ran a marathon!" Moreover, given that the number of subjects was written out, it was clear that the authors intended to communicate that they had a sample of 194 and not 121 subjects. Equally important, the key effect was no longer significant when the duplicates were removed (p=.059).
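For readers who want to run this kind of check on a data file themselves, here is a minimal sketch in Python of the duplicate count. It is not the procedure I actually used (I sorted the file by hand), and the field names and toy rows are made up for illustration. It assumes that a "duplicate" is any row whose number sequence and verbal response both match an earlier row, which is how 194 rows with 73 duplicates leave 194 - 73 = 121 distinct subjects.

```python
from collections import Counter


def count_duplicates(rows, key_fields):
    """Return (n_unique, n_extra) for a list of dict rows.

    n_extra counts rows beyond the first occurrence of each key,
    so n_unique + n_extra == len(rows).
    """
    keys = [tuple(row[f] for f in key_fields) for row in rows]
    n_unique = len(Counter(keys))
    return n_unique, len(keys) - n_unique


def drop_duplicates(rows, key_fields):
    """Keep only the first occurrence of each key, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical toy data: "responses" and "purpose" are invented field names.
rows = [
    {"responses": "3,5,2,4", "purpose": "memory for words"},
    {"responses": "1,1,4,2", "purpose": "no idea"},
    {"responses": "3,5,2,4", "purpose": "memory for words"},
    {"responses": "2,4,4,5", "purpose": "something about grammar"},
]

n_unique, n_extra = count_duplicates(rows, ["responses", "purpose"])
print(n_unique, n_extra)
```

Matching on the free-text response as well as the numbers is what makes a coincidence implausible: two subjects may well produce the same ratings, but not the same sentence word for word.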

The editors communicated our concerns to the authors and pretty soon we received word that the authors had “worked night-and-day” to correct the errors. There was some urgency because the issue in which the RRR would appear was going to press.  We were reassured that the corrected data still showed the effect such that the conclusions of the commentary (“you guys are replication bullies”) remained unaltered and the commentary could be included in the issue.

Because I already knew that the key analysis was not significant after removal of the duplicates, I was curious how significance was reached in this new version. The authors had helpfully posted a “note on file replacement”: 

The first thing that struck me was that the note mentioned 69 duplicates whereas there were 73 in the original file. Also puzzling was the surprise appearance of 7 new subjects. I guess it pays to have a strong bullpen. With this new data collage, the p-value for the key effect was p=.028 (or .03).

A close comparison of the old and new data yields a different picture, though. The most important difference was that not 7 but 12 new subjects were added. In addition, for one duplicate both versions were removed. Renowned data sleuth Nick Brown analyzed these data separately from me and came up with the same numbers.
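A comparison like this can be sketched as a multiset difference between the two file versions. Again, this is an illustration under assumptions, not a record of how Nick Brown or I did it: each subject is reduced to a tuple of values, the toy rows are invented, and a row counts as "new" only if it has no exact match in the old file.

```python
from collections import Counter


def compare_versions(old_rows, new_rows):
    """Return (n_removed, n_added) between two lists of hashable rows.

    Counter subtraction keeps only positive counts, so a row that appears
    twice in the old file and once in the new file counts as one removal.
    """
    old = Counter(old_rows)
    new = Counter(new_rows)
    removed = old - new
    added = new - old
    return sum(removed.values()), sum(added.values())


def fully_removed(old_rows, new_rows):
    """Rows present in the old file with no surviving copy in the new file."""
    new_keys = set(new_rows)
    return sorted(set(old_rows) - new_keys)


# Hypothetical toy data, one tuple per subject.
old_rows = [
    ("3,5,2,4", "memory"),
    ("3,5,2,4", "memory"),   # duplicate, one copy survives below
    ("1,1,4,2", "no idea"),
    ("2,2,2,2", "unsure"),
    ("2,2,2,2", "unsure"),   # duplicate, both copies vanish below
]
new_rows = [
    ("3,5,2,4", "memory"),
    ("1,1,4,2", "no idea"),
    ("4,4,1,3", "priming?"),  # not in the old file at all
]

print(compare_versions(old_rows, new_rows))
print(fully_removed(old_rows, new_rows))
```

The second function is there for the case mentioned above, where both versions of a duplicate disappear rather than just the extra copy.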

So history repeated itself here. The description of the data did not match the data and the “effect” was again significant just below .05 after the mixing-and-matching process.

There was much upheaval after this latest discovery, involving all of the authors of the replication project, the editors, and the commenters. I suspect that had we all been in the same room there would have been a brawl. 

The upshot of all this commotion was that this version of the commentary was withdrawn. The issue of Perspectives on Psychological Science went to press with the RRR but without the commentary.  In a subsequent issue, a commentary appeared with Hart as its sole author and without the new "data."

Who was responsible for this data debacle? After our discovery of the initial data duplication, we received an email from Fox stating that "Fox and Fox alone" was responsible for the mistakes. This sounded overly legalistic to me at the time and I’m still not sure what to make of it. 

The process of data manipulation described here appears to be one of mixing-and-matching. The sample is a collage consisting of data that can be added, deleted, and duplicated at will until a p-value of slightly below .05 (p = .03 seems popular in Hart’s papers) is reached.

I wonder if the data in the additional papers by Hart that apparently are going to be retracted are produced by the same foxy mixing-and-matching process. I hope the University of Alabama will publish the results of its investigation. The field needs openness.