
Replicating Effects by Duplicating Data

RetractionWatch recently reported on the retraction of a paper by William Hart. Richard Morey blogged in more detail about this case. According to the RetractionWatch report:

From this description I can only conclude that I am that “scientist outside the lab.” 

I’m writing this post to provide some context for the Hart retraction. For one, “inconsistent” is rather a euphemism for what I’m about to describe. Second, this case did indeed involve a graduate student, whom I shall refer to as "Fox."

Back to the beginning. I was a co-author on a registered replication report (RRR) involving one of Hart’s experiments. I described this project in a previous post. The bottom line is that none of the experiments replicated the original finding and that there was no meta-analytic effect. 

Part of the RRR procedure is that original authors are invited to write a commentary on the replication report. The original commentary that was shared with the replicators had three authors: the two original authors (Hart and Albarracín) and Fox, who was the first author. A noteworthy aspect of the commentary was that it contained experiments. This was surprising (to put it mildly), given that one does not expect experiments in a commentary on a registered replication report, especially when these experiments themselves are not preregistered, as was the case here. Moreover, these experiments deviated from the protocol that we had established with the original authors. A clear case of double standards, in other words.

Also noteworthy was that the authors were able to replicate their own effect. Not surprisingly, the commentary painted us as replication bullies. But with fake data, as it turns out.

The authors were required to upload their data to the Open Science Framework. I decided to take a look to see if I could explain the discrepancies between the successful replications in the commentary and all the unsuccessful ones in the RRR. I first tried to reproduce the descriptive and inferential statistics.

Immediately I discovered some discrepancies between what was reported in the commentary and what was in the data file, both in condition means and in p-values. What could explain these discrepancies?

I decided to delve deeper and suddenly noticed a sequence of numbers, representing a subject’s responses, that was identical to a sequence several rows below. A coincidence, perhaps? I scrolled to the right where there was a column with verbal responses provided by the subjects, describing their thoughts about the purpose of the experiment. Like the number sequences, the two verbal responses were identical.

I then sorted the file by verbal responses. Lots of duplications started popping up. Here is a sample.

In all, there were 73 duplicates in the set of 194 subjects. This seemed quite alarming. After all, the experiment was run in the lab, and how does one come to think they ran 73 more subjects than they actually ran? In the lab no less. It's a bit like running 25k and then saying afterwards "How bout them apples, I actually ran a marathon!" Also, given that the number of subjects was written out, it was clear that the authors intended to communicate they had a sample of 194 and not 121 subjects. Crucially, the key effect was no longer significant when the duplicates were removed (p=.059).
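For readers who want to run the same kind of check on a posted data file, the procedure above (sort by the free-text column, then flag rows whose full response pattern repeats) can be sketched in a few lines of pandas. The column names here are invented for illustration; they are not the actual variable names in the OSF file.

```python
import pandas as pd

# Toy stand-in for the posted data file; column names are assumptions.
df = pd.DataFrame({
    "response_1": [5, 3, 5, 7, 3],
    "response_2": [2, 6, 2, 1, 6],
    "verbal_response": [
        "I think it was about memory",
        "Something about attitudes",
        "I think it was about memory",   # identical to row 0
        "No idea",
        "Something about attitudes",     # identical to row 1
    ],
})

# Sorting by the verbal-response column makes identical rows adjacent,
# which is how the duplications "popped up" when scanning the file.
sorted_df = df.sort_values("verbal_response")

# Programmatically: flag every row whose entire response pattern has
# already appeared earlier in the file.
dupes = df[df.duplicated(keep="first")]
print(len(dupes))
```

With `keep="first"`, each duplicated pattern counts once per extra copy, which matches the "73 duplicates in 194 subjects, hence 121 real subjects" arithmetic in the post.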

The editors communicated our concerns to the authors and pretty soon we received word that the authors had “worked night-and-day” to correct the errors. There was some urgency because the issue in which the RRR would appear was going to press.  We were reassured that the corrected data still showed the effect such that the conclusions of the commentary (“you guys are replication bullies”) remained unaltered and the commentary could be included in the issue.

Because I already knew that the key analysis was not significant after removal of the duplicates, I was curious how significance was reached in this new version. The authors had helpfully posted a “note on file replacement”: 

The first thing that struck me was that the note mentioned 69 duplicates whereas there were 73 in the original file. Also puzzling was the surprise appearance of 7 new subjects. I guess it pays to have a strong bullpen. With this new data collage, the p-value for the key effect was p=.028 (or .03).

A close comparison of the old and new data yields a different picture, though. The most important difference was that not 7 but 12 new subjects were added. In addition, for one duplicate both versions were removed. Renowned data sleuth Nick Brown analyzed these data separately from me and came up with the same numbers.
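The old-versus-new comparison that turned up the 12 added subjects can be done mechanically with an outer merge that labels each row by which file it came from. This is a generic sketch of that technique, not Nick Brown's actual script; the subject IDs and scores are made up.

```python
import pandas as pd

# Toy "old" and "new" versions of a data file; values are invented.
old = pd.DataFrame({"subject": [1, 2, 3, 4], "score": [10, 12, 9, 15]})
new = pd.DataFrame({"subject": [2, 3, 4, 5, 6], "score": [12, 9, 15, 11, 8]})

# An outer merge with indicator=True tags each row as present in the
# old file only, the new file only, or both.
diff = old.merge(new, how="outer", indicator=True)
removed = diff[diff["_merge"] == "left_only"]    # rows dropped
added = diff[diff["_merge"] == "right_only"]     # rows newly appearing
print(len(removed), len(added))
```

Counting the `left_only` and `right_only` rows is what exposes a mismatch between the posted "note on file replacement" and what was actually changed.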

So history repeated itself here. The description of the data did not match the data and the “effect” was again significant just below .05 after the mixing-and-matching process.

There was much upheaval after this latest discovery, involving all of the authors of the replication project, the editors, and the commenters. I suspect that had we all been in the same room there would have been a brawl. 

The upshot of all this commotion was that this version of the commentary was withdrawn. The issue of Perspectives on Psychological Science went to press with the RRR but without the commentary.  In a subsequent issue, a commentary appeared with Hart as its sole author and without the new "data."

Who was responsible for this data debacle? After our discovery of the initial data duplication, we received an email from Fox stating that "Fox and Fox alone" was responsible for the mistakes. This sounded overly legalistic to me at the time and I’m still not sure what to make of it. 

The process of data manipulation described here appears to be one of mixing-and-matching. The sample is a collage consisting of data that can be added, deleted, and duplicated at will until a p-value of slightly below .05 (p = .03 seems popular in Hart’s papers) is reached.

I wonder if the data in the additional papers by Hart that apparently are going to be retracted are produced by the same foxy mixing-and-matching process. I hope the University of Alabama will publish the results of its investigation. The field needs openness.


  1. Thanks for this excellent post and thanks for uncovering this blatant case of fraud!

    "Fox" has evidently had a long and fraudulent career. According to Hart, it was Fox who faked the data for a 2013 paper (sole author Hart) which is now going to be retracted. But that paper was submitted on 22nd December 2011. Hmm.

    1. Yes, that's one of the things that's odd about this situation. I wonder if Fox started out as an undergraduate student, was discovered to have "flair" and was then recruited into the grad program. This is all speculation on my part, though.

2. Has it been ruled out that Fox is just taking the blame? If you know the identity of this person and if you can find a CV online, you should be able to find out if Fox was indeed a student at the same university in 2011. No acknowledgements were made to anybody else for data collection help either. At the very least, Hart is unethical in not acknowledging the contributions of students to his papers.

      It is also amazing that the methods section in that paper (linked to by Neuroskeptic) doesn't mention what university the participants came from or what IRB approved the research.

      Excerpts from the paper:
      "To measure happiness, I had participants rate their current level of happiness, using a scale from 0 (unhappy) to 10 (happy), and their satisfaction with life, using a scale from 0 (unsatisfied) to 10 (satisfied; Strack et al., 1985); participants were told that these ratings were of interest to a university panel."

      "Subsequently, I told participants that I was interested in their experience during the task."

      So I don't get how Fox and Fox alone could be responsible for any and all mistakes.

    3. I'm not sure it has been ruled out that he's just taking the blame. As I said, I really don't know what to make of it.

      Yes, I noticed the lack of specificity as well. In the original commentary we were blamed for not having realized we should have run politically conservative subjects. As with the article you're referring to, the article we were dealing with didn't even specify where the subjects were from.

