Saturday, March 4, 2017

The value of experience in criticizing research

It's becoming a trend: another guest blog post. This time, J.P. de Ruiter shares his view, which I happen to share, on the value of experience in criticizing research.


J.P. de Ruiter
Tufts University

One of the reasons that the scientific method was such a brilliant idea is that it has criticism built into the process. We don’t believe something on the basis of authority; we need to be convinced by relevant data and sound arguments, and if we think that either the data or the argument is flawed, we say so. Before a study is conducted, this criticism is usually provided by colleagues or, in the case of preregistration, by reviewers. After a study is submitted, critical evaluations are performed by reviewers and editors. But even after publication, the criticism continues, in the form of discussions in follow-up articles, at conferences, and on social media. This self-corrective aspect of science is essential, hence criticism, even though it can at times be difficult to swallow (we are all human), is a very good thing.

We often think of criticism as pointing out flaws in the data collection, statistical analyses, and argumentation of a study. In methods education, we train our students to become aware of the pitfalls of research. We teach them about assumptions, significance, power, interpretation of data, experimenter expectancy effects, Bonferroni corrections, optional stopping, etc. etc. This type of training leads young researchers to become very adept at finding flaws in studies, and that is a valuable skill to have.  

While I appreciate that noticing and formulating the flaws and weaknesses in other people’s studies is a necessary skill for becoming a good critic (or reviewer), it is in my view not sufficient. It is very easy to find flaws in any study, no matter how well it is done. We can always point out alternative explanations for the findings, note that the data sample was not representative, or state that the study needs more power. Always. So pointing out why a study is not perfect is not enough: good criticism takes into account that research always involves a trade-off between validity and practicality. 

As a hypothetical example: if we review a study about a relatively rare type of aphasia and notice that the authors have studied 7 patients, we could point out that a) in order to generalize their findings, they need inferential statistics, and b) in order to do that, given the estimated effect size at hand, they’d need at least 80 patients. We could, but we probably wouldn’t, because we would realize that it was probably hard enough to find 7 patients with this affliction to begin with, so finding 80 is probably impossible. So then we’d probably focus on other aspects of the study. We do, of course, keep in mind that we can’t generalize from the results of this study with the same level of confidence as from a lexical decision experiment with a within-subject design and 120 participants. But we are not going to say, “This study sucks because it had low power.” At least, I want to defend the opinion here that we shouldn’t say that.
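For readers who like to see the arithmetic, here is a minimal sketch of the kind of power calculation such a reviewer might have in mind, using statsmodels. The effect size is an assumption chosen purely for illustration, and the figure of 80 patients in the example above is hypothetical, not the output of this script.

```python
# Minimal sketch of a sample-size calculation for a two-group comparison.
# The assumed effect size is illustrative only.
from statsmodels.stats.power import TTestIndPower

assumed_d = 0.63  # assumed (illustrative) effect size, Cohen's d

n_per_group = TTestIndPower().solve_power(
    effect_size=assumed_d,
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
print(f"Required per group: {n_per_group:.1f}")
print(f"Total required: {2 * n_per_group:.0f}")  # on the order of 80 under these assumptions
```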

While this is a rather extreme example, I believe that this principle should be applied to all levels and aspects of criticism. I remember that as a grad student, a local statistics hero informed me that my statistical design was flawed, and proceeded to require an ANOVA that was way beyond the computational capabilities of even the most powerful supercomputers available at the time. We know that full linear mixed models (LMMs) with random slopes and intercepts often do not converge. We know that many Bayesian analyses are intractable. In experimental designs, one runs into practical constraints as well. Many independent variables simply can’t be studied in a within-subject design. Phenomena that only occur spontaneously (e.g., iconic gestures) cannot be fully controlled. In EEG studies, it is not feasible to control for artifacts due to muscle activity, hence studying speech production is not really possible with this paradigm.
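To illustrate the point about mixed models, here is a sketch, with hypothetical data and column names, of the kind of “maximal” specification (by-subject random intercepts and random slopes) that frequently fails to converge in practice.

```python
# Sketch of a "maximal" mixed-model specification (hypothetical data and
# column names): by-subject random intercepts and slopes for condition.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment_data.csv")  # hypothetical file with rt, condition, subject

model = smf.mixedlm(
    "rt ~ condition",        # fixed effect of condition
    data=df,
    groups="subject",        # random effects grouped by subject
    re_formula="~condition",  # random intercept plus random slope for condition
)
result = model.fit()  # in practice this may emit a ConvergenceWarning
print(result.summary())
```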

My point is: good research is always a compromise between experimental rigor, practical feasibility, and ethical considerations. To be able to appreciate this as a critic, it really helps to have been actively involved in research projects. Not only because that gives us a better appreciation of the trade-offs involved, but also, perhaps more importantly, because it gives us the experience of really wanting to discover, prove, or demonstrate something. It makes us experience first-hand how tempting it can be, in Feynman’s famous formulation, to fool ourselves. I do not mean to say that we should become less critical, but rather that we become better, more constructive critics if we are able to empathize with the researcher’s goals and constraints. Nor do I want to say that criticism by those who have not yet had positive research experience is to be taken less seriously. All I want to say here is that (and why) having been actively involved in the process of contributing new knowledge to science makes us better critics.

Thursday, March 2, 2017

Duplicating Data: The View Before Hindsight

Today a first in this blog: a guest post! In this post Alexa Tullett reflects on the consequences of Fox's data manipulation, which I described in the previous post, for her own research and that of her collaborator, Will Hart.


Alexa Tullett
University of Alabama

[Disclaimer: The opinions expressed in this post are my own and not the views of my employer]

When I read Rolf’s previous post about the verb aspect RRR, I resonated with much of what he said. I have been in Rolf’s position before as an outside observer of scientific fraud, and I have a lot of admiration for his work in exposing what happened here. In this case, I’m not an outside observer. Although I was not involved with the RRR that Rolf describes in detail, I was a collaborator of Fox’s (I’ll keep up the pseudonym), and my name is on papers that have been, or are in the process of being, retracted. I also continue to be a collaborator of Will Hart’s, and hope to be for a long time to come. Rolf has been kind enough to allow me space here to provide my perspective on what I know of the RRR and the surrounding events. My account is colored by my personal relationships with the people involved, and while this unquestionably undermines my ability to be objective, perhaps it also offers a perspective that a completely detached account cannot.

I first became involved in these events after Rolf requested that Will re-examine the data from his commentary for the RRR. Will was of the mind that data speak louder than words, so when the RRR did not replicate his original study, he asked Fox to coordinate data collection for an additional replication. Fox was not an author on the original paper and was not told the purpose of the replication. Fox ran the replication, sent the results to Will, and Will sent those and his commentary to Rolf. Will told me that he had reacted defensively to Rolf’s concerns about these data, but eventually Will started to have his own doubts. These doubts deepened when Will asked Fox for the raw data and Fox said he had deleted the online studies from Qualtrics because of “confidentiality” issues. After a week or two of communicating with the people at Qualtrics, Will was able to obtain the raw data, and at this point he asked me if I would be willing to compare it with the “cleaned” data he had sent to Perspectives.

I will try to be as transparent as possible in documenting my thought process at the time these events unfolded. It’s easy to forget – or never consider – this naïve perspective once fraud becomes uncontested. When I first started to look at the data, I was far from the point where I seriously entertained the possibility that Fox had tampered with the data. I thought scientific fraud was extremely rare. Fox was, in my mind, a generally dependable and well-meaning graduate student. Maybe he had been careless with these data, but it seemed far-fetched to me that he had intentionally changed or manipulated them.

I started by looking for duplicates, because this was the concern that Will had passed along from Rolf. They weren’t immediately obvious to me, because the participant numbers (the only unique identifiers) had been deleted by Fox. But, when I sorted by free-response answers several duplicates became apparent, as one can see in Rolf’s screenshot. There were more duplicates as well, but they were harder to identify for participants who hadn’t given free-response answers. I had to find these duplicates based on patterns of Likert-scale answers. I considered how this might have happened, and thought that perhaps Fox had accidentally downloaded the same condition twice, rather than downloading the two conditions. As I looked at these data further I realized that there had also been deletions. I speculated that Fox had been sloppy when copying and pasting between datasets – maybe some combination of removing outliers without documenting them and accidentally repeatedly copying cases from the same dataset.
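For concreteness, here is a rough sketch of that duplicate hunt in pandas. All file and column names are hypothetical; the point is only that, with the identifiers gone, duplicates can be found only by matching on the responses themselves.

```python
# Sketch of the duplicate hunt described above, with hypothetical file and
# column names. With participant IDs deleted, duplicates can only be spotted
# by matching on the response patterns.
import pandas as pd

df = pd.read_csv("cleaned_data.csv")  # hypothetical file
likert_cols = [c for c in df.columns if c.startswith("item_")]  # assumed naming
response_cols = ["free_response"] + likert_cols                 # assumed free-text column

# Sorting by the free-response answers puts identical text side by side,
# which is how the first duplicates became visible.
df = df.sort_values(by="free_response")

# Flag every row whose full response pattern (free text + Likert items) repeats.
dupes = df[df.duplicated(subset=response_cols, keep=False)]
print(f"{len(dupes)} rows share an identical response pattern")
```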

I only started to genuinely question Fox’s intentions when I ran the key analysis on the duplicated and deleted cases and tested the interaction. Sure enough, the effect was there in the duplicated cases, and absent in the deleted cases. This may seem like damning evidence, but to be honest I still hadn’t given up on the idea that this might have happened by accident. Concluding that this was fraud felt like buying into a conspiracy theory. I only became convinced when Fox eventually admitted that he had done this knowingly. And had done the same thing with many other datasets that were the foundation of several published papers—including some on which I am an author.
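Below is a generic sketch of that comparison. The actual design and variable names are not described here, so dv, factor_a, and factor_b are placeholders; the idea is simply to fit the same interaction model separately to the duplicated and the deleted rows and inspect the interaction term in each.

```python
# Generic sketch: fit the same interaction model on the duplicated and the
# deleted cases. File, variable, and factor names are placeholders, not the
# actual study's design.
import pandas as pd
import statsmodels.formula.api as smf

duplicated_rows = pd.read_csv("duplicated_cases.csv")  # hypothetical files
deleted_rows = pd.read_csv("deleted_cases.csv")

for label, subset in [("duplicated", duplicated_rows), ("deleted", deleted_rows)]:
    fit = smf.ols("dv ~ factor_a * factor_b", data=subset).fit()
    # The p-value(s) of the interaction term(s) are the quantity of interest.
    print(label, fit.pvalues.filter(like=":"))
```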

Fox confessed to doing this on his own, without the knowledge of Will, other graduate students, or collaborators. Since then, a full investigation by UA’s IRB has drawn the same conclusion. We were asked not to talk about these events until that investigation was complete.

Hindsight’s a bitch. My thinking prior to Fox’s confession seems as absurd to me as it probably does to you. How could I have been so naively reluctant to consider fraud? How could I have missed duplicates in datasets that I handled directly?  I think part of the answer is that when we get a dataset from a student or a collaborator, we assume that those data are genuine. Signs of fraud are more obvious when you are looking for them. I wish we had treated our data with the skepticism of someone who was trying to determine whether they were fabricated, but instead we looked at them with the uncritical eye of scientists whose hypotheses were supported.

Fox came to me to apologize after he admitted to the fabrication. He described how and why he started tampering with data. The first time it happened he had analyzed a dataset and the results were just shy of significance. Fox noticed that if he duplicated a couple of cases and deleted a couple of cases, he could shift the p-value to below .05. And so he did. Fox recognized that the system rewarded him, and his collaborators, not for interesting research questions, or sound methodology, but for significant results. When he showed his collaborators the findings they were happy with them—and happy with Fox.

The silver lining. I’d like to think I’ve learned something from this experience. For one thing, the temptation to manipulate and fake data, especially for junior researchers, has become much more visible to me. This has made me at once more understanding and more cynical. Fox convinced himself that his research was so trivial that faking data would be inconsequential, and so he allowed his degree and C.V. to take priority. Other researchers have told me it’s not hard to relate. Now that I have seen and can appreciate these pressures, I have become more cynical about the prevalence of fraud.

My disillusionment is at least partially curbed by the increased emphasis on replicability and transparency that has occurred in our field over the past five years. Things have changed in ways that make it much more difficult to get away with fabrication and fraud. Without policies requiring open data, cases like this one would often go undiscovered. Even more encouragingly, things have changed in ways that begin to alter the incentive structures that made Fox’s behavior (temporarily) rewarding. More and more journals are adopting registered report formats, where researchers can submit a study proposal for evaluation and know that, if they faithfully execute that study, it will get published regardless of outcome. In other words, they will have the freedom to be uninvested in how their study turns out.