Comments on Rolf Zwaan: "Time, Money, and Morality"

I think it is good that this kind of analysis is being performed and shared in a public place. I wanted to consider some details of the analysis and an alternative approach.

Rolf focused on an effect that was repeatedly found across the four experiments in Gino and Mogilner (2013): participants were less likely to cheat when focused on time than participants in a control or money-focused condition.

These are not the only reasonable choices. Gino and Mogilner (2013) also explored the effect of a money focus through a variety of main effects and contrasts. The p-values produced by these different hypothesis tests are not independent of the p-values analyzed by Rolf, and such dependencies mean that it is not appropriate to include them all in a single p-curve analysis. Table 1 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Table1.pdf) highlights the statistics used for the different analyses. The p-curve analysis for the money effect does not indicate p-hacking (p = 0.76); details of that analysis are in Figure 1 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Figure1.pdf). These two conclusions are not in conflict, because the statistics measure different effects. Nevertheless, concluding p-hacking from a p-curve analysis depends on which statistics are analyzed; importantly, the p-curve analysis cannot consider both sets of statistics simultaneously because of the dependencies.
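For readers unfamiliar with the mechanics, here is a minimal sketch of the binomial variant of a p-curve test. The p-values below are hypothetical stand-ins, not statistics from the paper:

```python
from math import comb

# Hypothetical significant p-values, all just under .05 (illustrative only)
ps = [0.041, 0.046, 0.038, 0.044]
low = sum(p < 0.025 for p in ps)  # count in the "small" half: 0 here

# If the null were true, significant p-values would be uniform on (0, .05),
# so each lands below .025 with probability 1/2.  A genuine effect pushes
# p-values toward 0; a cluster just under .05 is the p-hacking signature.
p_binom = sum(comb(len(ps), k) * 0.5 ** len(ps) for k in range(low + 1))
print(p_binom)  # 0.0625: even with n = 4 studies the skew is suggestive
```

Real p-curve analyses use more powerful variants than this binomial count, but the logic is the same: ask whether the distribution of significant p-values is right-skewed (evidential) or left-skewed (consistent with p-hacking).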
Table 2 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Table2.pdf) shows the post hoc power for each experiment. Consider the column for the time-focused statistics: the power estimates are all just above one half. The Test for Excess Significance (TES) notes that the probability that all four experiments like these would reject the null hypothesis is the product of the power values. The final row indicates that this probability is 0.079. It can be read as an estimate of the probability that direct replications of the four experiments (with the same sample sizes) would all produce statistically significant outcomes. Since this probability is below the 0.1 criterion commonly used for these kinds of analyses, readers should be skeptical that the reported results were produced with proper experiments and analyses.

The money-focused power values are higher, and their product is well above the 0.1 criterion. In this respect, the TES analysis reaches essentially the same conclusion as the p-curve analysis.

The final column of Table 2 considers a more general TES analysis that combines the money-focused and time-focused statistics with additional statistical results (highlighted in yellow in Table 1) that Gino and Mogilner (2013) deemed to support their theoretical ideas. The success probability for the full set was estimated with simulated experiments that used the properties of the reported sample statistics. The resulting probability of 0.003 is so small that it is difficult to suppose that the experiments were fully reported, properly run, and properly analyzed.

This result does not mean that there is no merit to the reported results, but it does mean that readers should be skeptical about the theoretical conclusions derived from them. Moreover, it is not obvious which effects can be believed and which are suspect.
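The core TES arithmetic is simply a product of per-experiment power estimates. A sketch with illustrative stand-in values (the actual estimates are in Table 2; they are described above only as "just above one half"):

```python
from math import prod

# Illustrative post hoc power values, one per experiment.  These are
# stand-ins chosen to be "just above one half", not the Table 2 numbers.
powers = [0.53, 0.52, 0.54, 0.53]

# TES: the chance that four independent experiments with these powers
# would *all* reach significance is the product of the powers.
p_all_significant = prod(powers)
print(round(p_all_significant, 3))  # 0.079, below the usual 0.1 criterion
```

The intuition is that a run of barely-powered experiments that all succeed is itself an improbable event, so an unbroken string of significant results from such studies invites skepticism.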
Unlike the p-curve analysis, the TES can consider the full set of experimental results that Gino and Mogilner (2013) used to support their theoretical ideas. Applying this more general approach leads to a fairly convincing conclusion that readers should doubt the validity of the relationship between the experimental data and the theoretical claims.

A spreadsheet describing the effect size and power estimates, along with R code for estimating power, can be downloaded from http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/

-- Greg Francis, 2013-12-18 15:22

It would indeed be best to run replications. However, as someone mentioned to me in an email yesterday, you cannot possibly refute questionable studies at the rate at which they are published. Experiment 4, though, was run on MTurk and so would be a good candidate for a replication: no worries about special booths, experimenters, and so on.

-- Rolf Zwaan, 2013-12-12 17:15

I agree, Chris. I'm sensitive to this issue, as we had to deal with it when I served on the Smeesters Committee. In this particular case, others had expressed skepticism about this study on Twitter, which I shared when I read the paper. I took a closer look and then noticed the issue with the p-values.
So here there was an a priori hypothesis, so to speak.

-- Rolf Zwaan, 2013-12-12 17:07

I think the issue is that the p-values are the clues to p-hacking. Basically, a collection of studies with p-values just below .05, fluctuating ns without explanation, and weird effect sizes (i.e., large relative to expectations) are clues to p-hacking. I hate seeing packages dotted with p-values around .04.

The solution for p-hacking is fairly simple: run the studies again under the same conditions (preferably with larger samples to get more precise estimates). If the results hold, the field has increased confidence in the sturdiness of the findings. If the results don't duplicate, we learn another painful lesson about the impact of chance and the downsides of QRPs.

-- Brent Donnellan, 2013-12-12 15:40

I agree that this paper appears to have the hallmarks of p-hacking. But I think we need some caution if we want to engage in post hoc p-hackery analyses. It's one thing to state a priori "I think a set of studies with this feature may show p-hackery" versus looking at the p-values first and then searching post hoc for evidence of p-hackery. Perhaps in the near future researchers interested in p-hacking will develop post hoc corrections for p-hack investigations.
-- Schotz, 2013-12-12 14:58

It seems to me that, to paraphrase the British politician Peter Mandelson, social and positive psychologists are "intensely relaxed" about the possibility of Type 1 error. In fact, I suspect that many of them don't sincerely consider Type 1 error to be, as the kids on the Internet say, "a thing". I found it, I published it, nobody has taken the time and effort to jump through the many hoops (some of them flaming) needed to refute it, therefore I win.

I think that the people who pay for all this (i.e., the taxpayers in most cases) would be appalled to discover just how little understanding very many scientists have of the appropriate use of the most basic tools of their trade. Perhaps this applies "especially" to psychologists when it comes to p-hacking, although abjectly bad statistical practice seems to be common in almost every discipline.

-- Nick Brown, 2013-12-11 20:00

Very important points that have certainly changed my outlook on things. I wonder how hard it would be to p-hack your way to a p-value of < .005.
-- Rolf Zwaan, 2013-12-11 14:56

Thanks, this is a good read.

-- Rolf Zwaan, 2013-12-11 14:52

A Bayes factor analysis shows that these kinds of p-values (close to the .05 boundary) have almost no evidential impact. This goes back to Edwards, Lindman, and Savage (1963, Psychological Review) and has recently been demonstrated again by Jim Berger and, in 2013, by Valen Johnson ("Revised Standards for Statistical Evidence"). Johnson ends up recommending an alpha level of .005. As Lindley remarked: "There is therefore a serious and systematic difference between the Bayesian and Fisherian calculations, in the sense that a Fisherian approach much more easily casts doubt on the null value than does Bayes. Perhaps this is why significance tests are so popular with scientists: they make effects appear so easily."

-- EJ, 2013-12-10 18:08

Here is your answer: "no matter how one chooses the [N and the true effect size] under the alternatives, at most 3.7% of the p values will fall in the interval (.04, .05)" (http://www.stat.duke.edu/courses/Spring10/sta122/Labs/Lab6.pdf). Having four of those in a row is pretty unlikely!

-- Jelte M. Wicherts, 2013-12-10 17:52
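Both of the closing comments rest on small calculations that are easy to reproduce. A sketch using the Sellke-Bayarri-Berger bound for the Bayes-factor point and the 3.7% figure for Wicherts's point (the p = .046 below is an illustrative near-boundary value, not a statistic from the paper):

```python
from math import e, log

# 1) Sellke-Bayarri-Berger bound: for p < 1/e, the Bayes factor in favor of
#    the null is at least -e * p * ln(p), no matter which alternative is used.
p = 0.046  # illustrative p-value near the .05 boundary
min_bf_null = -e * p * log(p)
print(round(1 / min_bf_null, 1))  # evidence against the null is at most ~2.6 : 1

# 2) Wicherts's point: if at most 3.7% of p-values can land in (.04, .05)
#    under any alternative, four in a row is bounded by 0.037 ** 4.
print(0.037 ** 4)  # about 1.9e-06
```

The first number illustrates why a p-value just under .05 carries "almost no evidential impact" in Bayesian terms; the second illustrates why a string of four such p-values is itself a red flag.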