Friday, June 21, 2013

The Tyrion Lannister Paradox: How Small Effect Sizes can be Important


There has been a lot of debate lately about effect sizes. On the one hand, there are effects in the social priming literature that seem surprisingly large given the subtlety of the manipulation, the between-subjects design, and the (small) sample size. On the other hand, some researchers can be heard complaining about small effect sizes in other areas of study (for example, cognitive psychology). Why would we want to study small effects?

This is not a new question.  We could go further back in history but let’s stop in 1992, the year in which an insightful article on small effect sizes appeared, authored by Deborah Prentice and Dale Miller. Prentice and Miller argue that there are two valid reasons why psychologists study small effects.

The first reason is that researchers are trying to establish the minimal conditions under which an effect can be found. They accomplish this by minimally manipulating the independent variable.  The second reason is that researchers are invested in showing that an effect occurs even under very unfavorable conditions.

According to this analysis, there are two modes of experimentation. One is targeted at accounting for maximal variance and is therefore interested in big effects. And the other is aiming to provide the most stringent tests of a hypothesis. 

So researchers who study small effects generally (generally being the operative word here) aren’t doing this because they enjoy being obscure, esoteric, fanciful, eccentric, absurd, ludicrous, kooky, or wacky. They are simply trying to be good scientists. An experiment might look farfetched but this doesn’t mean it is. It might very well be the product of rigorous scientific thought.

If we accept experiments with small effect sizes as scientifically meaningful, then the next question becomes how to evaluate these experiments. Here Prentice and Miller make an important observation. They point out that researchers who perform small-effect-size experiments are not committed to a specific operationalization of a finding. It is one out of many operationalizations that might have been used.

Take for example (my example, not that of Prentice and Miller) a simple semantic priming experiment. The hypothesis is that words (e.g., doctor) are more easily recognized when preceded by a semantically related word (e.g., nurse) than when preceded by a semantically unrelated word (e.g., bread).

There are many ways semantic priming (and more generally the theory of semantic memory) can be tested. For example, we could present the prime words on a list and then present target words as word stems (e.g., do---r). Our prediction then would be that subjects are more likely to complete the word stem as doctor (as opposed to, say, dollar, or dormer) when primed with nurse than when primed with bread.

We could test the same idea in a response-time paradigm, for instance by using a lexical-decision task—in which subjects decide as quickly and as accurately as possible whether a given string of letters is a genuine word—or a naming task, in which subjects merely read the words aloud. The prediction is that lexical decisions and naming are faster for primed words (nurse-doctor) than for words that are not primed (bread-doctor).

Such response-time paradigms open up a plethora of options. It is possible to vary: the amount of time that elapses between the presentation of the prime and that of the target, the presentation duration of the prime, whether or not the prime is masked, the nature of the words being used, the strength of the semantic relation between prime and target, the number of words in the entire experiment, font size, capitalization, font color, and so on.

Combine this with the various ways in which response times can be trimmed or transformed before analysis and you’ve got a huge number of options. Each combination of options will yield a different effect size. But effect size is not the name of the game here. At issue is whether semantic priming occurs or not.
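To make the point about analysis choices concrete, here is a minimal sketch. The response times are made up for illustration (they are not data from any study), the 1000 ms cutoff is an arbitrary assumption, and for simplicity the two conditions are treated as independent samples, even though real priming experiments are usually within-subject designs.

```python
from math import log, sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def trim(rts, cutoff=1000):
    """Drop response times above an (arbitrary, assumed) cutoff in ms."""
    return [rt for rt in rts if rt <= cutoff]


def log_transform(rts):
    """Natural-log transform of response times."""
    return [log(rt) for rt in rts]


# Hypothetical response times in ms; each list contains one slow outlier.
unrelated = [612, 655, 590, 701, 640, 668, 1450, 625]  # e.g., bread-doctor
related = [580, 610, 565, 660, 598, 630, 1380, 590]    # e.g., nurse-doctor

effect_sizes = {
    "raw": cohens_d(unrelated, related),
    "trimmed": cohens_d(trim(unrelated), trim(related)),
    "log": cohens_d(log_transform(unrelated), log_transform(related)),
}

for name, d in effect_sizes.items():
    print(f"{name:>8}: d = {d:.2f}")
```

The same hypothetical data yield three different standardized effect sizes, yet the direction of the priming effect is the same under every analysis.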

Any combination of options may give rise to an experiment that is diagnostic with respect to the semantic priming hypothesis. The most diagnostic experiment will not be the one with the largest effect size. Rather, it will be the one in which the effect is least likely to occur. There's a good chance this will be the experiment with the smallest effect size. Let’s look at some evidence for this claim.

In a lexical decision task, subjects make judgments about words. In a naming task they simply read the words aloud; there is no decision involved and access to the word’s meaning is not necessary to perform the task. This absence of a need to access meaning makes it more difficult to find semantic priming effects in naming than in lexical decision. And indeed, a meta-analysis shows that semantic-priming effects are about twice as large in lexical decision experiments (Cohen’s d = .33) as in naming experiments (Cohen’s d = .16). Still, because they arise under less favorable conditions, priming effects are more impressive in naming than in lexical decision.
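One practical consequence of the two meta-analytic effect sizes can be sketched with a rough power calculation. This is only an illustration under stated assumptions: a two-sided test at alpha = .05, 80% power, the normal approximation, and two independent groups. Actual priming experiments typically use within-subject designs, so real sample-size requirements will differ.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group needed to detect effect size d.

    Uses the normal approximation for a two-sided, two-sample comparison:
    n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)


for task, d in [("lexical decision", 0.33), ("naming", 0.16)]:
    print(f"{task}: d = {d:.2f}, about {n_per_group(d)} participants per group")
```

Because the required sample size grows with the inverse square of d, halving the effect size roughly quadruples the number of participants needed.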

Prentice and Miller argue that authors should consider the two different goals of experimentation (accounting for maximal variance vs. using the most minimal manipulation) when designing and reporting their studies. I can't recall ever having come across such reporting in the papers I have read, but it seems like a good idea.

The take-home message is that we should not dismiss small effects so easily. Tyrion Lannister may be the character that is smallest in stature in Game of Thrones but he is also one of the game’s biggest players.



14 comments:

  1. "The first reason is that researchers are trying to establish the minimal conditions under which an effect can be found. They accomplish this by minimally manipulating the independent variable. The second reason is that researchers are invested in showing that an effect occurs even under very unfavorable conditions."
    This is interesting, thanks for pointing to it. It occurred to me that these reasons almost never occur to me, hence I am unimpressed with small effect sizes.

    Roughly, the thought is: I'm not interested in getting some core competence to poke its head over the ramparts, I'm interested in identifying the composition and organisation of distributed, task-specific solutions to problems. Performance is competence, in that framework, and a small effect size indicates you tweaked something that isn't 'mission critical'.

    Useful! This explains a few things for me :)

    Replies
    1. Hi Andrew, I really enjoyed your blog post you’ve linked to (and yours too, of course, Rolf!). I’m looking forward to seeing the next reply in Psych Sci with the running header ‘Tearing Your Paper a New One’, re: “the small effect effect vs Gibson”.

      I very much agree that theories like embodiment need to show large effect sizes. They are claiming that knowledge structures are based on bodily affordances. I guess on this point I’d say the theory claims it is in the cognitive structures where the large effect sizes will be found. That conceptual knowledge will be visible in bodily motion is not a clear prediction of the theory. In fact, it seems like a bit of a problem for embodiment to me, which is why I’m not a strong advocate, cf. Mahon & Caramazza (2008(?)).

      That aside, I agree that large effect sizes are what we are all looking for. But I think there are exceptions to this rule. As Rolf picks up on, the minimal operating condition is one such situation. I’d like to add one more.

      Sometimes the environment alone accounts for a large proportion of the variance. I’m going to take an example I’m more familiar with, but the argument will apply generally to cases where the environment has a large effect on cognition inasmuch as it places a limit on what information is available.

      The example I’ll take is when we are attempting to infer the thoughts of others. We are in general successful liars inasmuch as (1) we tend not to give ourselves away and (2) whatever cues we do give off aren’t picked up on by lie detectors (human or machine).

      Much of the variance is constrained by the environment: there is no Pinocchio’s nose. Extensive training increases accuracy from about 54%, just above chance, to 60%, a mere 6% increase. But as Tim Levine has argued, it may be that only 10% of the liars give themselves away, so there may be a ceiling as to how accurate we can be from what is available. This argument may not be true, but for our purposes let’s assume in our hypothetical world that it has been shown the best one could do with whatever is available in the environment is 60%.

      This has led to a shift, in my opinion, towards a more pragmatic attempt to increase cues to deception by manipulating the behaviour of the speaker. This is a promising strategy, and I suspect it will have some success in the long run. But it abandons the attempt to understand the underlying cognition. The question is, then, even though there appears to be a tight window in which judgmental accuracy might be expected to change, should we consider this an unimportant and uninteresting change? Should we assume that ‘cognition’ per se is not an interesting factor in reaching the judgment other than to know how it transforms environmental information into a judgment outcome, as though a one-to-one mapping can be seen? I’d say this small shift might be very telling because a 6% increase in this instance reflects a difference between almost guessing and approaching ceiling accuracy.

      One might argue that, if we know the environment places these constraints on our ability, any other cognitive effects are less or even not at all important. After all, it is clearly the state of the environment that is the biggest hurdle for lie detection. In response, I’d say that, given a stable unchanging environment, it is important to know how, say, memory, prior expectations, and so on can influence how we make these judgments, as they undoubtedly will be shown to do.

      I’d be interested in hearing more thoughts on this: is it really worth explaining the 6%, or would we be better off trying to explain what factors in the environment explain the majority of the variance?

    2. I'm glad this has been educational. Otherwise I do indeed think that different theoretical frameworks may have different views on effect sizes. How you articulate this is a matter of style of course.

  2. Isn't there a connection between Popperian falsification and the interest in small effect sizes, via Fisherian statistics? If we're trying to falsify a hypothesis about the underlying mechanisms we often do that by equating two conditions in all ways except for one factor, which (under the null hypothesis) shouldn't have any influence on the results. A significant result, of any effect size, allows us to reject the null. We falsify the theory, and so - the dogma goes - progress is made.

    If you either don't sign up to falsificationism as a method, or you have a null model which says you should always get some effect (as Andrew does), then your interest in small effect sizes correspondingly diminishes.

    Replies
    1. Dear Tom, I don't think Popperian falsification is concerned with falsifying the null hypothesis - the goal is to falsify a theory, by testing predictions from the theory, so this would be a falsification of the alternative hypothesis. Most often, psychologists are not trying to falsify their predictions in a Popperian fashion - in the case that the null hypothesis is rejected, there is actually only confirmation of the prediction, not falsification of any theory. In the case of priming, the prediction is that priming is possible, not that it is not possible. Finding that it is possible is a confirmation. A falsification would be a critical experiment where your theory predicts that priming effects should occur, but you do not find a priming effect. The problem is, priming theories are so underdeveloped (one might even say they are actually not truly scientific theories, in the way that philosophy of science requires theories to be), that it seems very difficult to think of such a critical experiment. So, the problem is not with falsification, or null hypothesis significance testing, but with a lack of theory.

    2. Daniel, it depends on whether you talk about social priming or semantic priming. I wouldn't say semantic priming theories are underdeveloped. The issue there is not so much whether or not priming occurs but what the organization of semantic memory is. This is why I said I was using a simple example. I used it for expository purposes, not as a characterization of the semantic priming literature.

      In social priming, there has as yet not been much interest in underlying mechanisms (but see my earlier posts on social priming).

    3. Daniel, notwithstanding your comments about how scientists are normally trying to confirm rather than falsify, and about social priming, on which I share similar views, there is a link between significance testing and falsification. For both, the "win" is if you prove something not true (reject the null, falsify a theory). It seems plausible this common core is behind the celebration of effects in themselves and the relative neglect of effect size as a matter of importance.

  3. I agree with the general notion that small effects can be interesting for the reasons given. However, the argument about the (lack of) importance of effect sizes strikes me as misleading and incoherent. First, effect size is argued to be unimportant: "Each combination of options will yield a different effect size. But effect size is not the name of the game here. At issue is whether semantic priming occurs or not." However, in further developing the same semantic priming example, effect size suddenly becomes important (and not just the question of whether there is a priming effect or not): "And indeed, a meta-analysis shows that semantic-priming effects are about twice as large in lexical decision experiments (Cohen’s d =.33) than in naming experiments (Cohen’s d =.16)."

    Ignoring effect size and asking 'is there (still) a priming effect?' is akin to a psychophysical experiment asking 'can this faint noise be heard or not?' And since the development of signal detection theory, this question is pretty vacuous.

    I also fail to see why a priming effect of d = 0.16 for naming experiments is more impressive than d = 0.33 for lexical decision. Given constant false alarm rates, is a hit rate of 55% for a very faint noise more impressive than a hit rate of 60% for a faint noise?

    Replies
    1. The argument would be that lexical decision has a decision component and thus generally takes longer than naming, which gives the prime more chance to influence the target. Naming is a more implicit task that doesn't require access to meaning (you can pronounce phonotactically possible nonwords like floint).

  4. Here's another worry about effect sizes I have: the value of an effect size, large or small, also depends on how meaningful the variable is. As psychologists, we're pretty skilled at choosing variables for our experiments which demonstrate mutability under the conditions we're interested in (when we train students this is called avoiding floor and ceiling effects). In some domains the variables are inherently valuable (e.g. hit rate, which Johannes mentioned). In other domains, it is less clear that the variable used is valuable (I am thinking of many social priming studies here). Although the effect size may show that subjects' questionnaire answers, or their time to walk a corridor, are mutable, it doesn't mean that these variables are meaningfully mutable, in the sense that they inform us about the concepts putatively under investigation.

    Replies
    1. This is just the issue of validity, surely. I've always thought this was the hard problem; people get so excited about demonstrating reliability, I've never understood it - that's the easy bit :)

      This is why I'm getting into the task dynamics stuff; it provides you with a finite but complete set of independent and dependent variables to play with that comes from the task, not from the experimenter. My job is just to identify the relevant dynamics, and whether I've done that right comes out of the data. Real ecological validity!

    2. Tom, I agree 100% with regard to the social priming studies you mention.

  5. Effect sizes matter because (among other reasons) psychologists routinely make claims about the power of effects. They do not always say "This effect is very powerful." Instead, they say things that imply powerful effects. Examples:
    The Unbearable Automaticity of Being
    The sane are indistinguishable from the insane
    Stereotypes are the default basis of person perception.
    Social beliefs create social reality more than social reality creates social beliefs.
    Teacher expectations cause student achievement more than student achievement causes teacher expectations.
    The Power of the Situation!
    Reign of Error.

    I could go on. But do I need to? If you need me to, you can just visit this:
    http://pigee.wordpress.com/2013/02/23/when-effect-sizes-matter-the-internal-incoherence-of-much-of-social-psychology/
    See especially my 2/25/13 reply to Dave Nussbaum, towards the bottom of the discussion.

    Lee

  6. I find the whole discussion a bit confusing. Apparently small effects can be important, because the importance of an effect depends on all sorts of contextual factors as well as its apparent size.

    However, one cannot in general compare the size of an effect using standardized effect size metrics. Thus d = 0.4 in study 1 could be smaller or larger than d = 0.2 in study 2. Standardized effect size metrics can only be compared in the (in practice rather uncommon) special case that the standardizer (variance or SD) is identical for each effect (assuming that the effects are measuring the same thing in the first place).

    http://www.academia.edu/167213/Standardized_or_simple_effect_size_What_should_be_reported

    Priming is a case in point - as different studies sample populations with different variability and use different estimates of variability, the use of standardized metrics can be extremely misleading.
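    A small numerical sketch of this point, using made-up numbers (the raw differences and SDs below are hypothetical, not taken from any study): a study with the larger standardized effect can have the smaller raw effect when the standardizers differ.

```python
def standardized_d(raw_difference_ms, sd_ms):
    """Standardized effect size: raw mean difference divided by the standardizer (SD)."""
    return raw_difference_ms / sd_ms


# Hypothetical studies: raw priming effect and SD, both in milliseconds.
study_1 = {"raw_difference_ms": 10, "sd_ms": 25}
study_2 = {"raw_difference_ms": 30, "sd_ms": 150}

d1 = standardized_d(**study_1)
d2 = standardized_d(**study_2)

print(f"Study 1: raw = {study_1['raw_difference_ms']} ms, d = {d1:.1f}")
print(f"Study 2: raw = {study_2['raw_difference_ms']} ms, d = {d2:.1f}")
```

    In this sketch the first study has twice the standardized effect size of the second but only a third of the raw effect.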
