The Tyrion Lannister Paradox: How Small Effect Sizes can be Important

There has been a lot of debate lately about effect sizes. On the one hand, there are effects in the social priming literature that seem surprisingly large given the subtlety of the manipulation, the between-subjects design, and the (small) sample size. On the other hand, some researchers can be heard complaining about small effect sizes in other areas of study (for example cognitive psychology). Why would we want to study small effects?

This is not a new question. We could go further back in history but let’s stop in 1992, the year in which an insightful article on small effect sizes appeared, authored by Deborah Prentice and Dale Miller. Prentice and Miller argue that there are two valid reasons why psychologists study small effects.

The first reason is that researchers are trying to establish the minimal conditions under which an effect can be found. They accomplish this by minimally manipulating the independent variable. The second reason is that researchers are invested in showing that an effect occurs even under very unfavorable conditions.

According to this analysis, there are two modes of experimentation. One is targeted at accounting for maximal variance and is therefore interested in big effects. And the other is aiming to provide the most stringent tests of a hypothesis.

So researchers who study small effects generally (generally being the operative word here) aren’t doing this because they enjoy being obscure, esoteric, fanciful, eccentric, absurd, ludicrous, kooky, or wacky. They are simply trying to be good scientists. An experiment might look farfetched but this doesn’t mean it is. It might very well be the product of rigorous scientific thought.

If we accept experiments with small effect sizes as scientifically meaningful, then the next question becomes how to evaluate these experiments. Here Prentice and Miller make an important observation. They point out that researchers who perform small-effect-size-experiments are not committed to a specific operationalization of a finding. It is one out of many operationalizations that might have been used.

Take for example (my example, not that of Prentice and Miller) a simple semantic priming experiment. The hypothesis is that words (e.g., doctor) are more easily recognized when preceded by a semantically related word (e.g., nurse) than when preceded by a semantically unrelated word (e.g., bread).

There are many ways semantic priming (and more generally the theory of semantic memory) can be tested. For example, we could present the prime words on a list and then present target words as word stems (e.g., do---r). Our prediction then would be that subjects are more likely to complete the word stem as doctor (as opposed to, say, dollar, or dormer) when primed with nurse than when primed with bread.

We could test the same idea in a response-time paradigm, for instance by using a lexical-decision task—in which subjects decide as quickly and as accurately as possible whether a given string of letters is a genuine word—or a naming task, in which subjects merely read the words aloud. The prediction is that lexical decisions and naming are faster for primed words (nurse-doctor) than for words that are not primed (bread-doctor).

Such response-time paradigms open up a plethora of options. It is possible to vary: the amount of time that elapses between the presentation of the prime and that of the target, the presentation duration of the prime, whether or not the prime is masked, the nature of the words being used, the strength of the semantic relation between prime and target, the number of words in the entire experiment, font size, capitalization, font color, and so on.

Combine this with the various ways in which response times can be trimmed or transformed before analysis and you’ve got a huge number of options. Each combination of options will yield a different effect size. But effect size is not the name of the game here. At issue is whether semantic priming occurs or not.

Any combination of options may give rise to an experiment that is diagnostic with respect to the semantic priming hypothesis. The most diagnostic experiment will not be the one with the largest effect size. Rather, it will be the one in which the effect is least likely to occur. There's a good chance this will be the experiment with the smallest effect size. Let’s look at some evidence for this claim.

In a lexical decision task subjects make judgments about words. In a naming task they simply read the words aloud; there is no decision involved and access to the word’s meaning is not necessary to perform the task. This absence of a need to access meaning makes it more difficult to find semantic priming effects in naming than in lexical decision. And indeed, a meta-analysis shows that semantic-priming effects are about twice as large in lexical decision experiments (Cohen’s d =.33) than in naming experiments (Cohen’s d =.16). Still, priming effects are more impressive in naming than in lexical decision.

Prentice and Miller argue that authors should consider the two different goals of experimentation (accounting for maximal variance vs. using the most minimal manipulation) when designing and reporting their studies. I can't recall ever having come across such reporting in the papers I have read but it seems like a good idea.

The take-home message is that we should not dismiss small effects so easily. Tyrion Lannister may be the character that is smallest in stature in Game of Thrones but he is also one of the game’s biggest players.

Reacties

Andrew22 juni 2013 om 14:32
The first reason is that researchers are trying to establish the minimal conditions under which an effect can be found. They accomplish this by minimally manipulating the independent variable. The second reason is that researchers are invested in showing that an effect occurs even under very unfavorable conditions.
This is interesting, thanks for pointing to it. It occurred to me that these reasons almost never occur to me, hence I am unimpressed with small effect sizes.

Roughly, the thought is: I'm not interested in getting some core competence to poke it's head over the ramparts, I'm interested in identifying the composition and organisation of distributed, task specific solutions to problems. Performance is competence, in that framework, and a small effect size indicates you tweaked something that isn't 'mission critical'.

Useful! This explains a few things for me :)
BeantwoordenVerwijderen
Reacties
Tom Stafford22 juni 2013 om 17:14
Isn't there a connection between Popperian falsification and the interest in small effect sizes, via fisherian statistics? If we're trying to falsify a hypothesis about the underlying mechanisms we often do that by equating two conditions in all ways except for one factor, which (under the null hypothesis) shouldn't have any influence on the results. A significant result, of any effect size, allows us to reject the null. We falsify the theory, and so - the dogma goes - progress is made.

If you either don't sign up to falsficationism as a method, or you have a null model which says you should always get some effect (as Andrew does), then your interest in small effect sizes correspondingly diminishes
BeantwoordenVerwijderen
Reacties
Unknown22 juni 2013 om 17:23
I agree with the general notion that small effects can be interesting for the reasons given. However, the argument about the (lack of) importance of effect sizes strikes me as misleading and incoherent. First, effect size is argued to be unimportant: "Each combination of options will yield a different effect size. But effect size is not the name of the game here. At issue is whether semantic priming occurs or not." However, in further developing the same semantic priming example effect size suddenly becomes important (and not just the questions of whether there is a priming effect or not): "And indeed, a meta-analysis shows that semantic-priming effects are about twice as large in lexical decision experiments (Cohen’s d =.33) than in naming experiments (Cohen’s d =.16)."

Ignoring effect size and asking 'is there (still) a priming effect?' is akin to a psychophysical experiment asking 'can this faint noise be heard or not?' And since the development of signal detection theory, this question is pretty vacuous.

I also fail to see why a priming effect of d = 0.16 for naming experiments is more impressive than d = 0.33 for lexical decision. Given constant false alarm rates, is a hit rate of 55% for a very faint noise more impressive than a hit rate of 60% for a faint noise?
BeantwoordenVerwijderen
Reacties
Tom Stafford23 juni 2013 om 10:52
Here's another worry about effect sizes I have: the value of an effect size, large or small, also depends on how meaningful the variable is. As psychologists, we're pretty skilled at choosing variables for our experiments which demonstrate mutability under the conditions we're interested in (when we train students this is called avoiding floor and ceiling effects). In some domains the variables are inherently valuable (e.g. hit rate, which Johannes mentioned). In other domains, it is less clear that the variable used is valuable (I am thinking of many social priming studies here). Although the effect size may show that subects' questionnaire answers, or their time to talk a corridor, are mutable, it doesn't mean that these variables are meaningfully mutable, in the sense that they inform us about the concepts putatively under investigation.
BeantwoordenVerwijderen
Reacties
I'dratherbeplayingtennis22 augustus 2013 om 17:35
Effect sizes matter because (among other reasons) psychologists routinely make claims about about the power of effects. They do not always say "This effect is very powerful." Instead, they say things that imply powerful effects. Examples:
The Unbearable Automaticity of Being
The sane are indistinguishable from the insane
Stereotypes are the default basis of person perception.
Social beliefs create social reality more than social reality
creates social beliefs.
Teacher expectations cause student achievement more than student achievement causes teacher expectations.
The Power of the Situation!
Reign of Error.

I could go on. But do I need to? If you need me to, you can
just visit this:
http://pigee.wordpress.com/2013/02/23/when-effect-sizes-matter-the-internal-incoherence-of-much-of-social-psychology/
See especially my 2/25/13 reply to Dave Nussbaum, towards the bottom of the discussion.

Lee
BeantwoordenVerwijderen
Reacties
Anoniem23 augustus 2013 om 15:18
I find the whole discussion a bit confusing. Apparently small effects can be important, because the importance of an effect depends on all sorts of contextual factors as well as its apparent size.

However, one can not in general compare the size of an effect using standardized effect size metrics. Thus d = 0.4 in study 1 could be smaller or larger than d = 0.2 in study 2. Standardized effect size metrics can only be compared in the (in practice rather uncommon) special case that the standardizer (variance or SD) is identical for each effect (assuming that the effects are measuring the same thing in the first place).

http://www.academia.edu/167213/Standardized_or_simple_effect_size_What_should_be_reported

Priming is a case in point - as different studies sample populations with different variability and use different estimates of variability the use of standardized metrics can be extremely misleading.

BeantwoordenVerwijderen
Reacties

Reactie toevoegen

Drang naar Samenhang

Zoeken in deze blog

The Tyrion Lannister Paradox: How Small Effect Sizes can be Important

Reacties

Een reactie posten