Tuesday, 13 November 2012
Flaky chocolate and the New England Journal of Medicine
Early in October a weird story hit the media: a nation’s chocolate consumption is predictive of its number of Nobel prize-winners, after correcting for population size. This is the kind of kooky statistic that journalists love, and the story made a splash. But was it serious? Most academics initially assumed not. The source of the story was the New England Journal of Medicine, an august publication with stringent standards, which triages a high proportion of submissions without sending them out for review. (And don't try asking for an explanation of why you’ve been triaged.) It seemed unlikely that a journal with such exacting standards would give space to a lightweight piece on chocolate. So the first thought was that the piece had been published to make a point about the dangers of assuming causation from correlation, or the inaccuracies that can result when a geographical region is used as the unit of analysis. But reading the article more carefully gave one pause. It did have a somewhat jocular tone. Yet if it was intended as a cautionary tale, we might have expected it to be accompanied by some serious discussion of the methodological and interpretive problems with this kind of analysis. Instead, beneficial effects of dietary flavanols were presented as the most plausible explanation of the findings.
The author, cardiologist Franz Messerli, did discuss the possibility of a non-causal explanation for the findings, only to dismiss it. He stated “as to a third hypothesis, it is difficult to identify a plausible common denominator that could possibly drive both chocolate consumption and the number of Nobel laureates over many years. Differences in socioeconomic status from country to country and geographic and climatic factors may play some role, but they fall short of fully explaining the close correlation observed.” And how do we know they “fall short”? Well, because Dr Messerli says so.
As is often the case, the blogosphere did a better job of critiquing the paper than the journal editors and reviewers (see, for instance, here and here). The failure to consider seriously the role of a third explanatory variable was widely commented on, but, as far as I am aware, nobody actually did the analysis that Messerli should have done. I therefore thought I'd give it a go. Messerli explained where he’d got his data from – a chocolatier’s website and Wikipedia – so it was fairly straightforward to reproduce them (with some minor differences due to missing data from one chocolate website that's gone offline). Wikipedia helpfully also provided data on gross domestic product (GDP) per head for different nations, and it was easy to find another site with data on proportion of GDP spent on education (except China, which has figures here). So I re-ran the analysis, computing the partial correlation between chocolate consumption and Nobel prizes after adjusting for spending per head on education. When education spend was partialled out, the correlation dropped from .73 to .41, just falling short of statistical significance.
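For anyone who wants to check such claims themselves, the adjustment is a first-order partial correlation, which takes one line of algebra. Here is a minimal Python sketch (my tutorial linked below uses R, but the logic is identical); the data are simulated, not the real chocolate figures, and the variable names only gesture at the example:

```python
import numpy as np

def partial_corr(x, y, z):
    # First-order partial correlation r_xy.z: the correlation between
    # x and y after the linear effect of z is removed from both.
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

rng = np.random.default_rng(1)
z = rng.normal(size=500)        # stand-in for the confounder (education spend)
x = z + rng.normal(size=500)    # stand-in for chocolate consumption
y = z + rng.normal(size=500)    # stand-in for Nobel prizes per capita

raw = np.corrcoef(x, y)[0, 1]   # inflated by the shared confounder
adj = partial_corr(x, y, z)     # drops towards zero once z is partialled out
```

In this simulation x and y are correlated only because both depend on z, so the raw correlation is substantial while the partial correlation hovers around zero.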
Since Nobel laureates typically are awarded their prizes only after a long period of achievement, a more convincing test of the association would be based on data on both chocolate consumption and education spend from a few decades ago. I’ve got better things to do than to dig out the figures, but I suggest that Dr Messerli might find this a useful exercise.
Another point to note is that the mechanism proposed by Dr Messerli involves an impact of improved cardiovascular fitness on cognitive function. The number of Nobel laureates is not the measure one would pick if setting out to test this hypothesis. The topic of national differences in ability is a contentious and murky one, but it seemed worth looking at such data as are available on the web to see what the chocolate association looks like when a more direct measure is used. For the same 22 countries, the correlation between chocolate consumption and estimated average cognitive ability is nonsignificant at .24, falling to .13 when education spend is partialled out.
I did write a letter to the New England Journal of Medicine reporting the first of my analyses (all there was room for: they allow you 175 words), but, as expected, they weren't interested. "I am sorry that we will not be able to print your recent letter to the editor regarding the Messerli article of 18-Oct-2012." they wrote. "The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received."
It took me all of 45 minutes to extract the data and run these analyses. So why didn’t Dr Messerli do this? And why did the NEJM editor allow him to get away with asserting that third variables “fall short” when it’s so easy to check it out? Could it be that in our celebrity-obsessed world, the journal editors think that there’s no such thing as bad publicity?
Messerli, F. (2012). Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 367(16), 1562-1564. DOI: 10.1056/NEJMon1211064
Sunday, 24 June 2012
Causal models of developmental disorders: the perils of correlational data
Experimental psychology depends heavily on statistics, but
psychologists don’t always agree about the best ways of analyzing data. Take
the following problem:
I have two groups each of 30 children, dyslexics and
controls. I give them a test of auditory discrimination and find a significant
difference between the groups, with the dyslexic mean being lower. I want to
see whether reading ability is related to the auditory task. I compute the
correlation between the auditory measure and reading, and find it is .42, which
in a sample of 60 cases is significant at the .001 level.
I write up the results, concluding that poor auditory skill
is a risk factor for poor reading. But reviewers are critical.
So what’s wrong with this?
I’ll deal quickly with two obvious points. First, there is
the well-worn phrase that correlation does not equal causation. The correlation
could reflect a causal link from auditory deficit to poor reading,
but we need also to consider other causal routes, as I’ll illustrate further
below. This is an issue about interpretation rather than data analysis.
A second point concerns the need to look at the data rather
than just computing the correlation statistic. Correlations can be sensitive to
distributional properties of the data and can be heavily influenced by
outliers. There are statistical ways of checking for such effects, but a good
first step is just plotting a scatterplot to see whether the data look orderly.
A tip for students: if your supervisor
asks to see your project data, don’t just turn up with numerical output from
the analysis: be ready to show some plots.
[Figure 1: Fictitious data showing spurious correlation between height and reading ability]
A less familiar point concerns the pooling of data across
the dyslexic and control groups. Some people have strong views about this, yet,
as far as I’m aware, it hasn’t been discussed much in the context of
developmental disorders. I therefore felt it would be good to give it an airing
on my blog and see what others think.
Let’s start with a fictitious example that illustrates the
dangers of pooling data from two groups. Figure 1 is a scatterplot showing the
correlation between height and reading ability in groups of 6-year-olds and
10-year-olds. If I pool across groups, I’m likely to see a strong correlation
between height and reading ability, whereas within any one age group the
correlation is negligible. This is a clear case of spurious correlation, as
illustrated in Figure 2. Here the case against pooling is unambiguous, and it's
clear that if you look at the correlation within either age band, there is no
relationship between reading ability and height.
[Figure 2: Model showing how a spurious correlation between height and reading arises because both are affected by age]
Examples such as this have led some people to argue that you
shouldn’t pool data in studies such as the dyslexic vs. control example. Or, to
be more precise, the recommendation is usually that you should check the
correlations within each group, and
avoid pooling if they don’t look consistent with the pooled correlation. I’ve
always been a bit uneasy about this logic and have been giving some thought as
to why.
First, there is the simple issue of power. If you halve your
sample size, then you increase the standard error of estimate for a correlation
coefficient, making it more likely that it will be nonsignificant. Figure 3
shows the 95% confidence intervals around a correlation of .5 depending on
sample size, and you can readily see that they are wider for small samples than
for large ones. There's a nice website by Stan Brown that gives relevant
formulae in Excel.
[Figure 3: 95% confidence interval around estimated correlation of .5, with different sample sizes]
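These intervals come from the Fisher z-transform: atanh(r) is approximately normal with standard error 1/sqrt(n - 3). A minimal Python sketch:

```python
import math

def corr_ci(r, n, zcrit=1.96):
    # Approximate 95% confidence interval for a correlation via the
    # Fisher z-transform; zcrit is the normal quantile for the interval.
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)

lo_small, hi_small = corr_ci(0.5, 20)    # roughly (.07, .77)
lo_large, hi_large = corr_ci(0.5, 200)   # roughly (.39, .60)
```

With n = 20 the interval around r = .5 runs nearly from zero to .8; with n = 200 it shrinks dramatically, which is the point of Figure 3.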
A less obvious point is that the data in Figure 1 look
analogous to the dyslexic vs. control example, but there is an important
difference. We know where we are with age: it is unambiguous to define and measure.
But dyslexia is more tricky. Suppose we substitute dyslexia for age, and
auditory processing for height, in the model of spurious correlation in Figure
2. We have a problem: there is no independent diagnostic test for dyslexia. It
is actually defined in terms of one of our correlated variables, reading
ability. Thus, the criterion used to allocate children to groups is not
independent of the measures that are entered into the correlation. This creates
distortions in within-group correlations, as follows.
If we define our groups in terms of their scores on one
variable, we effectively restrict the range of values obtained by each group,
and this lowers the correlation. Furthermore, the restriction will be less for
the controls than for the dyslexic group - who are typically selected as
scoring below a low cutoff, such as one SD below the mean. Figure 4 shows simulated
data for two groups selected from a population where the true correlation
between variables A and B is .5. Thirty individuals (dyslexics) are selected as
scoring more than 1 SD below average on variable A, and another 30 (controls)
are selected as scoring above this level.
[Figure 4: Correlations obtained in samples of dyslexics (red) and controls (blue) for 20 runs of simulation with N = 30 per group]
The Figure shows correlations from twenty
runs of this simulation. For both groups, the average correlation is less than
the true value of .5, because of the restricted range of scores on variable A.
However, because the range is more restricted for the dyslexic group, their
average correlation is lower than that of the controls. A correlation of .42 corresponds to the .05 significance level for a sample of
this size, and we can see that the controls are more likely to exceed this
value than the dyslexic group. All these results are just artefacts of the way
in which the groups were selected: both groups come from the same population
where r = .5.
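This selection effect can be sketched in a few lines of Python. Here I use one very large sample rather than 20 runs of N = 30, so the attenuation shows through without sampling noise:

```python
import numpy as np

rng = np.random.default_rng(42)
# Population in which variables A and B have a true correlation of .5
N = 100_000
a = rng.normal(size=N)
b = 0.5 * a + np.sqrt(1 - 0.5**2) * rng.normal(size=N)

dyslexic = a < -1    # selected as scoring > 1 SD below the mean on A
controls = a >= -1   # everyone else
r_dys = np.corrcoef(a[dyslexic], b[dyslexic])[0, 1]
r_con = np.corrcoef(a[controls], b[controls])[0, 1]
# Both fall below the true .5, and the more severely range-restricted
# dyslexic group falls further
```

In this run the dyslexic group's correlation comes out around .25 and the controls' around .42, even though both were drawn from the same r = .5 population.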
What can we conclude from all this? Well, the bottom line is
that if we find non-significant within-group
correlations this does not necessarily invalidate a causal model. The
simulation shows that we may find that within-group correlations look quite
different in dyslexic and control groups, even if they come from a common
distribution.
So where does this leave us?! It would seem that in general,
within-group data are unlikely to help us distinguish between causal and
non-causal models: they may be compatible with both. So how should we proceed?
There’s no simple solution, but here are some suggestions:
1. If considering correlational data, always report the 95%
confidence interval. Usually people (including me!) just report the correlation coefficient,
degrees of freedom and p-value. It’s so uncommon to add confidence intervals
that I suspect most psychologists don’t know how to compute them. Do not assume
that, because one correlation is significant and another is not, the two are
meaningfully different. This
website can be used to test for the significance of the difference between
correlations. I would, however, advise against interpreting such a comparison
if your data are affected by the kinds of restriction of range discussed above.
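For independent groups, the comparison rests on the same Fisher z-transform as the confidence intervals above. A minimal Python sketch, with hypothetical values chosen to show that "one significant, one not" does not imply a significant difference:

```python
import math

def corr_diff_z(r1, n1, r2, n2):
    # z statistic for the difference between two independent correlations:
    # Fisher-transform each, divide by the standard error of the difference.
    num = math.atanh(r1) - math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return num / se

# Hypothetical values: with n = 30 per group, r = .60 is significant at .05
# and r = .25 is not, yet the difference between them is not significant
z = corr_diff_z(0.60, 30, 0.25, 30)   # about 1.61, below the 1.96 cutoff
```

So a study could easily report a "significant correlation in dyslexics but not controls" when the two correlations do not differ reliably at all.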
2. Study the relationship between key variables in a large unselected
sample covering a wide range of scores. This is a more tractable solution, but
is seldom done. Typically, people recruit an equivalent number of cases and
controls, with a sample size that is inadequate for getting a precise estimate
of a correlation in either group. If your underlying model predicts a linear
relationship between, say, auditory processing and phonological awareness, then
with a sample of 200 cases, a fairly precise estimate can be obtained. With this approach, one
can also identify whether the relationship is linear.
3. More generally, it’s important to be explicit about what
models you are testing. For instance, I’ve identified four underlying models of
the relationship between auditory deficit and language impairment, as shown in Figure
5. In general, correlational data on these two skills won’t distinguish between
these models, but specifying the alternatives may help you think of other data
that could be informative.
[Figure 5: Models of causal relationships underlying observed correlation between auditory deficit and language impairment]
For instance:
- We found that, when studying heritable conditions, it is useful to include data on parents or siblings. Models differ in predictions about how measures of genetic risk - for instance, family history, or presence of specific genetic variants - relate to A (auditory deficit) and B (language impairment) in the child. This approach is illustrated in this paper. Interestingly, we found that the causal model that is often implicitly assumed, which we termed the Endophenotype model, did not fit the data, but nor did the spurious correlation model, which corresponds here to the Pleiotropy model.
- There may be other groups that can be informative: for instance, if you think auditory deficits are key in causing language problems, it may be worth including children with hearing loss in a study - see this paper for an example of this approach using converging evidence.
- Longitudinal data can help distinguish whether A causes B or B causes A.
- Training studies are particularly powerful, in allowing one to manipulate A and see if it changes B.
So what’s the bottom line? In general, correlational data
from small samples of clinical and control groups are inadequate for testing
causal models. They can lead to type I errors, where pooling data leads to a
spurious association between variables, but also to type II errors, where a
genuine association is discounted because it isn’t evident within subject
groups. For the field to move forward, we need to go beyond correlational data.
P.S. 9th July 2012
I've written a little tutorial on simulating data using R to illustrate some of these points. No prior knowledge of R required. See: http://tinyurl.com/d2868cg
Bishop, D. V. M., Hardiman, M. J., & Barry, J. G. (2012). Auditory deficit as a consequence rather than endophenotype of specific language impairment: electrophysiological evidence. PLoS ONE, 7(5). PMID: 22662112
If you liked this post, you may also be interested in my other posts on statistical topics:
Getting genetic effect sizes in perspective
The joys of inventing data
A short nerdy post about the use of percentiles
The difference between p < .05 and a screening test