Spurious correlation of ratios

In statistics, spurious correlation of ratios is a form of spurious correlation that arises between ratios of absolute measurements which themselves are uncorrelated.[1] [2]

The phenomenon of spurious correlation of ratios is one of the main motives for the field of compositional data analysis, which deals with the analysis of variables that carry only relative information, such as proportions, percentages and parts-per-million.[3] [4]

Spurious correlation is distinct from misconceptions about correlation and causality.

Illustration of spurious correlation

Pearson states a simple example of spurious correlation:

The scatter plot above illustrates this example using 500 observations of x, y, and z. Variables x, y and z are drawn from normal distributions with means 10, 10, and 30, respectively, and standard deviations 1, 1, and 3 respectively, i.e.,

\begin{align} x, y &\sim N(10, 1) \\ z &\sim N(30, 3) \end{align}

Even though x, y, and z are statistically independent and therefore uncorrelated, in the depicted typical sample the ratios x/z and y/z have a correlation of 0.53. This is because of the common divisor (z) and can be better understood if we colour the points in the scatter plot by the z-value. Triples (x, y, z) with relatively large z values tend to appear in the bottom left of the plot; triples with relatively small z values tend to appear in the top right.
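The effect can be reproduced with a short simulation. The following is a minimal sketch, assuming NumPy is available; the function name `simulate_ratio_correlation` and the seed are illustrative choices, not part of the original example, and the exact correlation obtained depends on the random sample drawn.

```python
import numpy as np


def simulate_ratio_correlation(n=500, seed=0):
    """Return (corr(x, y), corr(x/z, y/z)) for independent normal x, y, z."""
    rng = np.random.default_rng(seed)
    # Independent draws with the means and standard deviations given in the text.
    x = rng.normal(loc=10.0, scale=1.0, size=n)
    y = rng.normal(loc=10.0, scale=1.0, size=n)
    z = rng.normal(loc=30.0, scale=3.0, size=n)
    # Correlation of the absolute measurements, then of the ratios.
    raw = np.corrcoef(x, y)[0, 1]
    ratio = np.corrcoef(x / z, y / z)[0, 1]
    return raw, ratio


if __name__ == "__main__":
    raw, ratio = simulate_ratio_correlation()
    print(f"corr(x, y)     = {raw:+.2f}")
    print(f"corr(x/z, y/z) = {ratio:+.2f}")
```

With a few hundred observations, the correlation between x and y should be close to zero, while the correlation between x/z and y/z should be close to 0.5.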

Approximate amount of spurious correlation

Pearson derived an approximation of the correlation that would be observed between two indices (x_1/x_3 and x_2/x_4), i.e., ratios of the absolute measurements x_1, x_2, x_3, x_4:

\rho = \frac{r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\,\sqrt{v_2^2 + v_4^2 - 2 r_{24} v_2 v_4}}

where v_i is the coefficient of variation of x_i, and r_{ij} the Pearson correlation between x_i and x_j.
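Pearson's approximation can be evaluated directly from the coefficients of variation and the pairwise correlations. The following is a minimal sketch; the function name `pearson_ratio_correlation` and the dictionary-based inputs are assumptions made for illustration, not notation from Pearson's paper.

```python
import math


def pearson_ratio_correlation(v, r):
    """Pearson's approximate correlation between x1/x3 and x2/x4.

    v: dict mapping index i (1..4) to the coefficient of variation of x_i.
    r: dict mapping a pair (i, j), i < j, to the correlation of x_i and x_j.
    """
    numerator = (r[1, 2] * v[1] * v[2] - r[1, 4] * v[1] * v[4]
                 - r[2, 3] * v[2] * v[3] + r[3, 4] * v[3] * v[4])
    denom_a = math.sqrt(v[1] ** 2 + v[3] ** 2 - 2 * r[1, 3] * v[1] * v[3])
    denom_b = math.sqrt(v[2] ** 2 + v[4] ** 2 - 2 * r[2, 4] * v[2] * v[4])
    return numerator / (denom_a * denom_b)


if __name__ == "__main__":
    # Hypothetical inputs: equal coefficients of variation and a shared
    # divisor (x3 identical to x4, so r34 = 1), everything else uncorrelated.
    v = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1}
    r = {(1, 2): 0.0, (1, 3): 0.0, (1, 4): 0.0,
         (2, 3): 0.0, (2, 4): 0.0, (3, 4): 1.0}
    print(pearson_ratio_correlation(v, r))  # approximately 0.5
```

When all four measurements are mutually uncorrelated (r34 = 0 as well), the numerator vanishes and the approximation gives zero, so the nonzero value in the usage example comes entirely from the shared divisor.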

This expression can be simplified for situations where there is a common divisor by setting x_3 = x_4 and taking x_1, x_2, x_3 to be uncorrelated, giving the spurious correlation:

\rho_0 = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2}\,\sqrt{v_2^2 + v_3^2}}.
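The simplification can be spelled out as a substitution into the general approximation. The steps below are a sketch of that reasoning, under the stated assumptions that x_4 is identical to x_3 and that x_1, x_2, x_3 are mutually uncorrelated:

```latex
\begin{align}
x_4 = x_3 \;&\Rightarrow\; v_4 = v_3,\quad r_{34} = 1,\quad r_{14} = r_{13},\quad r_{24} = r_{23} \\
x_1, x_2, x_3 \text{ uncorrelated} \;&\Rightarrow\; r_{12} = r_{13} = r_{23} = 0 \\
\text{numerator: } & r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4 = v_3^2 \\
\text{denominator: } & \sqrt{v_1^2 + v_3^2 - 0}\,\sqrt{v_2^2 + v_3^2 - 0} = \sqrt{v_1^2 + v_3^2}\,\sqrt{v_2^2 + v_3^2}
\end{align}
```

Only the term involving the shared divisor survives in the numerator, which is why the resulting correlation is positive whenever the divisor varies at all.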

For the special case in which all coefficients of variation are equal (as is the case in the illustration above),

\rho_0 = 0.5
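As a worked check, the simulated example described earlier has means 10, 10, 30 and standard deviations 1, 1, 3, so each coefficient of variation (standard deviation divided by mean) is 0.1. Substituting a common value v into the common-divisor formula gives:

```latex
\begin{align}
v_1 = \tfrac{1}{10} = 0.1,\quad v_2 &= \tfrac{1}{10} = 0.1,\quad v_3 = \tfrac{3}{30} = 0.1 \\
\rho_0 = \frac{v^2}{\sqrt{v^2 + v^2}\,\sqrt{v^2 + v^2}} &= \frac{v^2}{2 v^2} = \frac{1}{2} \\
\text{numerically: } \frac{0.01}{\sqrt{0.02}\,\sqrt{0.02}} &= \frac{0.01}{0.02} = 0.5
\end{align}
```

The common value of v cancels, so the result does not depend on how large the coefficient of variation is, only on its being equal across the three measurements. The sample correlation of 0.53 reported for the 500 simulated observations is consistent with this approximate value.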

Relevance to biology and other sciences

Pearson was joined by Sir Francis Galton[5] and Walter Frank Raphael Weldon in cautioning scientists to be wary of spurious correlation, especially in biology where it is common[6] to scale or normalize measurements by dividing them by a particular variable or total. The danger Pearson saw was that conclusions would be drawn from correlations that are artifacts of the analysis method, rather than actual “organic” relationships.

However, it would appear that spurious correlation (and its potential to mislead) is not yet widely understood. In 1986 John Aitchison, who pioneered the log-ratio approach to compositional data analysis, wrote:

More recent publications suggest that this lack of awareness prevails, at least in molecular bioscience.[7] [8]

Notes and References

  1. Pearson, Karl (1896). "Mathematical Contributions to the Theory of Evolution – On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs". Proceedings of the Royal Society of London. 60 (359–367): 489–498. doi:10.1098/rspl.1896.0076. JSTOR 115879.
  2. Aldrich, John (1995). "Correlations Genuine and Spurious in Pearson and Yule". Statistical Science. 10 (4): 364–376. doi:10.1214/ss/1177009870.
  3. Aitchison, John (1986). The statistical analysis of compositional data. Chapman & Hall. ISBN 978-0-412-28060-3.
  4. Pawlowsky-Glahn, Vera; Buccianti, Antonella (eds.) (2011). Compositional Data Analysis: Theory and Applications. Wiley. doi:10.1002/9781119976462. ISBN 978-0470711354.
  5. Galton, Francis (1896). "Note to the memoir by Professor Karl Pearson, F.R.S., on spurious correlation". Proceedings of the Royal Society of London. 60 (359–367): 498–502. doi:10.1098/rspl.1896.0077. S2CID 170846631.
  6. Jackson, DA; Somers, KM (1991). "The Spectre of 'Spurious' Correlation". Oecologia. 86 (1): 147–151. Bibcode:1991Oecol..86..147J. doi:10.1007/bf00317404. JSTOR 4219582. PMID 28313173. S2CID 1116627.
  7. Lovell, David; Müller, Warren; Taylor, Jen; Zwart, Alec; Helliwell, Chris (2011). "Chapter 14: Proportions, Percentages, PPM: Do the Molecular Biosciences Treat Compositional Data Right?". In Pawlowsky-Glahn, Vera; Buccianti, Antonella (eds.). Compositional Data Analysis: Theory and Applications. Wiley. doi:10.1002/9781119976462. ISBN 9780470711354.
  8. Lovell, David; Pawlowsky-Glahn, Vera; Egozcue, Juan José; Marguerat, Samuel; Bähler, Jürg (16 March 2015). "Proportionality: A Valid Alternative to Correlation for Relative Data". PLOS Computational Biology. 11 (3): e1004075. Bibcode:2015PLSCB..11E4075L. doi:10.1371/journal.pcbi.1004075. PMC 4361748. PMID 25775355.