Compositional data explained

In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Mathematically, compositional data are represented by points on a simplex. Measurements involving probabilities, proportions, percentages, and parts per million (ppm) can all be thought of as compositional data.

Ternary plot

Compositional data in three variables can be plotted via ternary plots. The use of a barycentric plot on three variables graphically depicts the ratios of the three variables as positions in an equilateral triangle.

Simplicial sample space

In 1982, John Aitchison defined compositional data as proportions of some whole.[1] In particular, a compositional data point (or composition for short) can be represented by a real vector with positive components. The sample space of compositional data is a simplex:

\mathcal{S}^D=\left\{x=[x_1,x_2,\dots,x_D]\in\mathbb{R}^D\,\left|\,x_i>0,\ i=1,2,\dots,D;\ \sum_{i=1}^{D}x_i=\kappa\right.\right\}.

The only information is given by the ratios between components, so the information of a composition is preserved under multiplication by any positive constant. Therefore, the sample space of compositional data can always be assumed to be a standard simplex, i.e.

\kappa=1

. In this context, normalization to the standard simplex is called closure and is denoted by

\mathcal{C}[\,\cdot\,]

:

\mathcal{C}[x_1,x_2,\dots,x_D]=\left[\frac{x_1}{\sum_{i=1}^{D}x_i},\frac{x_2}{\sum_{i=1}^{D}x_i},\dots,\frac{x_D}{\sum_{i=1}^{D}x_i}\right],

where D is the number of parts (components) and

[\,\cdot\,]

denotes a row vector.
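
As an illustration of closure, the following minimal Python sketch rescales a vector of strictly positive parts so that it sums to 1. The function name `closure` and the sample values are assumptions made for this example; they do not come from any particular library.

```python
import math

def closure(parts):
    """Rescale strictly positive parts so that they sum to 1 (kappa = 1)."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = math.fsum(parts)
    return [p / total for p in parts]

# Hypothetical measurements in ppm, and the same measurements scaled by 3.
ppm = [20.0, 30.0, 50.0]
scaled = [3.0 * p for p in ppm]

composition = closure(ppm)
composition_from_scaled = closure(scaled)
```

Both inputs close to the same point of the standard simplex, which reflects the statement above that multiplying by a positive constant does not change the information in a composition.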

Aitchison geometry

The simplex can be given the structure of a vector space in several different ways. The following vector space structure is called Aitchison geometry or the Aitchison simplex and has the following operations:

Perturbation (vector addition)

x\oplus y=\left[\frac{x_1y_1}{\sum_{i=1}^{D}x_iy_i},\frac{x_2y_2}{\sum_{i=1}^{D}x_iy_i},\dots,\frac{x_Dy_D}{\sum_{i=1}^{D}x_iy_i}\right]=\mathcal{C}[x_1y_1,\ldots,x_Dy_D]\qquad\forall x,y\in\mathcal{S}^D

Powering (scalar multiplication)

\alpha\odot x=\left[\frac{x_1^\alpha}{\sum_{i=1}^{D}x_i^\alpha},\frac{x_2^\alpha}{\sum_{i=1}^{D}x_i^\alpha},\ldots,\frac{x_D^\alpha}{\sum_{i=1}^{D}x_i^\alpha}\right]=\mathcal{C}[x_1^\alpha,\ldots,x_D^\alpha]\qquad\forall x\in\mathcal{S}^D,\ \alpha\in\mathbb{R}

Inner product

\langle x,y\rangle=\frac{1}{2D}\sum_{i=1}^{D}\sum_{j=1}^{D}\log\frac{x_i}{x_j}\log\frac{y_i}{y_j}\qquad\forall x,y\in\mathcal{S}^D

Endowed with these operations, the Aitchison simplex forms a

(D-1)

-dimensional Euclidean inner product space. The uniform composition

\left[\frac{1}{D},\dots,\frac{1}{D}\right]

is the zero vector.
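
The three operations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`closure`, `perturb`, `power`, `inner`) and the sample compositions are chosen for this example only.

```python
import math

def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = math.fsum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    """Perturbation: componentwise product followed by closure."""
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(alpha, x):
    """Powering: raise each part to the power alpha, then close."""
    return closure([xi ** alpha for xi in x])

def inner(x, y):
    """Aitchison inner product, computed from the double-sum definition."""
    d = len(x)
    total = math.fsum(
        math.log(x[i] / x[j]) * math.log(y[i] / y[j])
        for i in range(d) for j in range(d)
    )
    return total / (2 * d)

# Hypothetical compositions with D = 3 parts.
x = closure([1.0, 2.0, 7.0])
y = closure([4.0, 1.0, 5.0])
zero = [1.0 / 3.0] * 3                    # uniform composition

unchanged = perturb(x, zero)              # adding the zero vector
cancelled = perturb(x, power(-1.0, x))    # x plus its additive inverse
```

Perturbing by the uniform composition leaves `x` unchanged, and perturbing `x` by `power(-1.0, x)` returns the uniform composition, as expected of a zero vector and an additive inverse.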

Orthonormal bases

Since the Aitchison simplex forms a finite-dimensional Hilbert space, it is possible to construct orthonormal bases in the simplex. Every composition

x

can be decomposed as follows

x=\bigoplus_{i=1}^{D-1}x_i^*\odot e_i

where

e_1,\ldots,e_{D-1}

forms an orthonormal basis in the simplex. The values

x_i^*,\quad i=1,2,\ldots,D-1

are the (orthonormal and Cartesian) coordinates of

x

with respect to the given basis. They are called isometric log-ratio coordinates

(\operatorname{ilr})

.
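
The decomposition can be checked numerically for D = 3. The sketch below assumes one particular orthonormal basis, obtained by closing the exponentials of two orthonormal sum-zero contrast vectors. This basis and all function names are assumptions made for illustration, since many other bases are possible.

```python
import math

def closure(parts):
    total = math.fsum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(alpha, x):
    return closure([xi ** alpha for xi in x])

def inner(x, y):
    """Aitchison inner product from the double-sum definition."""
    d = len(x)
    total = math.fsum(
        math.log(x[i] / x[j]) * math.log(y[i] / y[j])
        for i in range(d) for j in range(d)
    )
    return total / (2 * d)

# Assumed basis for D = 3: closures of exponentiated orthonormal contrasts.
contrast_1 = (1 / math.sqrt(2), -1 / math.sqrt(2), 0.0)
contrast_2 = (1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6))
e1 = closure([math.exp(v) for v in contrast_1])
e2 = closure([math.exp(v) for v in contrast_2])

x = closure([1.0, 2.0, 7.0])

# Coordinates of x with respect to (e1, e2), one inner product each.
coords = [inner(x, e1), inner(x, e2)]

# Rebuild x as (coords[0] powering e1) perturbed by (coords[1] powering e2).
rebuilt = perturb(power(coords[0], e1), power(coords[1], e2))
```

Two coordinates are enough to recover the three-part composition, in line with the simplex being (D-1)-dimensional.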

Linear transformations

There are three well-characterized isomorphisms that transform from the Aitchison simplex to real space. All of these transforms are linear and are given below.

Additive log ratio transform

The additive log ratio (alr) transform is an isomorphism where

\operatorname{alr}:\mathcal{S}^D\rightarrow\mathbb{R}^{D-1}

. This is given by

\operatorname{alr}(x)=\left[\log\frac{x_1}{x_D},\ldots,\log\frac{x_{D-1}}{x_D}\right]

The choice of denominator component is arbitrary, and could be any specified component. This transform is commonly used in chemistry with measurements such as pH. In addition, this is the transform most commonly used for multinomial logistic regression. The alr transform is not an isometry, meaning that distances between transformed values are not equal to distances between the original compositions in the simplex.
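
A minimal Python sketch of the alr transform and its inverse follows. The `ref` parameter for choosing the denominator, the helper names, and the sample compositions are assumptions for this example. The distance in the simplex is taken as the norm induced by the Aitchison inner product defined earlier.

```python
import math

def alr(x, ref=None):
    """Additive log-ratio with part `ref` as denominator (default: last part)."""
    ref = len(x) - 1 if ref is None else ref
    return [math.log(xi / x[ref]) for i, xi in enumerate(x) if i != ref]

def alr_inv(z):
    """Inverse of alr when the last part was used as the denominator."""
    expz = [math.exp(v) for v in z] + [1.0]
    total = math.fsum(expz)
    return [v / total for v in expz]

def aitchison_distance(x, y):
    """Distance induced by the Aitchison inner product."""
    d = len(x)
    total = math.fsum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(d) for j in range(d)
    )
    return math.sqrt(total / (2 * d))

def euclidean_distance(u, v):
    return math.sqrt(math.fsum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical compositions with D = 3 parts.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

round_trip = alr_inv(alr(x))
dist_alr = euclidean_distance(alr(x), alr(y))
dist_simplex = aitchison_distance(x, y)
```

For these two compositions `dist_alr` and `dist_simplex` differ, which illustrates that alr does not preserve distances. Choosing a different `ref` gives different coordinates for the same composition.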

Center log ratio transform

The center log ratio (clr) transform is both an isomorphism and an isometry where

\operatorname{clr}:\mathcal{S}^D\rightarrow U,\quad U\subset\mathbb{R}^D

\operatorname{clr}(x)=\left[\log\frac{x_1}{g(x)},\ldots,\log\frac{x_D}{g(x)}\right]

where

g(x)

is the geometric mean of

x

. The inverse of this function is also known as the softmax function.
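
The relationship between clr and softmax can be sketched as follows. The function names and the sample composition are assumptions for this example. The softmax is written with the usual max-shift for numerical stability, which does not change its value.

```python
import math

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean g(x)."""
    log_x = [math.log(xi) for xi in x]
    mean_log = math.fsum(log_x) / len(x)   # equals log g(x)
    return [v - mean_log for v in log_x]

def softmax(z):
    """Exponentiate and close; maps clr coordinates back to the simplex."""
    shift = max(z)
    expz = [math.exp(v - shift) for v in z]
    total = math.fsum(expz)
    return [v / total for v in expz]

# Hypothetical composition with D = 3 parts.
x = [0.2, 0.3, 0.5]
z = clr(x)
recovered = softmax(z)

# Adding a constant to every coordinate leaves the softmax output unchanged.
shifted = softmax([v + 5.0 for v in z])
```

The clr coordinates sum to zero, so they lie in a (D-1)-dimensional subspace of real D-space. Softmax inverts clr on that subspace only: as `shifted` shows, vectors that differ by a constant map to the same composition.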

Isometric log ratio transform

The isometric log ratio (ilr) transform is both an isomorphism and an isometry where

\operatorname{ilr}:\mathcal{S}^D\rightarrow\mathbb{R}^{D-1}

\operatorname{ilr}(x)=[\langle x,e_1\rangle,\ldots,\langle x,e_{D-1}\rangle]

There are multiple ways to construct orthonormal bases, including using the Gram–Schmidt orthogonalization or singular-value decomposition of clr transformed data. Another alternative is to construct log contrasts from a bifurcating tree. If we are given a bifurcating tree, we can construct a basis from the internal nodes in the tree.

Each vector in the basis would be determined as follows

e_\ell=\mathcal{C}[\exp(\underbrace{0,\ldots,0}_k,\underbrace{a,\ldots,a}_r,\underbrace{b,\ldots,b}_s,\underbrace{0,\ldots,0}_t)]

The elements within each vector are given as follows

a=\frac{\sqrt{s}}{\sqrt{r(r+s)}}\quad\text{and}\quad b=\frac{-\sqrt{r}}{\sqrt{s(r+s)}}

where

k,r,s,t

are the respective numbers of tips in the corresponding subtrees shown in the figure. It can be shown that the resulting basis is orthonormal.

Once the basis

\Psi

is built, the ilr transform can be calculated as follows

\operatorname{ilr}(x)=\operatorname{clr}(x)\Psi^T

where each element in the ilr transformed data is of the following form

b_i=\sqrt{\frac{rs}{r+s}}\log\frac{g(x_R)}{g(x_S)}

where

x_R

and

x_S

are the sets of values corresponding to the tips in the subtrees

R

and

S

.
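
The tree-based construction can be sketched for D = 4. The tree ((x1, x2), (x3, x4)), the function names, and the sample composition are hypothetical. The sketch also assumes that the rows of the basis matrix are the coefficient vectors (0, ..., a, ..., b, ..., 0) themselves, that is, the clr coordinates of the basis compositions.

```python
import math

def clr(x):
    log_x = [math.log(xi) for xi in x]
    mean_log = math.fsum(log_x) / len(x)
    return [v - mean_log for v in log_x]

def balance_row(k, r, s, t):
    """Coefficients (0..0, a..a, b..b, 0..0) for one internal node."""
    a = math.sqrt(s) / math.sqrt(r * (r + s))
    b = -math.sqrt(r) / math.sqrt(s * (r + s))
    return [0.0] * k + [a] * r + [b] * s + [0.0] * t

# Hypothetical tree ((x1, x2), (x3, x4)): one (k, r, s, t) per internal node.
nodes = [(0, 2, 2, 0),   # root: {x1, x2} against {x3, x4}
         (0, 1, 1, 2),   # left node: x1 against x2
         (2, 1, 1, 0)]   # right node: x3 against x4
psi = [balance_row(*node) for node in nodes]   # (D-1) x D basis matrix

def dot(u, v):
    return math.fsum(a * b for a, b in zip(u, v))

def ilr(x):
    """ilr(x) = clr(x) Psi^T, written as one dot product per basis row."""
    z = clr(x)
    return [dot(z, row) for row in psi]

def geometric_mean(values):
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))

x = [0.1, 0.2, 0.3, 0.4]
coords = ilr(x)

# Root coordinate computed directly from the log-ratio of geometric means.
r, s = 2, 2
root_balance = math.sqrt(r * s / (r + s)) * math.log(
    geometric_mean(x[:2]) / geometric_mean(x[2:])
)
```

In this sketch the first ilr coordinate agrees with `root_balance`, and the rows of `psi` are mutually orthogonal with unit length.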

Notes and References

  1. Aitchison, John (1982). The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological). 44 (2): 139–177. doi:10.1111/j.2517-6161.1982.tb01195.x.
  2. Olea, Ricardo A.; Martín-Fernández, Josep A.; Craddock, William H. (2021). Multivariate classification of the crude oil petroleum systems in southeast Texas, USA, using conventional and compositional analysis of biomarkers. In Filzmoser, P.; Hron, K.; Palarea-Albaladejo, J.; Martín-Fernández, J.A. (eds.), Advances in Compositional Data Analysis: Festschrift in honor of Vera Pawlowsky-Glahn. Springer. pp. 303–327.