Correspondence analysis explained

Correspondence analysis (CA) is a multivariate statistical technique proposed^[1] by Herman Otto Hartley (Hirschfeld)^[2] and later developed by Jean-Paul Benzécri.^[3] It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. Like principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied with a focus on either the rows or the columns, it should strictly be called simple (symmetric) correspondence analysis.^[4]

It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data, provided the presence/absence coding represents simplified count data, i.e. a 1 describes a positive count and a 0 stands for a count of zero. Depending on the scores used, CA preserves the chi-square distance^[5] ^[6] between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of whether a chi-squared test is significant.^[7] ^[8] Although the $\chi^2$ statistic used in inferential statistics and the chi-square distance are computationally related, they should not be confused: the latter works as a multivariate statistical distance measure in CA, while the $\chi^2$ statistic is a scalar, not a metric.^[9]

Details

Like principal component analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table, i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix $C$ of size $m \times n$, where $m$ is the number of rows and $n$ is the number of columns. In the following mathematical description of the method, capital letters in italics refer to matrices while lowercase letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra.

Preprocessing

Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed.^[10] First compute a set of weights for the columns and the rows (sometimes called masses),^[7] ^[11] where row and column weights are given by the row and column vectors, respectively:

$$w_m = \frac{1}{n_C} C \mathbf{1}, \quad w_n = \frac{1}{n_C} \mathbf{1}^T C.$$

Here

$$n_C = \sum_{i=1}^{m} \sum_{j=1}^{n} C_{ij}$$

is the sum of all cell values in matrix $C$, or in short the sum of $C$, and $\mathbf{1}$ is a column vector of ones with the appropriate dimension.

Put simply, $w_m$ is a vector whose elements are the row sums of $C$ divided by the sum of $C$, and $w_n$ is a vector whose elements are the column sums of $C$ divided by the sum of $C$.
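The masses can be sketched in a few lines of NumPy. The 4 × 3 table `C` below is a made-up example, not data from the text, and the variable names simply mirror the symbols used above.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

n_C = C.sum()                     # sum of all cell values
ones_cols = np.ones(C.shape[1])   # vector of ones, one entry per column
ones_rows = np.ones(C.shape[0])   # vector of ones, one entry per row

w_m = (C @ ones_cols) / n_C       # row masses: row sums divided by n_C
w_n = (ones_rows @ C) / n_C       # column masses: column sums divided by n_C

print("row masses:", w_m)
print("column masses:", w_n)
```

Both mass vectors sum to 1, which is why they can be read as marginal proportions of the table.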

The weights are transformed into diagonal matrices

$$W_m = \operatorname{diag}(1/\sqrt{w_m})$$

and

$$W_n = \operatorname{diag}(1/\sqrt{w_n}),$$

where the diagonal elements of $W_n$ are $1/\sqrt{w_n}$ and those of $W_m$ are $1/\sqrt{w_m}$, i.e. the diagonal elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0.

Next, compute matrix $P$ by dividing $C$ by its sum:

$$P = \frac{1}{n_C} C.$$

Put simply, matrix $P$ is the data matrix (contingency table or binary table) transformed into proportions, i.e. each cell value is the cell's share of the sum of the whole table.

Finally, compute matrix $S$, sometimes called the matrix of standardized residuals,^[10] by matrix multiplication as

$$S = W_m (P - w_m w_n) W_n.$$

Note that the vectors $w_m$ and $w_n$ are combined in an outer product, resulting in a matrix of the same dimensions as $P$. In words, the formula reads: matrix $\operatorname{outer}(w_m, w_n)$ is subtracted from matrix $P$ and the resulting matrix is scaled (weighted) by the diagonal matrices $W_m$ and $W_n$. Multiplying the resulting matrix by the diagonal matrices is equivalent to multiplying its $i$-th row (or column) by the $i$-th diagonal element of $W_m$ or $W_n$, respectively.^[12]
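The preprocessing steps above can be sketched as follows. The table `C` is a hypothetical example, and the sketch assumes every row and column has a positive sum, since the masses are inverted. The second computation, `S_elementwise`, illustrates the remark that multiplying by the diagonal matrices only rescales rows and columns.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

n_C = C.sum()
P = C / n_C                          # table of proportions
w_m = P.sum(axis=1)                  # row masses
w_n = P.sum(axis=0)                  # column masses

W_m = np.diag(1.0 / np.sqrt(w_m))    # diagonal weighting matrix for rows
W_n = np.diag(1.0 / np.sqrt(w_n))    # diagonal weighting matrix for columns

# Matrix of standardized residuals via matrix multiplication.
S = W_m @ (P - np.outer(w_m, w_n)) @ W_n

# Same result without building diagonal matrices: each cell is divided
# by the square root of the product of its row mass and column mass.
S_elementwise = (P - np.outer(w_m, w_n)) / np.sqrt(np.outer(w_m, w_n))

print(np.allclose(S, S_elementwise))
```

For large tables the element-wise form avoids allocating the two diagonal matrices, but the matrix form matches the notation of the text more directly.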

Interpretation of preprocessing

The vectors $w_m$ and $w_n$ are the row and column masses, or the marginal probabilities for the rows and columns, respectively. Subtracting matrix $\operatorname{outer}(w_m, w_n)$ from matrix $P$ is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix $\operatorname{outer}(w_m, w_n)$.

In fact, matrix $\operatorname{outer}(w_m, w_n)$ is identical with the matrix of expected frequencies in the chi-squared test, expressed as proportions of the table sum $n_C$. Therefore $S$ is computationally related to the independence model used in that test. But since CA is not an inferential method, the term independence model is inappropriate here.
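The computational link to the chi-squared test can be checked numerically. The sketch below uses a hypothetical table and computes the Pearson statistic by hand rather than through a statistics library. It compares the expected counts with $n_C \cdot \operatorname{outer}(w_m, w_n)$ and the statistic with $n_C$ times the sum of squared entries of $S$.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

n_C = C.sum()
P = C / n_C
w_m, w_n = P.sum(axis=1), P.sum(axis=0)

# Expected counts under independence: (row sum * column sum) / table sum.
E = np.outer(C.sum(axis=1), C.sum(axis=0)) / n_C

# Standardized residuals, written element-wise.
S = (P - np.outer(w_m, w_n)) / np.sqrt(np.outer(w_m, w_n))

# Pearson chi-squared statistic computed directly from counts.
chi2 = ((C - E) ** 2 / E).sum()

print(np.allclose(n_C * np.outer(w_m, w_n), E))   # same expected values
print(np.isclose(n_C * (S ** 2).sum(), chi2))     # same statistic, rescaled
```

No p-value is computed here, in keeping with the descriptive use of CA described in the text.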

Orthogonal components

The table $S$ is then decomposed by a singular value decomposition as

$$S = U \Sigma V^*,$$

where $U$ and $V$ are the left and right singular vectors of $S$ and $\Sigma$ is a square diagonal matrix with the singular values $\sigma_i$ of $S$ on the diagonal. $\Sigma$ is of dimension $p \leq (\min(m,n) - 1)$, hence $U$ is of dimension $m \times p$ and $V$ is of dimension $n \times p$. As orthonormal vectors, $U$ and $V$ fulfill

$$U^* U = V^* V = I.$$

In other words, the multivariate information that is contained in $C$ as well as in $S$ is now distributed across two (coordinate) matrices $U$ and $V$ and a diagonal (scaling) matrix $\Sigma$. The vector space defined by them has $p$ dimensions, that is, at most the smaller of the two values, number of rows and number of columns, minus 1.
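As a minimal sketch of this step, `numpy.linalg.svd` can be applied to the standardized residuals of a hypothetical table. NumPy returns $\min(m,n)$ singular values, so the sketch keeps only the first $p = \min(m,n) - 1$; the discarded one is numerically zero because of the centering.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

P = C / C.sum()
w_m, w_n = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(w_m, w_n)) / np.sqrt(np.outer(w_m, w_n))

# Thin SVD: U_full is m x min(m,n), Vt_full is min(m,n) x n.
U_full, sigma_full, Vt_full = np.linalg.svd(S, full_matrices=False)

p = min(C.shape) - 1              # maximum number of non-trivial dimensions
U = U_full[:, :p]                 # m x p left singular vectors
V = Vt_full[:p, :].T              # n x p right singular vectors
Sigma = np.diag(sigma_full[:p])   # p x p diagonal matrix of singular values

print("singular values:", sigma_full)
print(np.allclose(U @ Sigma @ V.T, S))   # S is recovered from p dimensions
```

Because the table is real-valued, the conjugate transpose $V^*$ reduces to the ordinary transpose `V.T`.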

Inertia

While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co)variance covered by the first few PCA axes, measured in eigenvalues, a CA works with a weighted (co)variance which is called inertia.^[13] The sum of the squared singular values is the total inertia $\mathrm{I}$ of the data table, computed as

$$\mathrm{I} = \sum_{i=1}^{p} \sigma_i^2.$$

The total inertia $\mathrm{I}$ of the data table can also be computed directly from $S$ as

$$\mathrm{I} = \sum_{i=1}^{m} \sum_{j=1}^{n} s_{ij}^2.$$

The amount of inertia covered by the $i$-th set of singular vectors is $\iota_i = \sigma_i^2$, the principal inertia. The higher the proportion of inertia covered by the first few singular vectors, i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore all principal inertia values are expressed as a proportion $\epsilon_i$ of the total inertia,

$$\epsilon_i = \frac{\sigma_i^2}{\sum_{k=1}^{p} \sigma_k^2},$$

and are presented in the form of a scree plot. In fact, a scree plot is just a bar plot of all principal inertia proportions $\epsilon_i$.
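The inertia computations can be sketched as below for a hypothetical table. The names `principal_inertia` and `explained` are illustrative choices, not established terminology from any package. A scree plot would simply be a bar chart of `explained`.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

P = C / C.sum()
w_m, w_n = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(w_m, w_n)) / np.sqrt(np.outer(w_m, w_n))

p = min(C.shape) - 1
sigma = np.linalg.svd(S, compute_uv=False)[:p]   # first p singular values

principal_inertia = sigma ** 2                   # one value per dimension
total_inertia = principal_inertia.sum()          # sum of squared singular values
total_from_S = (S ** 2).sum()                    # same quantity, taken from S

explained = principal_inertia / total_inertia    # proportions for a scree plot

print("total inertia:", total_inertia)
print("proportions:", explained)
```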

Coordinates

To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns, an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA textbooks. If principal coordinates are used for rows, their visualization is called a row isometric^[14] scaling in econometrics and scaling 1^[15] in ecology. Since the weighting includes the singular values $\Sigma$ of the matrix of standardized residuals $S$, these coordinates are sometimes referred to as singular value scaled singular vectors or, somewhat misleadingly, as eigenvalue scaled eigenvectors. In fact, the non-trivial eigenvectors of $SS^*$ are the left singular vectors $U$ of $S$ and those of $S^*S$ are the right singular vectors $V$ of $S$, while the eigenvalues of either of these matrices are the squares of the singular values $\Sigma$. But since all modern algorithms for CA are based on a singular value decomposition, this terminology should be avoided. In the French tradition of CA the coordinates are sometimes called (factor) scores.

Factor scores or principal coordinates for the rows of matrix $C$ are computed by

$$F_m = W_m U \Sigma,$$

i.e. the left singular vectors are scaled by the inverse of the square roots of the row masses and by the singular values. Because principal coordinates are computed using singular values, they contain the information about the spread between the rows (or columns) in the original table. Computing the Euclidean distances between the entities in principal coordinates results in values that equal their chi-square distances, which is the reason why CA is said to "preserve chi-square distances".

Compute principal coordinates for the columns by

$$F_n = W_n V \Sigma.$$

To represent the result of CA in a proper biplot, those categories which are not plotted in principal coordinates, i.e. in chi-square distance preserving coordinates, should be plotted in so-called standard coordinates. They are called standard coordinates because each vector of standard coordinates has been standardized to exhibit mean 0 and variance 1.^[16] When computing standard coordinates the singular values are omitted. This follows directly from the biplot rule: if one set of singular vectors has been scaled by the singular values, the other set must be scaled by the singular values raised to the power of zero, i.e. multiplied by one, which amounts to omitting the singular values. This ensures the existence of an inner product between the two sets of coordinates, i.e. it leads to meaningful interpretations of their spatial relations in a biplot.

In practical terms one can think of the standard coordinates as the vertices of the vector space in which the set of principal coordinates (i.e. the respective points) "exists".^[17] The standard coordinates for the rows are

$$G_m = W_m U$$

and those for the columns are

$$G_n = W_n V.$$

Note that a scaling 1 biplot in ecology implies the rows to be in principal and the columns to be in standard coordinates, while scaling 2 implies the rows to be in standard and the columns to be in principal coordinates. That is, scaling 1 implies a biplot of $F_m$ together with $G_n$, while scaling 2 implies a biplot of $F_n$ together with $G_m$.
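The four coordinate matrices can be sketched as follows for a hypothetical table. The helper `chi2_distance` is an illustrative function written for this example. It computes the chi-square distance between two row profiles directly, so that it can be compared with the Euclidean distance in principal coordinates. The last lines check that the standard coordinates have mean 0 and variance 1 when weighted by the masses.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

P = C / C.sum()
w_m, w_n = P.sum(axis=1), P.sum(axis=0)
W_m = np.diag(1.0 / np.sqrt(w_m))
W_n = np.diag(1.0 / np.sqrt(w_n))
S = W_m @ (P - np.outer(w_m, w_n)) @ W_n

p = min(C.shape) - 1
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
U, Sigma, V = U[:, :p], np.diag(sigma[:p]), Vt[:p, :].T

F_m = W_m @ U @ Sigma   # principal coordinates of the rows
F_n = W_n @ V @ Sigma   # principal coordinates of the columns
G_m = W_m @ U           # standard coordinates of the rows
G_n = W_n @ V           # standard coordinates of the columns

profiles = P / w_m[:, None]   # row profiles: each row of P divided by its mass

def chi2_distance(i, k):
    """Chi-square distance between row profiles i and k."""
    diff = profiles[i] - profiles[k]
    return np.sqrt((diff ** 2 / w_n).sum())

d_euclid = np.linalg.norm(F_m[0] - F_m[1])   # distance in principal coordinates
d_chi2 = chi2_distance(0, 1)                 # distance computed from profiles
print(np.isclose(d_euclid, d_chi2))

weighted_mean = w_m @ G_m                            # expected to be ~0
weighted_var = (w_m[:, None] * G_m ** 2).sum(axis=0) # expected to be ~1
print(weighted_mean, weighted_var)
```

The distance match is exact only when all $p$ dimensions are kept; a two-dimensional plot of a larger table approximates these distances.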

Graphical representation of result

The visualization of a CA result always starts with displaying the scree plot of the principal inertia values to evaluate the success of summarizing spread by the first few singular vectors.

The actual ordination is presented in a graph which could, at first look, be confused with a complicated scatter plot. In fact it consists of two scatter plots printed one upon the other, one set of points for the rows and one for the columns. But since it is a biplot, a clear interpretation rule relates the two coordinate matrices used.

Usually the first two dimensions of the CA solution are plotted because they encompass the maximum of information about the data table that can be displayed in 2D although other combinations of dimensions may be investigated by a biplot. A biplot is in fact a low dimensional mapping of a part of the information contained in the original table.

As a rule of thumb, the set (rows or columns) which should be analysed with respect to its composition as measured by the other set is displayed in principal coordinates, while the other set is displayed in standard coordinates. For example, a table displaying voting districts in rows and political parties in columns, with the cells containing the counted votes, may be displayed with the districts (rows) in principal coordinates when the focus is on ordering districts according to similar voting.

Traditionally, originating from the French tradition in CA,^[18] early CA biplots mapped both entities in the same coordinate version, usually principal coordinates, but this kind of display is misleading insofar as "Although this is called a biplot, it does not have any useful inner product relationship between the row and column scores", as Brian Ripley, maintainer of the R package MASS, correctly points out.^[19] Today that kind of display should be avoided, since laypeople are usually not aware of the lacking relation between the two point sets.

A scaling 1 biplot (rows in principal coordinates, columns in standard coordinates) is interpreted as follows:^[20]

The distances between row points approximate their chi-square distance. Points close to each other represent rows with very similar values in the original data table, i.e. they may exhibit rather similar frequencies in case of count data or closely related binary values in case of presence/absence data.
(Column) points in standard coordinates represent the vertices of the vector space, i.e. the outer corners of something that in multidimensional space has the shape of an irregular polyhedron. Project row points onto the line connecting the origin and the standard coordinate of a column; if the projected position along that connection line is close to the position of the standard coordinate, that row point is strongly associated with this column, i.e. in case of count data the row has a high frequency of that category and in case of presence/absence data the row is likely to exhibit a 1 in that column. Row points whose projection would require extending the connection line beyond the origin have a lower than average value in that column.
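The projection rule for a scaling 1 biplot can be illustrated numerically. The sketch below uses a hypothetical table and keeps all $p$ dimensions. Under that assumption the inner products between row points in principal coordinates and column points in standard coordinates equal the ratio of observed to expected proportions minus 1. A positive value corresponds to a projection on the column's side of the origin (above-average value), a negative value to a projection beyond the origin (below-average value).

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[10, 5, 2],
              [3, 8, 6],
              [1, 4, 12],
              [6, 6, 6]], dtype=float)

P = C / C.sum()
w_m, w_n = P.sum(axis=1), P.sum(axis=0)
W_m = np.diag(1.0 / np.sqrt(w_m))
W_n = np.diag(1.0 / np.sqrt(w_n))
S = W_m @ (P - np.outer(w_m, w_n)) @ W_n

p = min(C.shape) - 1
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
U, Sigma, V = U[:, :p], np.diag(sigma[:p]), Vt[:p, :].T

F_m = W_m @ U @ Sigma   # rows in principal coordinates (scaling 1)
G_n = W_n @ V           # columns in standard coordinates (scaling 1)

# Inner products between every row point and every column point.
inner = F_m @ G_n.T

# Observed proportion relative to the product of the masses, minus 1.
relative_excess = P / np.outer(w_m, w_n) - 1.0

print(np.allclose(inner, relative_excess))
print(np.sign(inner))   # +1: above-average cell, -1: below-average cell
```

When only the first two dimensions of a larger solution are plotted, the inner products approximate these values rather than reproduce them exactly.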

Extensions and applications

Several variants of CA are available, including detrended correspondence analysis (DCA) and canonical correspondence analysis (CCA). The latter (CCA) is used when there is information about possible causes for the similarities between the investigated entities. The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis. An adaptation of correspondence analysis to the problem of discrimination based upon qualitative variables (i.e., the equivalent of discriminant analysis for qualitative data) is called discriminant correspondence analysis or barycentric discriminant analysis.

In the social sciences, correspondence analysis, and particularly its extension multiple correspondence analysis, was made known outside France through French sociologist Pierre Bourdieu's application of it.^[21]

Implementations

The data visualization system Orange includes the module orngCA.
The statistical programming language R includes several packages that offer a function for (simple symmetric) correspondence analysis. Using the R notation [package_name::function_name], the packages and respective functions are: ade4::dudi.coa, ca::ca, ExPosition::epCA, FactoMineR::CA, MASS::corresp, vegan::cca. The easiest approach for beginners is ca::ca, as there is an extensive textbook^[22] accompanying that package.
The freeware PAST (PAleontological STatistics)^[23] offers (simple symmetric) correspondence analysis via the menu "Multivariate/Ordination/Correspondence (CA)".

External links

Greenacre, Michael (2008), La Práctica del Análisis de Correspondencias, BBVA Foundation, Madrid, Spanish translation of Correspondence Analysis in Practice, available for free download from BBVA Foundation publications
Greenacre, Michael (2010), Biplots in Practice, BBVA Foundation, Madrid, available for free download at multivariatestatistics.org

Notes and References

1. Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP.
2. Hirschfeld, H. O. (1935). "A connection between correlation and contingency". Proc. Cambridge Philosophical Society, 31, 520–524.
3. Benzécri, J.-P. (1973). L'Analyse des Données. Volume II. L'Analyse des Correspondances. Paris, France: Dunod.
4. Beh, Eric; Lombardo, Rosaria (2014). Correspondence Analysis. Theory, Practice and New Strategies. Chichester: Wiley. p. 120. ISBN 978-1-119-95324-1.
5. Greenacre, Michael (2007). Correspondence Analysis in Practice. Boca Raton: CRC Press. p. 204. ISBN 9781584886167.
6. Legendre, Pierre; Legendre, Louis (2012). Numerical Ecology. Amsterdam: Elsevier. p. 465. ISBN 978-0-444-53868-0.
7. Greenacre, Michael (1983). Theory and Applications of Correspondence Analysis. London: Academic Press. ISBN 0-12-299050-1.
8. Greenacre, Michael (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
9. Greenacre, Michael (2017). Correspondence Analysis in Practice (3rd ed.). Boca Raton: CRC Press. pp. 26–29. ISBN 9781498731775.
10. Greenacre, Michael (2007). Correspondence Analysis in Practice. Boca Raton: CRC Press. p. 202. ISBN 9781584886167.
11. Greenacre, Michael (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC. p. 202.
12. Abadir, Karim; Magnus, Jan (2005). Matrix algebra. Cambridge: Cambridge University Press. p. 24. ISBN 9786612394256.
13. Beh, Eric; Lombardo, Rosaria (2014). Correspondence Analysis. Theory, Practice and New Strategies. Chichester: Wiley. pp. 87, 129. ISBN 978-1-119-95324-1.
14. Beh, Eric; Lombardo, Rosaria (2014). Correspondence Analysis. Theory, Practice and New Strategies. Chichester: Wiley. pp. 132–134. ISBN 978-1-119-95324-1.
15. Legendre, Pierre; Legendre, Louis (2012). Numerical Ecology. Amsterdam: Elsevier. p. 470. ISBN 978-0-444-53868-0.
16. Greenacre, Michael (2017). Correspondence Analysis in Practice (3rd ed.). Boca Raton: CRC Press. p. 62. ISBN 9781498731775.
17. Blasius, Jörg (2001). Korrespondenzanalyse (in German). Berlin: Walter de Gruyter. pp. 40, 60. ISBN 9783486257304.
18. Greenacre, Michael (2017). Correspondence Analysis in Practice (3rd ed.). Boca Raton: CRC Press. p. 70. doi:10.1201/9781315369983. ISBN 9781498731775.
19. Ripley, Brian (2022-01-13). "MASS R package manual". R Package Documentation (rdrr.io). Section "Details". Retrieved 2022-03-17.
20. Borcard, Daniel; Gillet, Francois; Legendre, Pierre (2018). Numerical Ecology with R (2nd ed.). Cham: Springer. p. 175. doi:10.1007/978-3-319-71404-2. ISBN 9783319714042.
21. Bourdieu, Pierre (1984). Distinction. Routledge. p. 41. ISBN 0674212770.
22. Greenacre, Michael (2021). Correspondence Analysis in Practice (3rd ed.). London: CRC Press. ISBN 9780367782511.
23. Hammer, Øyvind. "Past 4 - the Past of the Future". Archived from the original at https://web.archive.org/web/20201101000539/https://www.nhm.uio.no/english/research/infrastructure/past/ on 2020-11-01. Retrieved 2021-09-14.
