Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors plus "error" terms, hence factor analysis can be thought of as a special case of errors-in-variables models.[1]
Simply put, the factor loading of a variable quantifies the extent to which the variable is related to a given factor.[2]
A common rationale behind factor analytic methods is that the information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Factor analysis is commonly used in psychometrics, personality psychology, biology, marketing, product management, operations research, finance, and machine learning. It may help to deal with data sets where there are large numbers of observed variables that are thought to reflect a smaller number of underlying/latent variables. It is one of the most commonly used inter-dependency techniques and is used when the relevant set of variables shows a systematic inter-dependence and the objective is to find out the latent factors that create a commonality.
The model attempts to explain a set of $p$ observations in each of $n$ individuals with a set of $k$ common factors ($f_{i,j}$), where there are fewer factors per unit than observations per unit ($k<p$). Each individual has $k$ of their own common factors, and these are related to the observations via the factor loading matrix ($L\in\mathbb{R}^{p\times k}$), for a single observation, according to

$x_{i,m} - \mu_i = l_{i,1} f_{1,m} + \dots + l_{i,k} f_{k,m} + \varepsilon_{i,m}$

where
- $x_{i,m}$ is the value of the $i$th observation of the $m$th individual,
- $\mu_i$ is the observation mean for the $i$th observation,
- $l_{i,j}$ is the loading for the $i$th observation of the $j$th factor,
- $f_{j,m}$ is the value of the $j$th factor of the $m$th individual,
- $\varepsilon_{i,m}$ is the $(i,m)$th unobserved stochastic error term with mean zero and finite variance.
In matrix notation,

$X - M = LF + \varepsilon$

where observation matrix $X\in\mathbb{R}^{p\times n}$, loading matrix $L\in\mathbb{R}^{p\times k}$, factor matrix $F\in\mathbb{R}^{k\times n}$, error term matrix $\varepsilon\in\mathbb{R}^{p\times n}$, and mean matrix $M\in\mathbb{R}^{p\times n}$, whose $(i,m)$th element is simply $M_{i,m}=\mu_i$.
Also we will impose the following assumptions on $F$:
1. $F$ and $\varepsilon$ are independent.
2. $E(F)=0$, where $E$ is the expectation.
3. $\mathrm{Cov}(F)=I$, where $\mathrm{Cov}$ is the covariance matrix and $I$ is the $k\times k$ identity matrix, to make sure that the factors are uncorrelated.
Suppose $\mathrm{Cov}(X-M)=\Sigma$. Then

$\Sigma = \mathrm{Cov}(X-M) = \mathrm{Cov}(LF+\varepsilon),$

and therefore, from conditions 1 and 2 imposed on $F$ above, $E[LF]=L\,E[F]=0$ and $\mathrm{Cov}(LF+\varepsilon)=\mathrm{Cov}(LF)+\mathrm{Cov}(\varepsilon)$, so that

$\Sigma = L\,\mathrm{Cov}(F)\,L^T + \mathrm{Cov}(\varepsilon),$

or, setting $\Psi := \mathrm{Cov}(\varepsilon)$,

$\Sigma = LL^T + \Psi.$
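The decomposition $\Sigma = LL^T + \Psi$ can be checked numerically by simulating data from the model. The following sketch (dimensions, seed, and noise scales are arbitrary choices for illustration, not taken from the text) draws factors with identity covariance and independent diagonal-covariance errors, then compares the sample covariance of the simulated observations with the implied model covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

p, k = 6, 2                                   # 6 observed variables, 2 factors
L = rng.normal(size=(p, k))                   # hypothetical loading matrix
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))  # diagonal error covariance

# Implied model covariance: Sigma = L L^T + Psi
Sigma = L @ L.T + Psi

# Simulate n individuals from the model: x = L f + eps (means taken as zero)
n = 200_000
F = rng.normal(size=(k, n))                   # Cov(F) = I, E(F) = 0
eps = rng.normal(scale=np.sqrt(np.diag(Psi))[:, None], size=(p, n))
X = L @ F + eps

# The sample covariance of X should approach Sigma as n grows
Sigma_hat = np.cov(X)
err = np.abs(Sigma_hat - Sigma).max()
print(err)   # small, shrinking like 1/sqrt(n)
```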
Note that for any orthogonal matrix $Q$, if we set $L' = LQ$ and $F' = Q^T F$, the criteria for being factors and factor loadings still hold. Hence a set of factors and factor loadings is unique only up to an orthogonal transformation.
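This rotational indeterminacy is easy to verify numerically: for any orthogonal $Q$, the rotated loadings $L' = LQ$ imply exactly the same covariance, since $L'L'^T = LQQ^TL^T = LL^T$. A small sketch (the loading values and rotation angle are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2
L = rng.normal(size=(p, k))       # arbitrary hypothetical loadings

# An arbitrary 2-D rotation serves as the orthogonal matrix Q
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
L_prime = L @ Q

# Q is orthogonal, and the implied covariance L L^T is unchanged,
# so the observed data cannot distinguish L from L' = LQ.
orthogonal = bool(np.allclose(Q.T @ Q, np.eye(k)))
same_cov = bool(np.allclose(L_prime @ L_prime.T, L @ L.T))
print(orthogonal, same_cov)
```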
Suppose a psychologist has the hypothesis that there are two kinds of intelligence, "verbal intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for the hypothesis is sought in the examination scores from each of 10 different academic fields of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables. The psychologist's hypothesis may say that for each of the 10 academic fields, the score averaged over the group of all students who share some common pair of values for verbal and mathematical "intelligences" is some constant times their level of verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a linear combination of those two "factors". The numbers for a particular subject, by which the two kinds of intelligence are multiplied to obtain the expected score, are posited by the hypothesis to be the same for all intelligence level pairs, and are called "factor loading" for this subject. For example, the hypothesis may hold that the predicted average student's aptitude in the field of astronomy is
{10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}.
The numbers 10 and 6 are the factor loadings associated with astronomy. Other academic subjects may have different factor loadings.
Two students assumed to have identical degrees of verbal and mathematical intelligence may have different measured aptitudes in astronomy because individual aptitudes differ from average aptitudes (predicted above) and because of measurement error itself. Such differences make up what is collectively called the "error" — a statistical term that means the amount by which an individual, as measured, differs from what is average for or predicted by his or her levels of intelligence (see errors and residuals in statistics).
The observable data that go into factor analysis would be 10 scores of each of the 1000 students, a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each student must be inferred from the data.
In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $p$, which is equal to $10$ in the above example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $k$, which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N$. In the example above, a sample of $N=1000$ students took the $p=10$ exams, and the $i$th student's score for the $a$th exam is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ are a particular instance, or set of observations. In order to place the variables on an equal footing, they are standardized as standard scores $z$:

$z_{ai} = \frac{x_{ai}-\hat\mu_a}{\hat\sigma_a}$

where the sample mean is $\hat\mu_a=\tfrac{1}{N}\sum_i x_{ai}$ and the sample variance is given by $\hat\sigma_a^2=\tfrac{1}{N-1}\sum_i (x_{ai}-\hat\mu_a)^2$.
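The standardization step can be sketched as follows (the raw scores here are simulated, purely to illustrate the computation):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 1000, 10                         # as in the example: 1000 students, 10 exams
x = rng.normal(70, 10, size=(p, N))     # hypothetical raw exam scores

mu_hat = x.mean(axis=1, keepdims=True)            # sample means, one per exam
sigma_hat = x.std(axis=1, ddof=1, keepdims=True)  # sample std with N-1 divisor
z = (x - mu_hat) / sigma_hat                      # standard scores z_ai

# Each standardized row now has mean 0 and unit sample variance
means_zero = bool(np.allclose(z.mean(axis=1), 0))
unit_var = bool(np.allclose(z.var(axis=1, ddof=1), 1))
print(means_zero, unit_var)
```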
$\begin{matrix} z_{1,i} &=& \ell_{1,1}F_{1,i} &+& \ell_{1,2}F_{2,i} &+& \varepsilon_{1,i}\\ \vdots && \vdots && \vdots && \vdots\\ z_{10,i} &=& \ell_{10,1}F_{1,i} &+& \ell_{10,2}F_{2,i} &+& \varepsilon_{10,i} \end{matrix}$
or, more succinctly:
$z_{ai} = \sum_p \ell_{ap} F_{pi} + \varepsilon_{ai}$

where
- $F_{1i}$ is the $i$th student's "verbal intelligence",
- $F_{2i}$ is the $i$th student's "mathematical intelligence",
- $\ell_{ap}$ are the factor loadings for the $a$th subject, for $p=1,2$.
In matrix notation, we have

$Z = LF + \varepsilon.$

Since rescaling a factor and inversely rescaling its loadings leaves the model unchanged, no generality is lost by taking the factors to be standardized and mutually uncorrelated, i.e., normalized so that

$\sum_i F_{pi}F_{qi} = \delta_{pq},$

where $\delta_{pq}$ is the Kronecker delta ($0$ when $p\neq q$ and $1$ when $p=q$). The errors are assumed to be independent of the factors:

$\sum_i F_{pi}\varepsilon_{ai} = 0.$
The values of the loadings $L$, the averages $\mu$, and the variances of the "errors" $\varepsilon$ must be estimated given the observed data $X$ and $F$. The "fundamental theorem" may be derived from the above conditions:

$\sum_i z_{ai}z_{bi} = \sum_j \ell_{aj}\ell_{bj} + \sum_i \varepsilon_{ai}\varepsilon_{bi}.$

The term on the left is the $(a,b)$-element of the correlation matrix of the observed data (a $p \times p$ matrix computed from the $p \times N$ matrix of standardized data), and its $p$ diagonal elements will be $1$s. The second term on the right will be a diagonal matrix with terms less than unity. The first term on the right is the "reduced correlation matrix" and will be equal to the correlation matrix except for its diagonal values, which will be less than unity. These diagonal elements of the reduced correlation matrix are called "communalities":

$h_a^2 = 1 - \psi_a = \sum_j \ell_{aj}\ell_{aj}.$

The sample data $z_{ai}$ will not exactly obey the fundamental equation above, due to sampling errors, inadequacy of the model, and so on. The goal of any analysis of the above model is to find the factors $F_{pi}$ and loadings $\ell_{ap}$ which give a "best fit" to the data. In factor analysis, the best fit is defined as the minimum of the mean square error in the off-diagonal residuals of the correlation matrix:

$\varepsilon^2 = \sum_{a\neq b}\left[\sum_i z_{ai}z_{bi} - \sum_j \ell_{aj}\ell_{bj}\right]^2.$
This is equivalent to minimizing the off-diagonal components of the error covariance which, in the model equations have expected values of zero. This is to be contrasted with principal component analysis which seeks to minimize the mean square error of all residuals.[3] Before the advent of high-speed computers, considerable effort was devoted to finding approximate solutions to the problem, particularly in estimating the communalities by other means, which then simplifies the problem considerably by yielding a known reduced correlation matrix. This was then used to estimate the factors and the loadings. With the advent of high-speed computers, the minimization problem can be solved iteratively with adequate speed, and the communalities are calculated in the process, rather than being needed beforehand. The MinRes algorithm is particularly suited to this problem, but is hardly the only iterative means of finding a solution.
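As an illustration of the iterative approach, scikit-learn's `FactorAnalysis` fits the same model $\Sigma = LL^T + \Psi$ by maximum likelihood (an SVD-based iteration rather than MinRes, but the same underlying model); the data below are simulated purely to show the mechanics:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
p, k, n = 10, 2, 5000
L_true = rng.normal(size=(p, k))
X = (L_true @ rng.normal(size=(k, n))
     + rng.normal(scale=0.5, size=(p, n))).T      # shape (n_samples, p)

fa = FactorAnalysis(n_components=k).fit(X)
L_hat = fa.components_.T        # estimated loadings, shape (p, k)
Psi_hat = fa.noise_variance_    # estimated uniquenesses (error variances)

# The implied covariance L L^T + Psi should be close to the sample covariance
Sigma_hat = L_hat @ L_hat.T + np.diag(Psi_hat)
err = np.abs(Sigma_hat - np.cov(X.T)).max()
print(err)
```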
If the solution factors are allowed to be correlated (as in 'oblimin' rotation, for example), then the corresponding mathematical model uses skew coordinates rather than orthogonal coordinates.
The parameters and variables of factor analysis can be given a geometrical interpretation. The data ($z_{ai}$), the factors ($F_{pi}$) and the errors ($\varepsilon_{ai}$) can be viewed as vectors in an $N$-dimensional Euclidean space (sample space), represented as $z_a$, $F_p$ and $\boldsymbol{\varepsilon}_a$ respectively. Since the data are standardized, the data vectors are of unit length ($\lVert z_a\rVert=1$). The factor vectors define a $k$-dimensional linear subspace (i.e., a hyperplane) in this space, onto which the data vectors are projected orthogonally. This follows from the model equation

$z_a = \sum_p \ell_{ap}F_p + \boldsymbol{\varepsilon}_a$

and the independence of the factors and the errors: $F_p \cdot \boldsymbol{\varepsilon}_a = 0$. The projection of a data vector onto the factor hyperplane is $\hat{z}_a = \sum_p \ell_{ap}F_p$. The factor vectors are constrained to be orthonormal: $F_p \cdot F_q = \delta_{pq}$.
The data vectors $z_a$ have unit length. The elements of the correlation matrix of the data are given by $r_{ab} = z_a \cdot z_b$, the cosine of the angle between $z_a$ and $z_b$. The diagonal elements will be $1$s, and the off-diagonal elements will have absolute values less than or equal to unity. The "reduced correlation matrix" is defined as

$\hat{r}_{ab} = \hat{z}_a \cdot \hat{z}_b.$
The goal of factor analysis is to choose the fitting hyperplane such that the reduced correlation matrix reproduces the correlation matrix as nearly as possible, except for the diagonal elements of the correlation matrix which are known to have unit value. In other words, the goal is to reproduce as accurately as possible the cross-correlations in the data. Specifically, for the fitting hyperplane, the mean square error in the off-diagonal components
$\varepsilon^2 = \sum_{a\neq b}\left(r_{ab}-\hat{r}_{ab}\right)^2$
is to be minimized, and this is accomplished by minimizing it with respect to a set of orthonormal factor vectors. It can be seen that
$r_{ab}-\hat{r}_{ab} = \boldsymbol{\varepsilon}_a \cdot \boldsymbol{\varepsilon}_b.$
The term on the right is just the covariance of the errors. In the model, the error covariance is stated to be a diagonal matrix, and so the above minimization problem will in fact yield a "best fit" to the model: it will yield a sample estimate of the error covariance which has its off-diagonal components minimized in the mean square sense. Since the $\hat{z}_a$ are orthogonal projections of the data vectors, their squared lengths will be less than or equal to the squared length of the data vector, which is unity. These squared lengths are just the diagonal elements of the reduced correlation matrix, the communalities:

$h_a^2 = \lVert\hat{z}_a\rVert^2 = \sum_j \ell_{aj}^2.$
Large values of the communalities will indicate that the fitting hyperplane is rather accurately reproducing the correlation matrix. The mean values of the factors must also be constrained to be zero, from which it follows that the mean values of the errors will also be zero.
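Given fitted loadings, the communalities are just the row sums of squared loadings. A sketch using scikit-learn on simulated, standardized data (the dimensions and noise level are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n, p, k = 2000, 8, 2
X = (rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
     + rng.normal(size=(p, n))).T
X = (X - X.mean(0)) / X.std(0)      # standardize, so diag(corr) = 1

fa = FactorAnalysis(n_components=k).fit(X)
L = fa.components_.T                # loadings, shape (p, k)

# Communality of each variable: h_a^2 = sum_j l_aj^2, the squared length
# of the variable's projection onto the factor hyperplane (at most ~1 here)
h2 = (L ** 2).sum(axis=1)
print(h2.round(2))
```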
Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and group items that are part of unified concepts.[4] The researcher makes no a priori assumptions about relationships among factors.[4]
Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors.[4] CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables.[4] Structural equation modeling approaches can accommodate measurement error and are less restrictive than least-squares estimation.[4] Hypothesized models are tested against actual data, and the analysis would demonstrate loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.[4]
Principal component analysis (PCA) is a widely used method for factor extraction, which is the first phase of EFA.[4] Factor weights are computed to extract the maximum possible variance, with successive factoring continuing until there is no further meaningful variance left.[4] The factor model must then be rotated for analysis.[4]
Canonical factor analysis, also called Rao's canonical factoring, is a different method of computing the same model as PCA, which uses the principal axis method. Canonical factor analysis seeks factors that have the highest canonical correlation with the observed variables. Canonical factor analysis is unaffected by arbitrary rescaling of the data.
Common factor analysis, also called principal factor analysis (PFA) or principal axis factoring (PAF), seeks the fewest factors which can account for the common variance (correlation) of a set of variables.
Image factoring is based on the correlation matrix of predicted variables rather than actual variables, where each variable is predicted from the others using multiple regression.
Alpha factoring is based on maximizing the reliability of factors, assuming variables are randomly sampled from a universe of variables. All other methods assume cases to be sampled and variables fixed.
Factor regression model is a combinatorial model of factor model and regression model; or alternatively, it can be viewed as the hybrid factor model,[5] whose factors are partially known.
Researchers wish to avoid such subjective or arbitrary criteria for factor retention as "it made sense to me". A number of objective methods have been developed to solve this problem, allowing users to determine an appropriate range of solutions to investigate.[6] However these different methods often disagree with one another as to the number of factors that ought to be retained. For instance, the parallel analysis may suggest 5 factors while Velicer's MAP suggests 6, so the researcher may request both 5 and 6-factor solutions and discuss each in terms of their relation to external data and theory.
Horn's parallel analysis (PA):[7] A Monte-Carlo based simulation method that compares the observed eigenvalues with those obtained from uncorrelated normal variables. A factor or component is retained if the associated eigenvalue is bigger than the 95th percentile of the distribution of eigenvalues derived from the random data. PA is among the more commonly recommended rules for determining the number of components to retain,[8] but many programs fail to include this option (a notable exception being R).[9] However, Formann provided both theoretical and empirical evidence that its application might not be appropriate in many cases since its performance is considerably influenced by sample size, item discrimination, and type of correlation coefficient.[10]
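A minimal Monte-Carlo sketch of parallel analysis follows (the simulation count, percentile, and simulated data are illustrative assumptions, not prescriptions):

```python
import numpy as np

def parallel_analysis(X, n_sims=200, percentile=95, seed=0):
    """Horn's parallel analysis: retain components whose correlation-matrix
    eigenvalues exceed the given percentile of eigenvalues obtained from
    uncorrelated normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X.T))[::-1]      # observed, descending
    rand = np.empty((n_sims, p))
    for s in range(n_sims):
        R = np.corrcoef(rng.normal(size=(n, p)).T)        # random-data correlations
        rand[s] = np.linalg.eigvalsh(R)[::-1]
    thresh = np.percentile(rand, percentile, axis=0)      # per-position threshold
    return int(np.sum(obs > thresh))

# Two genuine factors driving 8 variables plus noise:
rng = np.random.default_rng(5)
n, p, k = 500, 8, 2
X = (rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
     + rng.normal(size=(p, n))).T
print(parallel_analysis(X))   # typically suggests 2 factors for such data
```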
Velicer's (1976) MAP test,[11] as described by Courtney (2013),[12] "involves a complete principal components analysis followed by the examination of a series of matrices of partial correlations" (p. 397; note that this quote does not occur in Velicer (1976), and the cited page number falls outside the pages of the citation). The squared correlation for Step "0" (see Figure 4) is the average squared off-diagonal correlation for the unpartialed correlation matrix. On Step 1, the first principal component and its associated items are partialed out, and the average squared off-diagonal correlation of the resulting correlation matrix is computed. On Step 2, the first two principal components are partialed out, and the resulting average squared off-diagonal correlation is again computed. The computations are carried out for $k-1$ steps ($k$ representing the total number of variables in the matrix). Finally, the average squared correlations for all steps are lined up, and the step number that yielded the lowest average squared partial correlation determines the number of components or factors to retain.[11] By this method, components are maintained as long as the variance in the correlation matrix represents systematic variance, as opposed to residual or error variance. Although methodologically akin to principal components analysis, the MAP technique has been shown to perform quite well in determining the number of factors to retain in multiple simulation studies.[13][14] This procedure is made available through SPSS's user interface,[12] as well as the psych package for the R programming language.[15][16]
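The procedure described above can be sketched directly from the correlation matrix. This is an illustrative reimplementation of the MAP idea, not Velicer's or SPSS's exact code:

```python
import numpy as np

def velicer_map(X):
    """Velicer's minimum average partial (MAP) test, sketched from the
    description above: partial out successive principal components and
    track the average squared off-diagonal partial correlation."""
    R = np.corrcoef(X.T)
    p = R.shape[0]
    evals, evecs = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]                # sort eigenpairs descending
    evals, evecs = evals[order], evecs[:, order]
    loadings = evecs * np.sqrt(evals)              # PCA component loadings
    off = ~np.eye(p, dtype=bool)
    avg_sq = [np.mean(R[off] ** 2)]                # step 0: unpartialed matrix
    for m in range(1, p):                          # k-1 steps in total
        A = loadings[:, :m]
        C = R - A @ A.T                            # partial covariance
        d = np.sqrt(np.diag(C))
        P = C / np.outer(d, d)                     # partial correlations
        avg_sq.append(np.mean(P[off] ** 2))
    return int(np.argmin(avg_sq))                  # components to retain

rng = np.random.default_rng(6)
n, p, k = 500, 8, 2
X = (rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
     + rng.normal(size=(p, n))).T
print(velicer_map(X))
```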
Kaiser criterion: The Kaiser rule is to drop all components with eigenvalues under 1.0 – this being the eigenvalue equal to the information accounted for by an average single item.[17] The Kaiser criterion is the default in SPSS and most statistical software but is not recommended when used as the sole cut-off criterion for estimating the number of factors as it tends to over-extract factors.[18] A variation of this method has been created where a researcher calculates confidence intervals for each eigenvalue and retains only factors which have the entire confidence interval greater than 1.0.[19] [20]
The Cattell scree test plots the components as the X-axis and the corresponding eigenvalues as the Y-axis.[21] As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components after the one at the elbow. This rule is sometimes criticised for being amenable to researcher-controlled "fudging": because picking the "elbow" can be subjective (the curve may have multiple elbows or be smooth), the researcher may be tempted to set the cut-off at the number of factors desired by their research agenda.
Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%.
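With PCA's explained-variance ratios in hand, applying such a cut-off is a one-liner; a sketch (the 80% threshold and the simulated data are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n, p, k = 1000, 10, 3
X = (rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
     + rng.normal(size=(p, n))).T

pca = PCA().fit((X - X.mean(0)) / X.std(0))
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching an 80% variance-explained cut-off
n_keep = int(np.searchsorted(cumvar, 0.80) + 1)
print(n_keep)
```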
By placing a prior distribution over the number of latent factors and then applying Bayes' theorem, Bayesian models can return a probability distribution over the number of latent factors. This has been modeled using the Indian buffet process,[22] but can be modeled more simply by placing any discrete prior (e.g. a negative binomial distribution) on the number of components.
The output of PCA maximizes the variance accounted for by the first factor first, then the second factor, etc. A disadvantage of this procedure is that most items load on the early factors, while very few items load on later variables. This makes interpreting the factors by reading through a list of questions and loadings difficult, as every question is strongly correlated with the first few components, while very few questions are strongly correlated with the last few components.
Rotation serves to make the output easier to interpret. By choosing a different basis for the same principal components, that is, choosing different factors to express the same correlation structure, it is possible to create variables that are more easily interpretable.
Rotations can be orthogonal or oblique; oblique rotations allow the factors to correlate.[23] This increased flexibility means that more rotations are possible, some of which may be better at achieving a specified goal. However, this can also make the factors more difficult to interpret, as some information is "double-counted" and included multiple times in different components; some factors may even appear to be near-duplicates of each other.
Two broad classes of orthogonal rotations exist: those that look for sparse rows (where each row is a case, i.e. subject), and those that look for sparse columns (where each column is a variable).
It can be difficult to interpret a factor structure when each variable is loading on multiple factors. Small changes in the data can sometimes tip a balance in the factor rotation criterion so that a completely different factor rotation is produced. This can make it difficult to compare the results of different experiments. This problem is illustrated by a comparison of different studies of world-wide cultural differences. Each study has used different measures of cultural variables and produced a differently rotated factor analysis result. The authors of each study believed that they had discovered something new, and invented new names for the factors they found. A later comparison of the studies found that the results were rather similar when the unrotated results were compared. The common practice of factor rotation has obscured the similarity between the results of the different studies.[24]
Higher-order factor analysis is a statistical method consisting of repeating steps factor analysis – oblique rotation – factor analysis of rotated factors. Its merit is to enable the researcher to see the hierarchical structure of studied phenomena. To interpret the results, one proceeds either by post-multiplying the primary factor pattern matrix by the higher-order factor pattern matrices (Gorsuch, 1983) and perhaps applying a Varimax rotation to the result (Thompson, 1990) or by using a Schmid-Leiman solution (SLS, Schmid & Leiman, 1957, also known as Schmid-Leiman transformation) which attributes the variation from the primary factors to the second-order factors.
See also: Principal component analysis and Exploratory factor analysis.
Factor analysis is related to principal component analysis (PCA), but the two are not identical.[25] There has been significant controversy in the field over differences between the two techniques. PCA can be considered as a more basic version of exploratory factor analysis (EFA) that was developed in the early days prior to the advent of high-speed computers. Both PCA and factor analysis aim to reduce the dimensionality of a set of data, but the approaches taken to do so are different for the two techniques. Factor analysis is clearly designed with the objective to identify certain unobservable factors from the observed variables, whereas PCA does not directly address this objective; at best, PCA provides an approximation to the required factors.[26] From the point of view of exploratory analysis, the eigenvalues of PCA are inflated component loadings, i.e., contaminated with error variance.[27] [28] [29] [30] [31] [32]
Whilst EFA and PCA are treated as synonymous techniques in some fields of statistics, this has been criticised.[33] [34] Factor analysis "deals with the assumption of an underlying causal structure: [it] assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on these observed variables".[35] In contrast, PCA neither assumes nor depends on such an underlying causal relationship. Researchers have argued that the distinctions between the two techniques may mean that there are objective benefits for preferring one over the other based on the analytic goal. If the factor model is incorrectly formulated or the assumptions are not met, then factor analysis will give erroneous results. Factor analysis has been used successfully where adequate understanding of the system permits good initial model formulations. PCA employs a mathematical transformation to the original data with no assumptions about the form of the covariance matrix. The objective of PCA is to determine linear combinations of the original variables and select a few that can be used to summarize the data set without losing much information.[36]
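The contrast can be seen numerically: factor analysis estimates a separate error variance per variable, while PCA has no error term and folds measurement noise into the components. A sketch with one latent factor and a deliberately noisy third indicator (all values are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(9)
n = 5000
f = rng.normal(size=n)
# Three indicators of one factor; the third has much larger measurement error
X = np.column_stack([f + 0.3 * rng.normal(size=n),
                     f + 0.3 * rng.normal(size=n),
                     f + 2.0 * rng.normal(size=n)])

# FA models per-variable error variance explicitly ...
fa = FactorAnalysis(n_components=1).fit(X)
# ... whereas PCA lets the high-variance noisy variable dominate its component
pca = PCA(n_components=1).fit(X)

print(fa.noise_variance_.round(2))         # large uniqueness for variable 3
print(np.abs(pca.components_[0]).round(2)) # largest weight on variable 3
```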
Fabrigar et al. (1999) address a number of reasons used to suggest that PCA is not equivalent to factor analysis:
Factor analysis takes into account the random error that is inherent in measurement, whereas PCA fails to do so. This point is exemplified by Brown (2009),[37] who indicated that, in respect to the correlation matrices involved in the calculations:
For this reason, Brown (2009) recommends using factor analysis when theoretical ideas about relationships between variables exist, whereas PCA should be used if the goal of the researcher is to explore patterns in their data.
The differences between PCA and factor analysis (FA) are further illustrated by Suhr (2009):
Charles Spearman was the first psychologist to discuss common factor analysis[38] and did so in his 1904 paper.[39] It provided few details about his methods and was concerned with single-factor models.[40] He discovered that school children's scores on a wide variety of seemingly unrelated subjects were positively correlated, which led him to postulate that a single general mental ability, or g, underlies and shapes human cognitive performance.
The initial development of common factor analysis with multiple factors was given by Louis Thurstone in two papers in the early 1930s,[41][42] summarized in his 1935 book, The Vectors of Mind.[43] Thurstone introduced several important factor analysis concepts, including communality, uniqueness, and rotation.[44] He advocated for "simple structure", and developed methods of rotation that could be used as a way to achieve such structure.
In Q methodology, William Stephenson, a student of Spearman, distinguished between R factor analysis, oriented toward the study of inter-individual differences, and Q factor analysis, oriented toward subjective intra-individual differences.[45][46]
Raymond Cattell was a strong advocate of factor analysis and psychometrics and used Thurstone's multi-factor theory to explain intelligence. Cattell also developed the scree test and similarity coefficients.
Factor analysis is used to identify "factors" that explain a variety of results on different tests. For example, intelligence research found that people who get a high score on a test of verbal ability are also good on other tests that require verbal abilities. Researchers explained this by using factor analysis to isolate one factor, often called verbal intelligence, which represents the degree to which someone is able to solve problems involving verbal skills.
Factor analysis in psychology is most often associated with intelligence research. However, it also has been used to find factors in a broad range of domains such as personality, attitudes, beliefs, etc. It is linked to psychometrics, as it can assess the validity of an instrument by finding if the instrument indeed measures the postulated factors.
Factor analysis is a frequently used technique in cross-cultural research. It serves the purpose of extracting cultural dimensions. The best known cultural dimensions models are those elaborated by Geert Hofstede, Ronald Inglehart, Christian Welzel, Shalom Schwartz and Michael Minkov. A popular visualization is Inglehart and Welzel's cultural map of the world.
In an early study from 1965, political systems around the world were examined via factor analysis to construct related theoretical models and research, compare political systems, and create typological categories.[49] For these purposes, the study identified seven basic political dimensions, which are related to a wide variety of political behaviour: Access, Differentiation, Consensus, Sectionalism, Legitimation, Interest, and Leadership.
Other political scientists explore the measurement of internal political efficacy using four new questions added to the 1988 National Election Study. Factor analysis is here used to find that these items measure a single concept distinct from external efficacy and political trust, and that these four questions provided the best measure of internal political efficacy up to that point in time.[50]
The basic steps are:
The data collection stage is usually done by marketing research professionals. Survey questions ask the respondent to rate a product sample or descriptions of product concepts on a range of attributes. Anywhere from five to twenty attributes are chosen. They could include things like: ease of use, weight, accuracy, durability, colourfulness, price, or size. The attributes chosen will vary depending on the product being studied. The same question is asked about all the products in the study. The data for multiple products is coded and input into a statistical program such as R, SPSS, SAS, Stata, STATISTICA, JMP, and SYSTAT.
The analysis will isolate the underlying factors that explain the data using a matrix of associations.[51] Factor analysis is an interdependence technique. The complete set of interdependent relationships is examined. There is no specification of dependent variables, independent variables, or causality. Factor analysis assumes that all the rating data on different attributes can be reduced down to a few important dimensions. This reduction is possible because some attributes may be related to each other. The rating given to any one attribute is partially the result of the influence of other attributes. The statistical algorithm deconstructs the rating (called a raw score) into its various components and reconstructs the partial scores into underlying factor scores. The degree of correlation between the initial raw score and the final factor score is called a factor loading.
Factor analysis has also been widely used in physical sciences such as geochemistry, hydrochemistry,[52] astrophysics and cosmology, as well as biological sciences, such as ecology, molecular biology, neuroscience and biochemistry.
In groundwater quality management, it is important to relate the spatial distribution of different chemical parameters to different possible sources, which have different chemical signatures. For example, a sulfide mine is likely to be associated with high levels of acidity, dissolved sulfates and transition metals. These signatures can be identified as factors through R-mode factor analysis, and the location of possible sources can be suggested by contouring the factor scores.[53]
In geochemistry, different factors can correspond to different mineral associations, and thus to mineralisation.[54]
Factor analysis can be used for summarizing high-density oligonucleotide DNA microarrays data at probe level for Affymetrix GeneChips. In this case, the latent variable corresponds to the RNA concentration in a sample.[55]
Factor analysis has been implemented in several statistical analysis programs since the 1980s, including the Python module scikit-learn.[56]
[1] Jöreskog, Karl Gustav (1983). "Factor Analysis as an Errors-in-Variables Model". In Principals of Modern Psychological Measurement, pp. 185–196. Hillsdale: Erlbaum. ISBN 0-89859-277-1.