## What’s with factor analysis?

A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA).  As he described FA, it sounded much like the ubiquitous principal component analysis (PCA), a technique that itself goes by other names in different contexts.  I asked how FA differs from PCA.  Apparently I opened a can of worms: PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences.  Adding to the confusion, I’m told SPSS uses PCA as its default FA method.  Even Wikipedia’s discussion of this specific misconception leaves something to be desired.  While there are variations of FA, I want to take my own look at vanilla FA and PCA to get to the fundamental difference in their machinery.

Figure: Each point represents an observation in $\mathbb{R}^3$.  The original data is one dimensional (red on right).  Gaussian measurement noise is added with a different variance in each dimension (red on left).  We project the data onto a two dimensional subspace according to PCA (green in both images) and factor analysis (blue in both images).  Observe how PCA captures much of the variability from the very noisy third measurement while factor analysis more closely captures the original non-noisy data.

FA and PCA are both dimensionality reduction techniques.  Both take the original data and project it onto a smaller subspace; the difference is how that subspace is chosen.  In practice, one usually must also pick a basis for the subspace which allows the output to be interpreted, but that’s not the point here.

Suppose you measure $N$ features, and perform this observation process $M$ times.  We will represent the $M$ observations for each feature measurement together as a column in the tall, skinny $M\times N$ data matrix $X$.  We assume these features are linearly dependent so that the columns come from some smaller $P$-dimensional subspace plus some random noise.  Specifically, the $n$th column of $X$ may be written as $X_n = \sum_{p=1}^P \ell_{pn} F_p +\epsilon_n$ where $F_p \in \mathbb{R}^M$ and $\epsilon_n \sim N(0,\sigma^2_nI)$.  Note the variance in the noise does not have to be uniform across all features.

More succinctly, we keep everything in terms of matrices: $X= FL + E$.  In FA, we would say the $P$ columns of $F$ are our factors and the entries of $L$ are our loadings.  The columns of $F$ just form a basis for the subspace (let’s call it $S$) and the loadings are the coordinates of the observed data vectors in this basis.  Also, let’s assume the columns of $F$ are orthonormal.  Letting $\Phi_S = FF^T$ be the orthogonal projection onto $S$, we could also write $X = \Phi_S X + E$.  With this notation, let’s look at how $S$ is chosen in PCA versus FA.
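To make the shapes concrete, here is a minimal NumPy sketch of the model $X = FL + E$.  The sizes, the random seed, and the noise levels in `sigma` are illustrative assumptions, not values taken from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 200, 3, 1  # observations, features, subspace dimension (illustrative)

# Orthonormal basis F (M x P) for the subspace S, taken from a QR factorization.
F, _ = np.linalg.qr(rng.standard_normal((M, P)))

# Loadings L (P x N): coordinates of each feature column in the basis F.
L = rng.standard_normal((P, N))

# Noise E (M x N): column n has standard deviation sigma[n] (assumed values).
sigma = np.array([0.01, 0.01, 0.2])
E = rng.standard_normal((M, N)) * sigma

X = F @ L + E      # the model X = FL + E
Phi_S = F @ F.T    # orthogonal projection onto S (M x M)
```

Because the columns of `F` are orthonormal, `Phi_S` is symmetric and satisfies `Phi_S @ Phi_S == Phi_S` up to rounding, which is what the algebra later on relies on.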

PCA picks $S$ to minimize the Euclidean distance between the observed data $X$ and the projected data $\Phi_S X$.  While this is simply accomplished via an SVD, we could write such a choice for $S$ as a minimization problem.  We are picking $S$ according to the objective $\min_S \sum_{n=1}^N \|X_n - \Phi_SX_n\|^2 = \min_S \sum_{n=1}^N \epsilon_n^T \epsilon_n$, where $\epsilon_n = X_n - \Phi_S X_n$ is the residual of the $n$th column.
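As a rough sketch of the SVD route: in this column-oriented formulation, the minimizing $P$-dimensional subspace is spanned by the top $P$ left singular vectors of $X$.  The helper names `pca_subspace` and `residual_energy` are made up for illustration, and the data is not centered here because the objective above does not center it (textbook PCA usually does).

```python
import numpy as np

def pca_subspace(X, P):
    """Orthonormal basis (M x P) spanned by the top-P left singular vectors of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :P]

def residual_energy(X, F):
    """Sum over columns of ||X_n - Phi_S X_n||^2, with Phi_S = F F^T."""
    R = X - F @ (F.T @ X)
    return np.sum(R ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))   # synthetic data, for illustration only

F_pca = pca_subspace(X, 2)
F_rand, _ = np.linalg.qr(rng.standard_normal((50, 2)))  # an arbitrary competitor

print("PCA subspace residual:   ", residual_energy(X, F_pca))
print("random subspace residual:", residual_energy(X, F_rand))
```

The PCA residual equals the sum of the squared singular values that were discarded, so no other two-dimensional subspace can do better on this objective.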

In terms of the Gram matrix $E^T E$, we are minimizing the sum of its diagonal entries (its trace).  FA in contrast cares about the relationships among the data rather than the absolute error; for each $n\neq n'$, FA chooses $S$ so that the inner product $X_n^TX_{n'}$ is close to $(\Phi_S X_n)^T\Phi_S X_{n'}$.  Using that $\Phi_S$ is an orthogonal projection and a bit of algebra, FA chooses $S$ according to the objective $\min_S \sum_{n\neq n'}(X_n^TX_{n'}-(\Phi_SX_n)^T\Phi_SX_{n'})^2 = \min_S \sum_{n\neq n'} (\epsilon_n^T \epsilon_{n'})^2$.
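For completeness, the algebra can be spelled out.  Write $\epsilon_n = (I - \Phi_S)X_n$ for the residual of the $n$th column, and use that an orthogonal projection satisfies $\Phi_S^T = \Phi_S$ and $\Phi_S^2 = \Phi_S$ (so the same holds for $I - \Phi_S$):

$$
\begin{aligned}
X_n^T X_{n'} - (\Phi_S X_n)^T \Phi_S X_{n'}
&= X_n^T X_{n'} - X_n^T \Phi_S X_{n'} \\
&= X_n^T (I - \Phi_S) X_{n'} \\
&= \big((I - \Phi_S) X_n\big)^T (I - \Phi_S) X_{n'} \\
&= \epsilon_n^T \epsilon_{n'}.
\end{aligned}
$$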

Going back to the Gram matrix $E^T E$, we are now minimizing the sum of squares of its off-diagonal entries.  I really like how FA and PCA can be phrased in terms of this Gram matrix of the error; PCA aims to make the norms of the error vectors small while FA aims to make these error vectors incoherent.
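The two objectives can be read off the same residual Gram matrix, which the following sketch tries to make explicit.  The function names are hypothetical and the data is random; the only point is which entries of $E^TE$ each objective looks at.

```python
import numpy as np

def residual_gram(X, F):
    """Gram matrix E^T E of the residuals E = X - F F^T X (F has orthonormal columns)."""
    E = X - F @ (F.T @ X)
    return E.T @ E

def pca_objective(G):
    """Sum of the diagonal: total squared norm of the error vectors."""
    return np.trace(G)

def fa_objective(G):
    """Sum of squared off-diagonal entries: squared inner products between distinct error vectors."""
    off = G - np.diag(np.diag(G))
    return np.sum(off ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4))                    # synthetic data
F, _ = np.linalg.qr(rng.standard_normal((100, 2)))   # some candidate subspace

G = residual_gram(X, F)
print("PCA objective (diagonal):    ", pca_objective(G))
print("FA objective (off-diagonal): ", fa_objective(G))
```

Note that `fa_objective` counts each pair $(n, n')$ twice, once in each order, which matches the sum over $n \neq n'$ in the text.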

Through this lens, consider what may happen if you have a nice underlying model but more noise in some measurements than others.  Specifically, let’s measure $N=3$ features and take $M=200$ observations.  With no noise, all three measurements across any given observation are set to be identical.  That is, $X_1 = X_2 = X_3$, a single vector drawn at random from the unit sphere in $\mathbb{R}^{200}$.  However, we will set $\sigma_1 = \sigma_2 = 0.01$ and $\sigma_3 = 0.2$.  In other words, the data is coming from a one dimensional subspace, and there is some noise associated with measuring each feature, but the noise in the third feature is much larger than in the first two.  Let’s perform PCA and FA reducing to two dimensions and plot the results.

The figure above shows how PCA does a decent job matching the original data with the noise included.  While the underlying model varies along just a single dimension, the high measurement noise creates another dimension with high variance which PCA also captures.  In contrast, FA reproduces points much closer to the original data without the measurement noise.  The noisy third feature doesn’t affect FA as much because FA aims to maintain similar inner products when comparing raw data with projected data.  The data without measurement noise is strongly coherent (identical, in fact).  The extra noise is Gaussian and thus relatively incoherent, so it doesn’t strongly influence which second dimension is brought into FA.
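Here is a minimal sketch of the data generation and the PCA half of this experiment, treating the $\sigma_n$ as standard deviations as in the model above.  The seed and variable names are assumptions, and this is not necessarily how the figure was produced.  The FA fit is left out because the text does not pin down a specific FA algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 200, 3, 2                    # observations, features, reduced dimension
sigma = np.array([0.01, 0.01, 0.2])    # per-feature noise standard deviations

# Noise-free data: one random unit vector in R^200, repeated in all three columns.
x = rng.standard_normal(M)
x /= np.linalg.norm(x)
X_clean = np.column_stack([x, x, x])

# Observed data: add Gaussian noise, much larger in the third feature.
X = X_clean + rng.standard_normal((M, N)) * sigma

# PCA (uncentered, matching the objective in the text): top-P left singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
F_pca = U[:, :P]
X_pca = F_pca @ (F_pca.T @ X)

err_noisy = np.linalg.norm(X - X_pca)        # distance to what was measured
err_clean = np.linalg.norm(X_clean - X_pca)  # distance to the underlying model
print(f"PCA vs noisy data: {err_noisy:.3f}")
print(f"PCA vs clean data: {err_clean:.3f}")
```

With these settings the PCA projection sits much closer to the noisy measurements than to the noise-free data, which is the behaviour described above: the second retained dimension is spent on the noise in the third feature.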

By expressing PCA and FA in terms of the Gram matrices for error vectors after projecting original data onto lower-dimensional subspaces, we can explicitly describe what sets these two methods apart.  FA maintains relationships among data vectors and implicitly requires some underlying model to be useful.  PCA is simpler and just matches the original data as closely as possible.