A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA). As he described FA, it sounded much like the ubiquitous principal component analysis (PCA) approach, which itself sometimes goes by other names when applied in different contexts. I asked how FA differs from PCA. Apparently I opened a can of worms – PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences. Adding to the confusion, I’m told SPSS uses PCA as its default FA method. Even Wikipedia’s discussion of this specific misconception leaves something to be desired. While there are variations of FA, I want to take my own look at vanilla FA and PCA to get to the fundamental difference in their machinery.
FA and PCA are both dimensionality reduction techniques. Both take the original data and project it onto a smaller subspace – the difference is how that subspace is chosen. In practice, one usually must also pick a basis for the subspace that allows the output to be interpreted, but that’s not the point here. So how do the two choices of subspace differ?
Suppose you measure $n$ features, and perform this observation process $m$ times. We will represent the observations for each feature together as a column in the tall, skinny data matrix $X \in \mathbb{R}^{m \times n}$. We assume these features are linearly dependent, so that the columns come from some smaller $k$-dimensional subspace plus some random noise. Specifically, the $i$th column of $X$ may be written as $x_i = W h_i + \epsilon_i$, where $h_i \in \mathbb{R}^k$ and $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I)$. Note the variance in the noise does not have to be uniform across all features. More succinctly, we keep everything in terms of matrices: $X = WH + E$. In FA, we would say the columns of $W \in \mathbb{R}^{m \times k}$ are our factors and the entries of $H$ are our loadings. The columns of $W$ just form a basis for the subspace – let’s call it $S$ – and the loadings are the coordinates of the observed data vectors in this basis. Also, let’s assume the columns of $W$ are orthonormal. Letting $P$ be the orthogonal projection onto $S$, we could also write $P = W W^\top$. With this notation, let’s look at how $S$ is chosen in PCA versus FA.
PCA picks $S$ to minimize the Euclidean distance between the observed data $x_i$ and the projected data $P x_i$. While this is simply accomplished via an SVD, we could write such a choice for $S$ as a minimization problem. We are picking $S$ according to the objective
$$\min_{S} \sum_{i=1}^{n} \| x_i - P x_i \|_2^2.$$
In terms of the Gram matrix $G$ of the error vectors, with entries $G_{ij} = \langle (I - P) x_i, (I - P) x_j \rangle$, we are minimizing its diagonal (that is, its trace). FA in contrast cares about the relationships among the data rather than the absolute error; for each $i \neq j$, FA chooses $S$ so that the inner product $\langle P x_i, P x_j \rangle$ is close to $\langle x_i, x_j \rangle$. Using that $P$ is an orthogonal projection and a bit of algebra, the gap between these inner products is $\langle x_i, x_j \rangle - \langle P x_i, P x_j \rangle = \langle (I - P) x_i, (I - P) x_j \rangle$, so FA chooses $S$ according to the objective
$$\min_{S} \sum_{i \neq j} \langle (I - P) x_i, (I - P) x_j \rangle^2.$$
Going back to the Gram matrix $G$, we are now minimizing everything in the off-diagonal. I really like how FA and PCA can be phrased in terms of this Gram matrix of the error; PCA aims to make the norms of the error vectors small while FA aims to make these error vectors incoherent.
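To make the Gram matrix picture concrete, here is a minimal NumPy sketch. It builds the Gram matrix of the error vectors for a given subspace and evaluates the diagonal (PCA-style) and off-diagonal (FA-style) objectives. The function names, the random test data, and the sizes `m`, `n`, `k` are my own assumptions for illustration; the data is not centered, in keeping with the model above.

```python
import numpy as np

def error_gram(X, W):
    """Gram matrix of the error vectors x_i - P x_i, where P = W W^T
    and the columns of W are assumed orthonormal."""
    R = X - W @ (W.T @ X)          # residuals after projecting each column
    return R.T @ R

def pca_subspace(X, k):
    """Orthonormal basis of the rank-k PCA subspace via the SVD."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def diag_objective(G):
    """Sum of squared error norms (what PCA minimizes)."""
    return np.trace(G)

def offdiag_objective(G):
    """Sum of squared off-diagonal entries (the FA-style objective)."""
    return np.sum(G**2) - np.sum(np.diag(G)**2)

rng = np.random.default_rng(0)
m, n, k = 50, 6, 2                 # hypothetical sizes: observations, features, subspace dim
X = rng.normal(size=(m, n))        # arbitrary data, columns are features

W_pca = pca_subspace(X, k)
W_rand, _ = np.linalg.qr(rng.normal(size=(m, k)))   # some other k-dim subspace

G_pca = error_gram(X, W_pca)
G_rand = error_gram(X, W_rand)

# Check the identity <x_i, x_j> - <P x_i, P x_j> = G_ij for the PCA subspace.
PX = W_pca @ (W_pca.T @ X)
identity_gap = np.max(np.abs((X.T @ X - PX.T @ PX) - G_pca))

print("diagonal objective, PCA vs random:", diag_objective(G_pca), diag_objective(G_rand))
print("off-diagonal objective, PCA vs random:", offdiag_objective(G_pca), offdiag_objective(G_rand))
print("largest deviation from the inner-product identity:", identity_gap)
```

The PCA subspace always wins on the diagonal objective, but nothing forces it to win on the off-diagonal one, which is the sense in which the two methods optimize different parts of the same matrix.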
Through this lens, consider what may happen if you have a nice underlying model but more noise in some measurements than others. Specifically, let’s measure three features over a series of observations. With no noise, all three measurements across any given observation are set to be identical. That is, every noiseless feature column is the same vector, drawn at random from the unit sphere in $\mathbb{R}^m$, with $m$ the number of observations. However, each feature gets its own noise variance. In other words, the data is coming from a one-dimensional subspace, and there is some noise associated with measuring each feature, but the noise in the third feature is much larger than in the first two. Let’s perform PCA and FA reducing to two dimensions and plot the results.
The image above shows how PCA does a decent job matching the original data with the noise included. While the underlying model varies along just a single dimension, the high measurement noise creates another dimension with high variance, which PCA also captures. In contrast, FA reproduces points much closer to the original data without the measurement noise. The noisy third feature doesn’t affect FA as much because FA aims to maintain similar inner products when comparing raw data with projected data. The data without measurement noise is strongly coherent – identical, in fact. The extra noise is Gaussian and thus relatively incoherent, so it doesn’t strongly influence which second dimension is brought into FA.
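A rough way to reproduce the flavor of this experiment with NumPy alone is sketched below. Several things here are assumptions of mine rather than the setup behind the plot: the number of observations `m`, the noise levels in `sigmas`, and the seed. More importantly, the second subspace is not the output of an FA routine. It is a hand-picked stand-in, the span of the two quieter features, which happens to zero out every off-diagonal entry of the error Gram matrix in this three-feature, two-dimension case because all of the residual lands in a single feature.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200                                    # hypothetical number of observations
sigmas = np.array([0.005, 0.005, 0.05])    # hypothetical noise levels; third is much larger

# One shared noiseless column, drawn uniformly from the unit sphere in R^m.
w = rng.normal(size=m)
w /= np.linalg.norm(w)
clean = np.column_stack([w, w, w])

# Broadcasting scales the noise in column j by sigmas[j].
X = clean + rng.normal(size=(m, 3)) * sigmas

def project(X, W):
    """Project the columns of X onto the span of the orthonormal columns of W."""
    return W @ (W.T @ X)

# PCA: top two left singular vectors (no centering, as in the model above).
U, _, _ = np.linalg.svd(X, full_matrices=False)
X_pca = project(X, U[:, :2])

# Stand-in for the off-diagonal objective: span of the two quieter features.
Q, _ = np.linalg.qr(X[:, :2])
X_alt = project(X, Q)

# Error Gram matrix for the stand-in subspace and its off-diagonal part.
R = X - X_alt
G_alt = R.T @ R
offdiag_alt = G_alt - np.diag(np.diag(G_alt))

# Distance from each reconstruction to the noiseless data.
err_pca = np.linalg.norm(X_pca - clean)
err_alt = np.linalg.norm(X_alt - clean)

print("largest off-diagonal entry (stand-in):", np.max(np.abs(offdiag_alt)))
print("distance to noiseless data, PCA:", err_pca)
print("distance to noiseless data, stand-in:", err_alt)
```

With a noise gap this large, the PCA reconstruction stays close to the noisy data, while the subspace chosen for incoherent errors lands closer to the noiseless data. A real FA implementation would choose its subspace by optimization rather than by inspection, so treat this only as an illustration of the mechanism.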
By expressing PCA and FA in terms of the Gram matrices for error vectors after projecting original data onto lower-dimensional subspaces, we can explicitly describe what sets these two methods apart. FA maintains relationships among data vectors and implicitly requires some underlying model to be useful. PCA is simpler and just matches the original data as closely as possible.