Translation please

No interesting data.  No scripts.  No pictures.  Instead I’m going to attack one of my pet peeves.  I was recently working on a project, doing a simple regression, when I wanted to look up something about the variance of an estimated parameter.  I found myself on the Wikipedia page for variance inflation factor.  There was a general description, an equation, but no intuition.  I checked several other top hits in a Google search, and it was the same story.  By intuition, I mean an explanation at a level where there is no need to remember an equation.  Another example is when we divide by one less than the sample size when estimating the variance of a population from a sample, but why?  Answer: “because that makes it an unbiased estimator.”  Okay, but why?  Answer: “because you can calculate the expected value and see it works,” or “that’s how many degrees of freedom you have available for your estimate.”  Those are the answers often given to students, but these still just describe the rule without a fundamental explanation.   These stats concepts can all be expressed in terms of subspaces, orthogonal complements, projections…you know…linear algebra stuff.  This is the language for visualization and intuition (in my opinion).  So, returning to my pet peeve, I hate problems that are confusing only because they are presented in an unfamiliar language.  Today I’m going to take ordinary least squares (OLS) from a stats perspective (degrees of freedom, sums of squares, R^2, prediction intervals, variance inflation factors, etc…) and make it all more intuitive in the language of linear algebra.  OLS is as old as dirt, so I’m sure this all exists somewhere, but it’s certainly less common than the standard presentation, and I may as well put it here for my own reference.

Let’s take a model Y = X \beta + \varepsilon where X is M \times N with the first column of X the all 1’s vector.  We’ll index from 1 to M and 0 to N-1 so that the 0th column will be used to estimate the intercept \beta_0.  We assume the error or noise \varepsilon \sim N(0,I \sigma^2) and will see why the entries are assumed normal and iid in a bit.  To get the OLS solution, we want the closest \hat{y} to y in the column space of X and solve \hat{y}=X\hat{\beta}.  This means the error y- X\hat{\beta} should be orthogonal to the columns of X.  In other words 0=X^T(y- X\hat{\beta}).  Solving for \hat{\beta}, this allows us to estimate our parameters by

 \hat{\beta} = (X^TX)^{-1}X^Ty.
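As a quick sanity check on that formula, here is a minimal numpy sketch on made-up data (every number below is hypothetical), comparing the normal-equation estimate to numpy’s built-in least-squares solver.  In real code, np.linalg.lstsq (or a QR factorization) is the better-conditioned way to get the same \hat{\beta}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: M = 50 observations, an intercept column plus two predictors.
M = 50
X = np.column_stack([np.ones(M), rng.normal(size=M), rng.normal(size=M)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=M)

# Normal-equation estimate: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from the library's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```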

Our vector of predicted values is then

\hat{y}  =X\hat{\beta} = X(X^TX)^{-1}X^Ty

which is the projection of y onto the column space of X.  Our residuals are

\hat{\varepsilon} = y - \hat{y} = (I- X(X^TX)^{-1}X^T)y

which is the projection of y onto the orthogonal complement of the column space of X.
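To make the projection picture concrete, here is a small sketch (again with made-up data) that forms the projection matrix X(X^TX)^{-1}X^T explicitly and checks that the residuals really are orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 50, 3
X = np.column_stack([np.ones(M), rng.normal(size=(M, N - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=M)

# Projection (hat) matrix onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y                    # projection of y onto col(X)
resid = (np.eye(M) - H) @ y      # projection onto the orthogonal complement

# Residuals are orthogonal to the columns of X (up to round-off).
print(np.max(np.abs(X.T @ resid)))
```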

Let’s turn our attention to the error and residuals.  We would like to estimate the distribution from which \varepsilon was drawn based on \hat{\varepsilon}.  Note if we knew the true parameters \beta, the error \varepsilon = y-X\beta would be drawn from a rotation invariant distribution in \mathbb{R}^M, namely N(0,I\sigma^2) by our model assumptions.  Letting P_{X^c} be the projection onto the orthogonal complement of the column space of X, we see P_{X^c}\varepsilon = P_{X^c}y-P_{X^c}(X\beta) = P_{X^c}y = \hat{\varepsilon}.  So instead of drawing from a normal distribution in \mathbb{R}^M, we drew \hat{\varepsilon} from that distribution projected onto the orthogonal complement of the column space of X, an M-N dimensional subspace.  Intuitively, this is still a normal distribution, but its expected squared length has decreased.  The question is by how much.  It’s not too hard to see \sigma^2 = E[||\varepsilon||^2]/M, and then by rotation invariance, we expect the squared norm of \hat{\varepsilon} to be a fraction of the squared norm of \varepsilon according to what fraction of the original dimension we are in: E[||\varepsilon||^2]=(M/(M-N))E[||\hat{\varepsilon}||^2].  Putting this all together, our unbiased estimate is

\hat{\sigma}^2 = ||\hat{\varepsilon}||^2/(M-N).

So there’s where a statistician would be throwing around “degrees of freedom.”  We divide by M-N degrees of freedom because we’ve projected our error vector onto an M-N dimensional subspace.
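If you’d rather not take the rotation-invariance argument on faith, a quick Monte Carlo check makes the same point: hold X fixed, redraw the noise many times, and average ||\hat{\varepsilon}||^2/(M-N).  The sketch below (arbitrary made-up design and \sigma) does exactly that and recovers \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, sigma = 30, 4, 0.7
X = np.column_stack([np.ones(M), rng.normal(size=(M, N - 1))])
beta = rng.normal(size=N)
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix; fixed because X is fixed

# Redraw the noise many times and average ||residual||^2 / (M - N).
estimates = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=M)
    resid = y - H @ y
    estimates.append(resid @ resid / (M - N))

print(np.mean(estimates), sigma**2)   # the two should be close
```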

Let’s take a closer look at our parameter estimates \hat{\beta}.  We see \hat{\beta} = (X^TX)^{-1}X^Ty is nothing more than \beta + (X^TX)^{-1}X^T\varepsilon.  Thanks to our mean zero and normal assumptions on \varepsilon yet again, this has expected value \beta and is normal, though its entries are almost certainly not iid.  In fact, we can calculate

\mbox{cov}[\hat{\beta}] = (X^TX)^{-1}X^T \mbox{cov}[\varepsilon][(X^TX)^{-1}X^T]^T = (X^TX)^{-1}\sigma^2

which we estimate by substituting \hat{\sigma}^2 for \sigma^2.  With our parameters normally distributed and these estimates for their means and variances, we can find confidence intervals or do hypothesis tests for the parameters.
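Here is a short sketch of that recipe on simulated data: estimate \hat{\sigma}^2 from the residuals, plug it into (X^TX)^{-1}\hat{\sigma}^2 for the standard errors, and build 95% intervals from the t distribution with M-N degrees of freedom.  (scipy is only used for the t quantile.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
M, N = 40, 3
X = np.column_stack([np.ones(M), rng.normal(size=(M, N - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=M)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (M - N)

# Estimated cov[beta_hat] = (X^T X)^{-1} * sigma2_hat; standard errors from its diagonal.
se = np.sqrt(np.diag(XtX_inv) * sigma2_hat)

# 95% confidence intervals using the t distribution with M - N degrees of freedom.
t_crit = stats.t.ppf(0.975, df=M - N)
for bj, sj in zip(beta_hat, se):
    print(f"{bj:+.3f}  [{bj - t_crit * sj:+.3f}, {bj + t_crit * sj:+.3f}]")
```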

With \mbox{cov}[\hat{\beta}] = (X^TX)^{-1}\hat{\sigma}^2, this says something about how X may affect our parameter estimates.  If the columns of X are such that X^TX is ill-conditioned, the variance for some of our parameter estimates will be large.  This is what statisticians call multicollinearity – which is a name I can get behind, as X^TX tends to be ill-conditioned when some columns in X are nearly linear combinations of others.  One statistic often mentioned is the variance inflation factor (VIF).  This is typically presented in terms of a statistic obtained by regressing a column of X against the remaining columns, but for the jth parameter (j \neq 0), this algebraically translates to

\mbox{VIF}_j = [X^TP_{1^c}X][j,j] * [(X^T X)^{-1}][j,j]

where P_{1^c} is the projection onto the orthogonal complement of the all-ones column.  Here I am using the notation [j,j] to represent the entry in the jth row and jth column of the given matrix.  The P_{1^c} matrix subtracts the mean from each column of X, so X^TP_{1^c}X is the centered cross-product matrix, i.e. proportional to the sample covariance matrix of the columns of X.  If the columns of X were orthogonal, for j\neq 0, we would have \mbox{var }\hat{\beta_j} = \hat{\sigma}^2[(X^T X)^{-1}][j,j] =\hat{\sigma}^2/[X^TP_{1^c}X][j,j], and \mbox{VIF}_j =1.  Thus the VIF is \geq 1 and does as its name suggests – it’s a ratio of the estimated variance of \hat{\beta_j} over the variance estimate we would get if the columns of X were orthogonal.  It measures how the variance is “inflated” due to the non-orthogonality of the columns in X.

One aspect of this still feels a little squishy.  Nothing accounts for the all-ones vector.  Indeed the VIF for the \beta_0 term (the intercept) usually is not defined and would equal 0 according to what I have above.  It feels to me that the variance for the \beta_0 estimate would be “inflated” if it is nearly in the span of the other columns.  In fact, I first (incorrectly) calculated \mbox{VIF}_j = [X^TX][j,j] * [(X^T X)^{-1}][j,j], which would fix this issue, but you lose the interpretation of the VIF as a ratio of variances.  It seems the VIF is mainly used as a rule of thumb anyway, so I guess it depends on what you care about.  If your intercept estimate is really important, you may not want to use the standard VIF.  This bothered me enough that I looked it up, and these are “centered” vs. “non-centered” versions of VIFs.  The standard definition is “centered,” and the issues are pretty much exactly as I described above.
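To check that the matrix expression above really is the textbook “regress column j on the rest” VIF, here is a sketch that computes both on a deliberately collinear made-up design (x2 is nearly a copy of x1) and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 100
x1 = rng.normal(size=M)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=M)   # nearly a linear copy of x1
x3 = rng.normal(size=M)
X = np.column_stack([np.ones(M), x1, x2, x3])

XtX_inv = np.linalg.inv(X.T @ X)
P1c = np.eye(M) - np.ones((M, M)) / M      # projection removing the mean
centered_ss = X.T @ P1c @ X                # centered sum-of-squares and cross-products

# Matrix form of the (centered) VIF for j >= 1.
vif_matrix = np.diag(centered_ss) * np.diag(XtX_inv)

def vif_regression(X, j):
    """Textbook form: 1 / (1 - R_j^2) from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)       # includes the intercept column
    xj = X[:, j]
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    ss_res = np.sum((xj - others @ coef) ** 2)
    ss_tot = np.sum((xj - xj.mean()) ** 2)
    return ss_tot / ss_res                 # equals 1 / (1 - R_j^2)

for j in (1, 2, 3):
    print(j, vif_matrix[j], vif_regression(X, j))
```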

Moving on to predictions, since \hat{y} = X\hat{\beta}, and the parameter estimates are all normally distributed, \hat{y} is normally distributed with covariance

\mbox{cov}[\hat{y}] = X\mbox{cov}[\hat{\beta}]X^T = \sigma^2 X(X^TX)^{-1}X^T,

and again we estimate by substituting \hat{\sigma}^2 for \sigma^2.  Note this is the error variance times the projection matrix onto the column space of X.  Unlike the variance of \hat{\beta}, this means the variance of our prediction is affected only by the column space of X and not at all by any near linear dependence among the columns.  In other words, multicollinearity is only an issue with respect to the model parameters and has zero effect on our predictions.  So we again have normal distributions with an estimated variance, and we can set up a confidence interval in the usual manner.  This is a confidence interval for the expected value of the model, as we did not account for the \varepsilon term.  If you want prediction intervals (which also account for this error in each measurement), just add \hat{\sigma}^2 I to the previous covariance matrix for \hat{y}.
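Here is a sketch of both intervals at a single new point (the point x_0 below is arbitrary): the confidence interval for the mean uses \hat{\sigma}^2 x_0^T(X^TX)^{-1}x_0, and the prediction interval adds \hat{\sigma}^2 for the new observation’s own noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
M, N = 60, 2
X = np.column_stack([np.ones(M), np.linspace(0, 1, M)])
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.3, size=M)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (M - N)
t_crit = stats.t.ppf(0.975, df=M - N)

# Arbitrary new point at which to predict: x0 = [1, 0.5].
x0 = np.array([1.0, 0.5])
y0_hat = x0 @ beta_hat

# Variance of the fitted mean: sigma^2 * x0^T (X^T X)^{-1} x0.
var_mean = sigma2_hat * x0 @ XtX_inv @ x0
# Prediction variance adds sigma^2 for the new observation's own noise.
var_pred = var_mean + sigma2_hat

print("confidence interval:", y0_hat - t_crit * np.sqrt(var_mean), y0_hat + t_crit * np.sqrt(var_mean))
print("prediction interval:", y0_hat - t_crit * np.sqrt(var_pred), y0_hat + t_crit * np.sqrt(var_pred))
```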

Lastly, I want to mention R^2, which is a goodness of fit measure.  People break up their regression into a bunch of different sums of squares (SS) and use these to calculate R^2.  This “break up” of the regression is just y = \hat{y} + \hat{\varepsilon}, and the partition into SS almost corresponds to ||y||^2 = ||\hat{y}||^2 + ||\hat{\varepsilon}||^2.  I say almost because when you look at the definition of the sums of squares, these vectors are centered (making the norms akin to variances).  Thus we take the norms above but after projecting everything onto the orthogonal complement of the all 1’s vector.  Recalling that the all 1’s vector is in X, \hat{\varepsilon} is already in this orthogonal complement, and ||P_{1^c}y||^2 = ||P_{1^c}\hat{y}||^2+||\hat{\varepsilon}||^2.  These are referred to as the total sum of squares, regression sum of squares, and error sum of squares respectively.  Then

R^2 = ||P_{1^c}\hat{y}||^2 / ||P_{1^c}y||^2.

The closer your model’s prediction \hat{y} is to y, the closer R^2 is to 1.  A small R^2 then clearly shows your prediction does not match the data well and consists mostly of the \hat{\varepsilon} term.  From just an algebra perspective, it seems unnecessary to project these with P_{1^c}, but this allows the norms to represent variances, which in turn makes them useful for hypothesis testing.
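As a last sketch (made-up data again), this checks the sum-of-squares partition numerically and that R^2 computed from the projected norms matches the usual 1 - SSE/SST.

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 80, 3
X = np.column_stack([np.ones(M), rng.normal(size=(M, N - 1))])
y = X @ np.array([1.0, 1.5, -2.0]) + rng.normal(scale=1.0, size=M)

H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
resid = y - y_hat

P1c = np.eye(M) - np.ones((M, M)) / M    # projection removing the mean
ss_total = np.sum((P1c @ y) ** 2)
ss_regression = np.sum((P1c @ y_hat) ** 2)
ss_error = np.sum(resid ** 2)

# The sums of squares partition, and R^2 is SSR / SST (equivalently 1 - SSE / SST).
print(ss_total, ss_regression + ss_error)
print(ss_regression / ss_total, 1.0 - ss_error / ss_total)
```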

There you go – sums of squares, variance inflation factors, degrees of freedom, and other stats jargon in terms of dimensions of subspaces, lengths of vectors, and properties of certain matrices…so much nicer.
