James-Stein and Bayes

Suppose you have a sample drawn from a multivariate normal distribution y \sim N(\theta, a^2 I) in dimension M.  From this observation, you want to find a “good” estimate for \theta \in \mathbb{R}^M.  We will call an estimate \hat{\theta} “good” if the expected squared Euclidean distance between \theta and \hat{\theta} is small.  An obvious and reasonable choice would be to take \hat{\theta} = y.  Surprisingly (at first), there are other estimators which are better.  The best-known estimator which provably dominates this choice (for M \geq 3) is the James-Stein estimator:

\hat{\theta}_{JS} = \left(1-\frac{(M-2)a^2}{||y||^2}\right)y.
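
In code this is a one-line map; here is a minimal numpy sketch (the function name js_estimate and its argument names are my own choices, not from any library):

    import numpy as np

    def js_estimate(y, a2):
        """James-Stein estimate of theta from a single observation y ~ N(theta, a2*I)."""
        M = y.size
        shrink = 1.0 - (M - 2) * a2 / np.dot(y, y)   # the James-Stein multiplier
        return shrink * y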

Notice this shrinks our naive estimate toward the origin.  With this in mind, the surprise (somewhat) fades as the pictures below give pretty clear general intuition.

[Figure: james_stein — samples shrunk toward the origin by the James-Stein estimator]
Each image shows 10,000 points drawn from N(\theta,I) in \mathbb{R}^3 projected onto a plane.  Each image has a different mean; \theta = (x,x,x) where x is 0, 2, 5, 10 moving from left to right.  We then shrink each point toward the origin according to the James-Stein estimator.  The blue dots represent points which improve (are closer to the true mean) if we use James-Stein instead of the original data point.  The red dots represent points which get worse.  In each case more points improve than worsen as they are pulled toward the origin, and on average the James-Stein estimate is better regardless of the mean.
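
A rough way to reproduce this experiment (a sketch, not the exact code behind the figure) is to draw 10,000 points from N(\theta, I) in \mathbb{R}^3, apply the James-Stein map to each, and count how many move closer to the true mean:

    import numpy as np

    rng = np.random.default_rng(0)
    a2, M, n = 1.0, 3, 10_000
    for x in (0, 2, 5, 10):
        theta = np.full(M, float(x))
        y = rng.normal(theta, np.sqrt(a2), size=(n, M))
        shrink = 1.0 - (M - 2) * a2 / np.sum(y**2, axis=1)      # James-Stein multipliers
        js = shrink[:, None] * y
        improved = np.linalg.norm(js - theta, axis=1) < np.linalg.norm(y - theta, axis=1)
        print(x, improved.mean(),                               # fraction of "blue" points
              np.mean(np.sum((y - theta)**2, axis=1)),          # average ML squared error
              np.mean(np.sum((js - theta)**2, axis=1)))         # average JS squared error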

There are many variants of the James-Stein estimator.  For example, there is nothing special about shrinking toward the origin.  Also note the red points in the image on the left; for points with small norms, the multiplier in the James-Stein estimator is negative and actually pushes the estimate away from the origin.  This too can be easily remedied.  It seems the only magic in James-Stein concerns the amount of shrinkage applied.  Where does the multiplier in the James-Stein estimator come from?  We cannot shrink too much or too little.  If we had some a priori idea of what \theta should be, we could choose this as our point to shrink toward and hope to get something like the image on the left as opposed to the one on the right.  This feels very Bayesian.  In fact, the multiplier can be readily obtained from a Bayesian estimate of \theta.
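
As an aside, the two remedies just mentioned, clipping the negative multiplier at zero and shrinking toward a point other than the origin, are easy to write down; here is a sketch of the standard positive-part variant (the function name js_plus and the target argument are mine):

    import numpy as np

    def js_plus(y, a2, target=None):
        """Positive-part James-Stein estimate, shrinking toward target (default: the origin)."""
        m = np.zeros_like(y) if target is None else target
        d = y - m
        shrink = max(0.0, 1.0 - (y.size - 2) * a2 / np.dot(d, d))   # multiplier is never negative
        return m + shrink * d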

Let’s assume we don’t know \theta, but we do know y|\theta \sim  N(\theta, a^2 I) (our likelihood).  We know this conditional distribution as it represents a known measurement process.  We believe \theta \sim N(0,b^2I) (our prior).  We are taking this prior centered at 0, but the mean of the prior dictates the point toward which our estimate shrinks.  For now, let’s also suppose we somehow know b.  Our goal is then to estimate \theta|y (our posterior).  This is a direct application of Bayes’ theorem, which states that the pdf of the posterior is proportional to the pdf of the likelihood times the pdf of the prior.  Using this and the pdfs for normal distributions, it is not too hard to derive that

\theta|y \sim N\left(\frac{b^2}{a^2+b^2}y, \frac{a^2b^2}{a^2+b^2}I\right).
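
For the record, the derivation is just completing the square in \theta in the exponent of the product of the two pdfs:

p(\theta|y) \propto \exp\left(-\frac{||y-\theta||^2}{2a^2}\right)\exp\left(-\frac{||\theta||^2}{2b^2}\right) \propto \exp\left(-\frac{1}{2}\left(\frac{1}{a^2}+\frac{1}{b^2}\right)\left\|\theta-\frac{b^2}{a^2+b^2}y\right\|^2\right),

and \frac{1}{a^2}+\frac{1}{b^2} = \frac{a^2+b^2}{a^2b^2} gives the stated variance.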

With this posterior distribution, we should take our estimate of \theta to be

\hat{\theta}_{Bayes} = \frac{b^2}{a^2+b^2}y = \left(1-\frac{a^2}{a^2+b^2}\right)y.

Indeed we have shrunk our estimate according to our prior, but this isn’t yet the James-Stein estimator.  Recall we assumed we knew b.  What if we have some idea for what \theta may be, but we have no clue about b?  Well, we can estimate b from the data y; this would fit into the category of empirical Bayes methods.

Given y, if our goal is to estimate b, we don’t really care about \theta.  Thus we look at the marginal distribution from which y was drawn.  Again, using the pdfs for our prior and likelihood, we integrate over \theta and discover the marginal distribution y \sim N(0,(a^2+b^2)I) so that

||y||^2 \sim (a^2+b^2)\chi^2_M.
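
A quick simulation check of this marginal (with a^2 = 1, b^2 = 4, and M = 10 picked arbitrarily): draw \theta from the prior and then y given \theta; the first two moments of ||y||^2/(a^2+b^2) should come out near those of \chi^2_M, namely M and 2M.

    import numpy as np

    rng = np.random.default_rng(0)
    a2, b2, M, n = 1.0, 4.0, 10, 200_000
    theta = rng.normal(0.0, np.sqrt(b2), size=(n, M))   # draws from the prior
    y = rng.normal(theta, np.sqrt(a2))                  # draws from the likelihood
    s = np.sum(y**2, axis=1) / (a2 + b2)                # should behave like chi^2_M
    print(s.mean(), s.var())                            # roughly M = 10 and 2M = 20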

As we want to estimate b^2, we should take

\hat{b}^2 = E\left[ \frac{||y||^2}{\chi^2_M}-a^2\right] = \frac{||y||^2}{M-2}-a^2

where we used the expected value of an inverse chi-squared distribution, E\left[1/\chi^2_M\right] = 1/(M-2) for M > 2.  Finally, we substitute our estimate for b into our previous Bayes estimate \hat{\theta}_{Bayes}, and we get the James-Stein estimate:

\hat{\theta}_{JS} = \left(1-\frac{a^2}{a^2+\hat{b}^2}\right)y = \left(1-\frac{(M-2)a^2}{||y||^2}\right)y.
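
As a sanity check on the M-2 in the denominator, simulation confirms that the data-driven shrinkage factor (M-2)a^2/||y||^2 averages out to the “oracle” Bayes factor a^2/(a^2+b^2) under the marginal above (again with arbitrary a^2 = 1, b^2 = 4, M = 10):

    import numpy as np

    rng = np.random.default_rng(0)
    a2, b2, M, n = 1.0, 4.0, 10, 200_000
    theta = rng.normal(0.0, np.sqrt(b2), size=(n, M))       # prior draws
    y = rng.normal(theta, np.sqrt(a2))                      # likelihood draws
    factor = np.mean((M - 2) * a2 / np.sum(y**2, axis=1))   # E[(M-2)a^2/||y||^2]
    print(factor, a2 / (a2 + b2))                           # both approximately 0.2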

We see the James-Stein estimator is precisely the expected value of the posterior for \theta|y using an empirical Bayes estimate.  That’s great, but is it really that much better than the usual maximum likelihood estimate \hat{\theta}_{ML}= y?  Like most things, it depends.  If \theta truly is 0, we can calculate

E\left[||\hat{\theta}_{ML}-0||^2\right] = Ma^2  

E\left[||\hat{\theta}_{Bayes}-0||^2\right] = \left(\frac{b^2}{a^2+b^2}\right)^2Ma^2

E\left[||\hat{\theta}_{JS}-0||^2\right] = 2 a^2.
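
These three numbers are easy to verify by simulation; a minimal sketch with a^2 = 1, b^2 = 4, and M = 50 chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(0)
    a2, b2, M, n = 1.0, 4.0, 50, 100_000
    theta = np.zeros(M)                                      # the prior is correct here
    y = rng.normal(theta, np.sqrt(a2), size=(n, M))
    bayes = (b2 / (a2 + b2)) * y
    js = (1.0 - (M - 2) * a2 / np.sum(y**2, axis=1))[:, None] * y
    for name, est in [("ML", y), ("Bayes", bayes), ("JS", js)]:
        print(name, np.mean(np.sum((est - theta)**2, axis=1)))
    # roughly M*a2 = 50, (b2/(a2+b2))^2*M*a2 = 32, and 2*a2 = 2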

So if our prior is correct, the savings can be huge in high dimensions.  Of course, the improvement deteriorates with worse priors, but we can calculate more generally

E\left[||\hat{\theta}_{ML}-\theta||^2\right] = Ma^2  

E\left[||\hat{\theta}_{Bayes}-\theta||^2\right] = \left(\frac{b^2}{a^2+b^2}\right)^2Ma^2+\left(\frac{a^2}{a^2+b^2}\right)^2||\theta ||^2

E\left[||\hat{\theta}_{JS}-\theta||^2\right] = \mbox{ ???}

In the Bayes estimate we see the effect of a poor prior in the \theta term, but averaged over \theta drawn from the prior, it is an improvement over the maximum likelihood estimate.  It would be nice to have E\left[||\hat{\theta}_{JS}-\theta||^2\right], but this has proven to be a difficult term to evaluate due to the norm in the denominator.  In fact, all the James-Stein related theorems I’ve found prove the James-Stein estimator dominates the maximum likelihood estimate for any \theta but say nothing about the magnitude of the improvement in terms of \theta.  Still, in practice, the benefits are often significant, which is why the James-Stein estimate is more than just a novelty.
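
Even without a closed form, E\left[||\hat{\theta}_{JS}-\theta||^2\right] is easy to estimate by Monte Carlo for any particular \theta; a sketch, with \theta = (5,\dots,5), a^2 = 1, b^2 = 4, and M = 10 as arbitrary choices (a deliberately poor match to the prior):

    import numpy as np

    rng = np.random.default_rng(0)
    a2, b2, M, n = 1.0, 4.0, 10, 200_000
    theta = np.full(M, 5.0)                      # far from the prior mean 0
    y = rng.normal(theta, np.sqrt(a2), size=(n, M))
    bayes = (b2 / (a2 + b2)) * y
    js = (1.0 - (M - 2) * a2 / np.sum(y**2, axis=1))[:, None] * y

    def risk(est):
        return np.mean(np.sum((est - theta)**2, axis=1))

    print("ML   ", risk(y))      # M*a2 = 10
    print("Bayes", risk(bayes))  # (b2/(a2+b2))^2*M*a2 + (a2/(a2+b2))^2*||theta||^2 = 16.4
    print("JS   ", risk(js))     # below M*a2, but with no simple closed form in theta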

4 thoughts on “James-Stein and Bayes”

  1. Søren Fuglede Jørgensen October 18, 2018 / 8:08 pm

    As far as I can tell, the intuition that the James–Stein estimator shrinks the estimate only holds whenever $\lVert y \rVert^2 \geq (m-2)a^2$. Suppose, for instance, that $a^2 = 1$, $m = 3$, and $y = (1e-5, 1e-5, 1e-5)$. Then $\hat{\theta}_{\mathrm{JS}}(y) = -(1e4, 1e4, 1e4)$; the estimate blows up as $y$ approaches the origin.

    On the other hand, the Bayesian estimator based on $\theta \sim N(0, b^2I)$ clearly shrinks the ML estimate, so the equivalence of the two estimators will also only hold in that same regime. (Indeed, the estimate of $b^2$ becomes negative for $\lVert y \rVert^2 < (m-2)a^2$.)


    • Søren Fuglede Jørgensen October 18, 2018 / 8:13 pm

      Or, okay, $\hat{\theta}_{\mathrm{JS}}(y) = -(1e4, 1e4, 1e4) + (1e-5, 1e-5, 1e-5)$ but still.


      • jesse peterson October 18, 2018 / 9:22 pm

        Looks like you didn’t distribute. The y in the numerator will keep the JS estimate from blowing up. The example you are trying to describe is exactly what is happening with the red dots in the image on the far left. For these y’s close to the origin the JS estimate gets pushed away from the origin, but not too far. For this case where you end up “shrinking” toward the true theta (here the origin), the intuition that the estimate shrinks should be interpreted as on average.


      • Søren Fuglede Jørgensen October 19, 2018 / 9:20 am

        The $y$ in the numerator is not enough to keep the estimator from pushing the estimate arbitrarily far from the origin. For the sake of example, assume that $a^2 = 1$, $m = 3$, and that $\lVert y \rVert^2 = 1/(n+1)$ for some $n > 0$. Then $\lVert \hat{\theta}_{\mathrm{JS}}(y) \rVert = \lVert (1 - (n + 1)) y \rVert = n \lVert y \rVert = n/\sqrt{n+1}$, which tends to infinity as $n \to \infty$. In particular, they certainly can get pushed away “too far” from the origin, but yes, since this happens with low probability, the expectation still improves (but again, the Bayesian interpretation fails).

