Let’s talk about t-SNE

I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I starting playing with Google’s TensorFlow.  Specifically, TensorBoard is visualization tool in TensorFlow that is downright beautiful, and t-SNE is one option for visualizing high dimensional data.  Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie.  I stuck with some well-trodden ground by creating a neural net to classify MNIST characters.  This is all secondary to the point here, however, because I really want to focus on t-SNE.  Specifically, I want to know the math behind it while using MNIST and TensorBoard for some pretty visuals.  For a purely practical/application perspective of t-SNE, check out this cool article.  The original t-SNE paper is here.

t-SNE applied to 10,000 character MNIST test set.  Left: t-SNE applied to original 784-dimensional data.  Even in the raw data, t-SNE can illustrate the similar structure in similar characters.  The two clusters of yellow are slanty “1”s and vertical “1”s.  Right: t-SNE applied to 10-dimensional feature vectors extracted from a convolutional neural net (1.2% misclass rate).  It’s interesting that t-SNE and my classifier did not group the “1”s tightly, especially since I found them to be grouped well when projected onto a plane.  The smear is structure showing transition from the slanty “1”s to vertical “1”s.  Also for sake of interpretation, due to the non-linear nature of t-SNE, distances do not have a clear connection to Euclidean distance in the original space.

Continue reading

What’s with factor analysis?

A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA).  As he described FA, it sounded much like the ubiquitous principle component analysis (PCA) approach, but this process sometimes goes by other names when applied in different contexts.  I asked how FA differs from PCA.  Apparently I opened a can of worms – PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences.  Adding to the confusion, I’m told SPSS uses PCA as its default FA method.  Even Wikipedia’s discussion of this specific misconception leaves something to be desired.  While there are variations of FA, I want take my own look at vanilla FA and PCA to get to the fundamental difference in their machinery.

Each point represents an observation in \mathbb{R}^3.  Original data is one dimensional (red on right).  Gaussian measurement noise is added with variance in each dimension as \sigma^2_1 = \sigma^2_2 = 0.01 and \sigma^2_3 = 0.2 (red on left).  We project the data into a two dimensional subspace according to PCA (green in both images) and factor analysis (blue in both images).  Observe how PCA captures much of the variability from the very noisy third measurement while factor analysis more closely captures the original non-noisy data.

Continue reading

James-Stein and Bayes

Suppose you have a sample drawn from a multivariate normal distribution y \sim  N(\theta, a^2 I) in dimension M.  From this observation, you want to find a “good” estimate  for \theta \in \mathbb{R}^M.  We will define our “good” estimate as \hat{\theta} such that expected value of the Euclidean distance between \theta and \hat{\theta} is small.  An obvious and reasonable choice would be to take \hat{\theta} = y.  Surprisingly (at first), there are other estimators which are better.  The most well-known estimator which provably dominates this estimate is the James-Stein estimator:

\hat{\theta}_{JS} = \left(1-\frac{(M-2)a^2}{||y||^2}\right)y.

Notice this shrinks our naive estimate toward the origin.  With this in mind, the surprise (somewhat) fades as the pictures below give pretty clear general intuition.

Each image is 10,000 points drawn from N(\theta,I) in \mathbb{R}^3 projected onto a plane.  Each image has a different mean; \theta = (x,x,x) where x is 0,2,5,10 moving from left to right.  We then shrink each point toward the origin according to the James-Stein estimator.  The blue dots represents points which would improve (are closer to the true mean) if we used James-Stein as opposed to just the original data point.  The red represent points which are worse. In any case more points improve as they are sucked toward the origin, and on average, the James-Stein estimate is better regardless of mean.

Continue reading

Presidential names – popularity predictions

One year ago I wrote this where I used the 1000 most popular baby names each year to find spikes in name popularity.  Inspired by this here (or maybe by stealing her idea), I connected these spikes to real or fictional characters in history.

Somehow I missed that the social security administration allows researchers to download more complete data, data which includes ALL names (almost – there must be at least 5 people given the name).  Further, it’s conveniently accessible here, so let’s do more investigation.  Rather than pull out names with extreme properties as before, let’s try to predict changes in a subset of names.  There were some spikes in popularity for a couple presidential names in our last look (specifically for Woodrow and Grover), so do these spikes occur with most presidents’ names?  If so, can we predict how big the swing in popularity will be?  Turns out, yes and yes.

For each president’s name (no repeat images if multiple presidents had the same name), we take the percentage of babies born with that name in each year.  Taking ratios of this value over the previous year, this shows how the popularity of that name is changing.  For most names, the max or at least a kind of local max, occurs in the year before a president takes office (election year).  So, if we know a presidential candidate’s name, can we predict the size of this popularity swing (assuming they are the candidate elected to be president) using only the data prior to the election year?

Continue reading

Translation please

No interesting data.  No scripts.  No pictures.  Instead I’m going to attack one of my pet peeves.  I was recently working on a project, doing a simple regression, when I wanted to look up something about the variance of an estimated parameter.  I found myself on the Wikipedia page for variance inflation factor.  There was a general description, an equation, but no intuition.  I checked several other top hits in a Google search, and it was the same story.  By intuition, I mean an explanation at a level where there is no need to remember an equation.  Another example is when we divide by one less than the sample size when estimating the variance of a population from a sample, but why?  Answer: “because that makes it an unbiased estimator.”  Okay, but why?  Answer: “because you can calculate the expected value and see it works,” or “that’s how many degrees of freedom you have available for your estimate.”  Those are the answers often given to students, but these still just describe the rule without a fundamental explanation.   These stats concepts can all be expressed in terms of subspaces, orthogonal complements, projections…you know…linear algebra stuff.  This is the language for visualization and intuition (in my opinion).  So, returning to my pet peeve, I hate problems that are confusing only because they are presented in an unfamiliar language.  Today I’m going to take ordinary least squares (OLS) from a stats perspective (degrees of freedom, sums of squares, R^2, prediction intervals, variance inflation factors, etc…) and make it all more intuitive in the language of linear algebra.  OLS is as old as dirt, so I’m sure this all exists somewhere, but it’s certainly less common than the standard presentation, and I may as well put it here for my own reference. Continue reading

Neural net – cats and dogs

In my previous post, I created and trained a neural network with some toy data.  The intent was to do something simple before hitting real-world data.  Here my “real problem” is image classification.  Specifically, I borrowed a problem from Kaggle where we are asked to train a classifier for distinguishing cats from dogs.  I recognize that my best bet would be to tweak an existing network as training larger networks tends to be difficult, but performance is not my goal.  Rather I want to start from the beginning; I plan to build my classifier from scratch (and hopefully learn something in the process).

Samples from Kaggle’s training set for cats and dogs downsampled and made grayscale for input into my neural net.

Continue reading

Neural Net Toy Problems

Neural networks achieve state-of-the-art performance for a variety of machine learning tasks such as speech recognition and image classification.  In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net.  One especially interesting colloquial article about neural networks is here where, given an image, a trained network is used to emphasize features in the image to which the network responds.  These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use these nets.  Ignoring large convolutional networks for now (which are typical for these image problems), the first step is to just code up a toy neural net.  Below I’ll talk details and mention some of the issues encountered, but the following image illustrates some of the results.

Left: Training data.  Center: Linear classifier.  Right: Feed forward neural net with one hidden layer containing 5 neurons.

Continue reading