I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I started playing with Google’s TensorFlow. Specifically, TensorBoard is a visualization tool in TensorFlow that is downright beautiful, and t-SNE is one of its options for visualizing high-dimensional data. Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie. I stuck to some well-trodden ground by creating a neural net to classify MNIST characters. This is all secondary to the point here, however, because I really want to focus on t-SNE. Specifically, I want to understand the math behind it while using MNIST and TensorBoard for some pretty visuals. For a purely practical/application perspective on t-SNE, check out this cool article. The original t-SNE paper is here.
A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA). As he described FA, it sounded much like the ubiquitous principal component analysis (PCA) approach, which sometimes goes by other names when applied in different contexts. I asked how FA differs from PCA. Apparently I opened a can of worms – PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences. Adding to the confusion, I’m told SPSS uses PCA as its default FA method. Even Wikipedia’s discussion of this specific misconception leaves something to be desired. While there are variations of FA, I want to take my own look at vanilla FA and PCA to get to the fundamental difference in their machinery.
Suppose you have a sample $x$ drawn from a multivariate normal distribution $N(\theta, I)$ in dimension $d \geq 3$. From this observation, you want to find a “good” estimate for $\theta$. We will define our “good” estimate as one such that the expected value of the squared Euclidean distance between $\hat{\theta}$ and $\theta$ is small. An obvious and reasonable choice would be to take $\hat{\theta} = x$. Surprisingly (at first), there are other estimators which are better. The most well-known estimator which provably dominates this estimate is the James-Stein estimator:

$$\hat{\theta}_{JS} = \left(1 - \frac{d-2}{\|x\|^2}\right) x.$$
Notice this shrinks our naive estimate toward the origin. With this in mind, the surprise (somewhat) fades as the pictures below give pretty clear general intuition.
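To make the dominance concrete before the pictures, here is a quick simulation sketch (my own illustration, not from the original post; the true mean, dimension, and trial count are arbitrary choices): it compares the average squared error of the naive estimate against the James-Stein estimate over many draws.

```python
import numpy as np

# Compare the naive estimate x against the James-Stein estimate for a
# fixed (arbitrarily chosen) true mean theta in d = 10 dimensions.
rng = np.random.default_rng(0)
d, trials = 10, 100_000

theta = np.ones(d)                             # the true mean to estimate
x = theta + rng.standard_normal((trials, d))   # one observation per trial

norm_sq = np.sum(x**2, axis=1, keepdims=True)
x_js = (1 - (d - 2) / norm_sq) * x             # shrink toward the origin

mse_naive = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((x_js - theta) ** 2, axis=1))

print(mse_naive, mse_js)   # James-Stein should have smaller average error
```

The naive estimate’s average squared error hovers around $d$; the shrunken estimate does strictly better, exactly as the theorem promises.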
One year ago I wrote this post, where I used the 1000 most popular baby names each year to find spikes in name popularity. Inspired by this post here (or maybe just stealing her idea), I connected these spikes to real or fictional characters in history.
Somehow I missed that the Social Security Administration allows researchers to download more complete data, data which includes ALL names (almost – a name must have been given to at least 5 people to appear). Further, it’s conveniently accessible here, so let’s do more investigating. Rather than pulling out names with extreme properties as before, let’s try to predict changes in popularity for a subset of names. There were spikes in popularity for a couple of presidential names in our last look (specifically for Woodrow and Grover), so do these spikes occur with most presidents’ names? If so, can we predict how big the swing in popularity will be? It turns out: yes and yes.
No interesting data. No scripts. No pictures. Instead I’m going to attack one of my pet peeves. I was recently working on a project, doing a simple regression, when I wanted to look up something about the variance of an estimated parameter. I found myself on the Wikipedia page for the variance inflation factor. There was a general description and an equation, but no intuition. I checked several other top hits in a Google search, and it was the same story. By intuition, I mean an explanation at a level where there is no need to remember an equation. Another example: we divide by one less than the sample size when estimating the variance of a population from a sample, but why? Answer: “because that makes it an unbiased estimator.” Okay, but why? Answer: “because you can calculate the expected value and see that it works,” or “that’s how many degrees of freedom you have available for your estimate.” Those are the answers often given to students, but they still just describe the rule without a fundamental explanation. These stats concepts can all be expressed in terms of subspaces, orthogonal complements, projections…you know…linear algebra stuff. This is the language for visualization and intuition (in my opinion). So, returning to my pet peeve, I hate problems that are confusing only because they are presented in an unfamiliar language. Today I’m going to take ordinary least squares (OLS) from a stats perspective (degrees of freedom, sums of squares, $R^2$, prediction intervals, variance inflation factors, etc.) and make it all more intuitive in the language of linear algebra. OLS is as old as dirt, so I’m sure this all exists somewhere, but it’s certainly less common than the standard presentation, and I may as well put it here for my own reference.
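As a taste of the linear-algebra view, here is a small sketch (my own, with made-up data, not code from the post): fitting OLS is projecting $y$ onto the column space of $X$, and the trace of the projection matrix counts the degrees of freedom the fit consumes, which is exactly why dividing the residual sum of squares by $n - p$ estimates the noise variance.

```python
import numpy as np

# OLS as projection: y_hat = H y, where H projects onto col(X).
rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = np.array([2.0, -1.0, 0.5])     # arbitrary true coefficients
sigma = 1.5                           # arbitrary true noise level
y = X @ beta + sigma * rng.standard_normal(n)

# Projection ("hat") matrix onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
resid = y - y_hat                     # lives in the orthogonal complement

# trace(H) = p: the fit "uses up" p degrees of freedom, leaving n - p
# for the residuals -- hence RSS / (n - p) as the variance estimate.
df_model = np.trace(H)
sigma2_hat = resid @ resid / (n - df_model)
```

The residual vector is orthogonal to every column of $X$ by construction; no expected-value calculation needed to see where the degrees of freedom went.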
In my previous post, I created and trained a neural network with some toy data. The intent was to do something simple before hitting real-world data. Here my “real problem” is image classification. Specifically, I borrowed a problem from Kaggle where we are asked to train a classifier to distinguish cats from dogs. I recognize that my best bet would be to tweak an existing network, as training larger networks tends to be difficult, but performance is not my goal. Rather, I want to start from the beginning; I plan to build my classifier from scratch (and hopefully learn something in the process).
Neural networks achieve state-of-the-art performance on a variety of machine learning tasks such as speech recognition and image classification. In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net. One especially interesting colloquial article about neural networks is here, where, given an image, a trained network is used to emphasize the features in the image to which the network responds. These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use them. Ignoring large convolutional networks for now (which are typical for these image problems), the first step is just to code up a toy neural net. Below I’ll talk details and mention some of the issues encountered, but the following image illustrates some of the results.
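For flavor, here is the kind of minimal net I mean, sketched in numpy on the XOR toy problem (my own illustration with arbitrary layer sizes and learning rate, not the exact network from the post): one hidden layer, trained by full-batch gradient descent with the backward pass written out by hand.

```python
import numpy as np

# A one-hidden-layer net trained by gradient descent on XOR. The hidden
# width (8) and learning rate (0.5) are arbitrary illustrative choices.
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# squared-error loss before training, for comparison
loss0 = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

lr = 0.5
for _ in range(10_000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: chain rule for the squared-error loss, by hand
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

loss = np.mean((out - y) ** 2)
```

Even a toy like this surfaces the usual training issues: sensitivity to initialization and learning rate, and the occasional run that stalls in a bad local region.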