Let’s talk about t-SNE

I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I started playing with Google’s TensorFlow.  Specifically, TensorBoard is a visualization tool in TensorFlow that is downright beautiful, and t-SNE is one of its options for visualizing high-dimensional data.  Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie.  I stuck to some well-trodden ground by creating a neural net to classify MNIST characters.  This is all secondary to the point here, however, because I really want to focus on t-SNE.  Specifically, I want to understand the math behind it while using MNIST and TensorBoard for some pretty visuals.  For a purely practical/application perspective on t-SNE, check out this cool article.  The original t-SNE paper is here.

mnist_tsne
t-SNE applied to the 10,000-character MNIST test set.  Left: t-SNE applied to the original 784-dimensional data.  Even in the raw data, t-SNE picks up on the shared structure of similar characters; the two yellow clusters are slanted “1”s and vertical “1”s.  Right: t-SNE applied to 10-dimensional feature vectors extracted from a convolutional neural net (1.2% misclassification rate).  It’s interesting that t-SNE did not group the “1”s tightly even in the learned feature space, especially since I found them to be grouped well when projected onto a plane.  The smear is real structure, showing the transition from the slanted “1”s to the vertical “1”s.  Also, for the sake of interpretation: because t-SNE is non-linear, distances in the embedding do not have a clear connection to Euclidean distances in the original space.
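If you just want to reproduce something like the left panel without TensorBoard, scikit-learn’s TSNE works directly on the raw 784-dimensional vectors.  A minimal sketch (the fetch_openml source, subsample size, and perplexity are my choices here; this is not the TensorBoard pipeline used for the figure):

```python
# Sketch: t-SNE on raw MNIST pixels with scikit-learn (not the TensorBoard pipeline above).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

# Download MNIST (70,000 x 784) and subsample 10,000 points to match the figure.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=10000, replace=False)
X_sub, y_sub = X[idx] / 255.0, y[idx].astype(int)

# Embed the 784-dimensional pixel vectors into 2D.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X_sub)

plt.scatter(emb[:, 0], emb[:, 1], c=y_sub, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.show()
```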

Continue reading

What’s with factor analysis?

A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA).  As he described FA, it sounded much like the ubiquitous principal component analysis (PCA), which sometimes goes by other names when applied in different contexts.  I asked how FA differs from PCA.  Apparently I opened a can of worms – PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences.  Adding to the confusion, I’m told SPSS uses PCA as its default FA method.  Even Wikipedia’s discussion of this specific misconception leaves something to be desired.  While there are variations of FA, I want to take my own look at vanilla FA and PCA to get at the fundamental difference in their machinery.

FA
Each point represents an observation in \mathbb{R}^3.  The original data is one-dimensional (red on the right).  Gaussian measurement noise is added with per-dimension variances \sigma^2_1 = \sigma^2_2 = 0.01 and \sigma^2_3 = 0.2 (red on the left).  We project the data onto a two-dimensional subspace according to PCA (green in both images) and factor analysis (blue in both images).  Observe how PCA captures much of the variability from the very noisy third measurement, while factor analysis more closely captures the original noise-free data.
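If you want to poke at this difference yourself, here is a rough scikit-learn sketch of the setup in the figure.  The noise variances match the caption; the latent direction, sample size, and everything else are placeholder choices, not necessarily the code behind the plot:

```python
# Sketch: PCA vs factor analysis on 1D data embedded in R^3 with unequal noise.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
n = 500

# Latent 1D data embedded along a fixed direction in R^3 (placeholder direction).
direction = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
latent = rng.normal(size=(n, 1))
clean = latent @ direction[None, :]

# Add Gaussian measurement noise with variances 0.01, 0.01, 0.2 (as in the caption).
noise_std = np.sqrt([0.01, 0.01, 0.2])
X = clean + rng.normal(size=(n, 3)) * noise_std

pca = PCA(n_components=2).fit(X)
fa = FactorAnalysis(n_components=2).fit(X)

# PCA implicitly treats the noise as isotropic; FA estimates a separate noise variance
# per coordinate, so its subspace is pulled less toward the noisy third measurement.
print("PCA components:\n", pca.components_)
print("FA loadings:\n", fa.components_)
print("FA estimated noise variances:", fa.noise_variance_)
```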

Continue reading

James-Stein and Bayes

Suppose you have a sample drawn from a multivariate normal distribution y \sim N(\theta, a^2 I) in dimension M.  From this observation, you want to find a “good” estimate for \theta \in \mathbb{R}^M.  We will call an estimate \hat{\theta} “good” if the expected squared Euclidean distance between \theta and \hat{\theta} is small.  An obvious and reasonable choice would be to take \hat{\theta} = y.  Surprisingly (at first), there are other estimators which are better whenever M \ge 3.  The most well-known estimator which provably dominates this naive estimate is the James-Stein estimator:

\hat{\theta}_{JS} = \left(1-\frac{(M-2)a^2}{||y||^2}\right)y.

Notice this shrinks our naive estimate toward the origin.  With this in mind, the surprise fades somewhat, and the pictures below give pretty clear general intuition.

james_stein
Each image is 10,000 points drawn from N(\theta,I) in \mathbb{R}^3, projected onto a plane.  Each image has a different mean: \theta = (x,x,x) where x is 0, 2, 5, 10 moving from left to right.  We then shrink each point toward the origin according to the James-Stein estimator.  The blue dots represent points which improve (move closer to the true mean) when we use James-Stein instead of the original data point.  The red dots represent points which get worse.  In every case more points improve than worsen as they are sucked toward the origin, and on average, the James-Stein estimate is better regardless of the mean.
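The figure is easy to reproduce in spirit.  A quick simulation sketch with a = 1, as in the images (my own code here, not necessarily what generated the figure):

```python
# Sketch: James-Stein shrinkage vs the raw observation, N(theta, I) in R^3.
import numpy as np

rng = np.random.default_rng(0)
M, n = 3, 10000  # dimension and number of draws

for x in (0, 2, 5, 10):
    theta = np.full(M, float(x))
    y = rng.normal(loc=theta, scale=1.0, size=(n, M))

    # James-Stein: shrink each observation toward the origin (a = 1 here).
    shrink = 1.0 - (M - 2) / np.sum(y**2, axis=1, keepdims=True)
    theta_js = shrink * y

    err_raw = np.sum((y - theta) ** 2, axis=1)
    err_js = np.sum((theta_js - theta) ** 2, axis=1)

    print(f"theta = ({x},{x},{x}): "
          f"{np.mean(err_js < err_raw):.2%} of points improve, "
          f"mean squared error {err_raw.mean():.3f} -> {err_js.mean():.3f}")
```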

Continue reading

Presidential names – popularity predictions

One year ago I wrote this post, where I used the 1,000 most popular baby names each year to find spikes in name popularity.  Inspired by this post (or maybe just stealing her idea), I connected these spikes to real or fictional characters in history.

Somehow I missed that the Social Security Administration allows researchers to download more complete data, data which includes ALL names (almost – names given to fewer than 5 people are excluded).  Further, it’s conveniently accessible here, so let’s do some more investigating.  Rather than pull out names with extreme properties as before, let’s try to predict changes in a subset of names.  There were spikes in popularity for a couple of presidential names in our last look (specifically for Woodrow and Grover), so do these spikes occur with most presidents’ names?  If so, can we predict how big the swing in popularity will be?  Turns out, yes and yes.

rr_all
For each president’s name (no repeat images if multiple presidents shared a name), we take the percentage of babies born with that name in each year.  Taking the ratio of this value to the previous year’s value shows how the popularity of that name is changing.  For most names, the maximum, or at least a local maximum, occurs in the year before the president takes office (the election year).  So, if we know a presidential candidate’s name, can we predict the size of this popularity swing (assuming they are elected) using only data from before the election year?
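For reference, the year-over-year ratio in these plots is simple to compute from the SSA files.  A sketch assuming the national yobYYYY.txt files (lines of name,sex,count) sit in a local names/ directory; the file layout and paths here are assumptions for illustration:

```python
# Sketch: year-over-year popularity ratio for a given name from SSA "yobYYYY.txt" files.
# Assumes the national data files (lines of "name,sex,count") live in ./names/.
import csv


def popularity_ratio(name, years, data_dir="names"):
    """Fraction of babies given `name` each year, then the ratio to the previous year."""
    pct = {}
    for year in years:
        total, count = 0, 0
        with open(f"{data_dir}/yob{year}.txt", newline="") as f:
            for row_name, _sex, n in csv.reader(f):
                n = int(n)
                total += n
                if row_name == name:
                    count += n
        pct[year] = count / total
    return {y: pct[y] / pct[y - 1] for y in years[1:] if pct[y - 1] > 0}


# Example: Woodrow around the 1912 election.
print(popularity_ratio("Woodrow", list(range(1905, 1920))))
```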

Continue reading

Translation please

No interesting data.  No scripts.  No pictures.  Instead I’m going to attack one of my pet peeves.  I was recently working on a project, doing a simple regression, when I wanted to look up something about the variance of an estimated parameter.  I found myself on the Wikipedia page for variance inflation factor.  There was a general description and an equation, but no intuition.  I checked several other top hits in a Google search, and it was the same story.  By intuition, I mean an explanation at a level where there is no need to remember an equation.  Another example: why do we divide by one less than the sample size when estimating the variance of a population from a sample?  Answer: “because that makes it an unbiased estimator.”  Okay, but why?  Answer: “because you can calculate the expected value and see it works,” or “that’s how many degrees of freedom you have available for your estimate.”  Those are the answers often given to students, but they still just describe the rule without a fundamental explanation.  These stats concepts can all be expressed in terms of subspaces, orthogonal complements, projections…you know…linear algebra stuff.  That is the language for visualization and intuition (in my opinion).  So, returning to my pet peeve, I hate problems that are confusing only because they are presented in an unfamiliar language.  Today I’m going to take ordinary least squares (OLS) from a stats perspective (degrees of freedom, sums of squares, R^2, prediction intervals, variance inflation factors, etc…) and make it all more intuitive in the language of linear algebra.  OLS is as old as dirt, so I’m sure this all exists somewhere, but it’s certainly less common than the standard presentation, and I may as well put it here for my own reference.

Continue reading
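As a quick taste of that linear-algebra view, here is the n-1 story (just a sketch; the full post treats OLS in general).  Write the sample as a vector y \in \mathbb{R}^n with y = \mu \mathbf{1} + \varepsilon and \varepsilon \sim N(0, \sigma^2 I), where \mathbf{1} = (1,\dots,1)^T.  The sample mean is the orthogonal projection of y onto the line spanned by \mathbf{1}:

\bar{y}\,\mathbf{1} = Py, \qquad P = \tfrac{1}{n}\mathbf{1}\mathbf{1}^T.

The residual vector r = y - \bar{y}\,\mathbf{1} = (I-P)\varepsilon lives in the (n-1)-dimensional orthogonal complement of \mathbf{1}, so

\mathbb{E}\,\|r\|^2 = \sigma^2\,\mathrm{tr}(I-P) = \sigma^2(n-1),

and dividing \|r\|^2 by n-1, the dimension of the subspace the residuals actually occupy, is exactly what makes the estimator unbiased.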

Neural net – cats and dogs

In my previous post, I created and trained a neural network with some toy data.  The intent was to do something simple before hitting real-world data.  Here my “real problem” is image classification.  Specifically, I borrowed a problem from Kaggle where we are asked to train a classifier to distinguish cats from dogs.  I recognize that my best bet would be to tweak an existing network, since training large networks from scratch tends to be difficult, but performance is not my goal.  Rather, I want to start from the beginning; I plan to build my classifier from scratch (and hopefully learn something in the process).

cat_dog
Samples from Kaggle’s cats-and-dogs training set, downsampled and converted to grayscale for input into my neural net.
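That preprocessing step is only a couple of lines.  A sketch with Pillow, where the 64x64 target size and the file path are examples rather than the exact choices behind the figure:

```python
# Sketch: downsample and grayscale an image for the net (target size is just an example).
import numpy as np
from PIL import Image


def load_image(path, size=(64, 64)):
    """Open an image, convert to grayscale, downsample, and scale pixels to [0, 1]."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0


# e.g. x = load_image("train/cat.0.jpg")  # hypothetical path into Kaggle's training folder
```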

Continue reading

Neural Net Toy Problems

Neural networks achieve state-of-the-art performance for a variety of machine learning tasks such as speech recognition and image classification.  In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net.  One especially interesting popular article about neural networks is here, where, given an image, a trained network is used to emphasize the features in the image to which the network responds.  These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use them.  Ignoring large convolutional networks for now (which are typical for these image problems), the first step is just to code up a toy neural net.  Below I’ll talk details and mention some of the issues I encountered, but the following image illustrates some of the results.

multiple
Left: Training data.  Center: Linear classifier.  Right: Feed forward neural net with one hidden layer containing 5 neurons.
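For anyone who wants a starting point, a net like the one in the right panel fits in a few dozen lines of numpy.  This is a sketch of the idea (sigmoid activations, plain gradient descent, and a made-up two-class toy set), not the exact code or data behind the figure:

```python
# Sketch: one-hidden-layer feed-forward net (5 hidden neurons) on made-up 2D toy data.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: a noisy ring around a central blob (a linear classifier does poorly here).
n = 200
angles = rng.uniform(0, 2 * np.pi, n)
X0 = np.c_[np.cos(angles), np.sin(angles)] + 0.1 * rng.normal(size=(n, 2))  # ring (class 0)
X1 = 0.3 * rng.normal(size=(n, 2))                                          # blob (class 1)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)].reshape(-1, 1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Weights: 2 inputs -> 5 hidden -> 1 output.
W1, b1 = 0.5 * rng.normal(size=(2, 5)), np.zeros(5)
W2, b2 = 0.5 * rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # Forward pass.
    H = sigmoid(X @ W1 + b1)   # hidden activations, shape (N, 5)
    p = sigmoid(H @ W2 + b2)   # predicted probability of class 1, shape (N, 1)

    # Backward pass for cross-entropy loss (dL/dz at the output is p - y).
    dz2 = (p - y) / len(X)
    dW2, db2 = H.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"training accuracy: {np.mean((p > 0.5) == (y == 1)):.3f}")
```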

Continue reading

Exploring Lyrics Data – Top 100 Songs of All Time by Genre

I would really like to see how well I could train a classifier to recognize song genre based on lyrics alone.  Just developing a large data set with the pertinent info seems to be an undertaking.  The Million Song Dataset is a great resource, but it does not include genre or lyrics in the original data.  Maybe that is a task for later; for a Saturday project, building such a data set is too much work.  Instead, I’ll scrape information for the top 100 most popular pop, country, and hip hop songs of all time according to songlyrics.com (a relatively small data set, but much easier to obtain) and simply explore the data.  More specifically, I want to examine which lyrics are most common to songs in each genre.  I don’t care how frequently a word appears within any one song; I just want the percentage of songs in each genre containing a particular lyric.

All I am going to do here is describe the data; I’ll analyze it more closely in a later post.  Due to broken links, I ended up with 91 hip hop tracks, 87 country songs, and 99 pop tracks.  I removed articles and similar filler words (a, as, the, etc…) along with “me” and “you,” as these tend to be the most common words and are uninteresting.  I also removed any lyrics appearing in only a single song.  This produced 3,214 unique terms among 277 songs.  Some basic info by genre is shown below (the red bars represent +/- one standard deviation):

average
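For what it’s worth, the bookkeeping behind “percentage of songs containing a word” is simple once the lyrics are scraped.  A sketch, assuming the scraped lyrics already sit in a dict mapping genre to a list of lyric strings (the scraping itself is the tedious part and is omitted):

```python
# Sketch: fraction of songs in each genre containing each word, given scraped lyrics.
# Assumes `lyrics` maps genre -> list of lyric strings (scraping from songlyrics.com omitted).
import re
from collections import Counter

STOPWORDS = {"a", "an", "as", "the", "me", "you"}  # crude stand-in for the filtering described above


def word_fractions(songs):
    """For one genre: fraction of songs whose lyrics contain each word at least once."""
    counts = Counter()
    for text in songs:
        words = set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS
        counts.update(words)
    return {w: c / len(songs) for w, c in counts.items()}


# Example use (hypothetical data):
# lyrics = {"pop": [...], "country": [...], "hip hop": [...]}
# fractions = {genre: word_fractions(songs) for genre, songs in lyrics.items()}
```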

Continue reading

Logistic Regression, Support Vector Machines, and Perceptrons

Logistic regression, support vector machines (SVM), and perceptrons all serve a similar purpose.  Namely, they all play a role in binary classification, but I’ve never taken the time to closely examine their similarities and differences.  I’m going to use this post to force myself to examine these three approaches to classification in some detail, meaning I’m focusing more on theory than on data.  Throughout, I’ll assume \{x_m\}_{m=1}^M is a collection of training vectors in \mathbb{R}^N, each given a label y_m corresponding to one of two classes (y_m\in \{0,1\} or y_m \in \{-1,1\} depending on which classification method we are considering).  So that I don’t focus completely on theory, I’ll write some code to classify handwritten digits using each method.  Specifically, I’ll be taking handwritten 2’s and 5’s (chosen for no particular reason) from the USPS digit database.  Just for some context, the first 5 entries in the database for each digit look like the following:

digits
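The post walks through from-scratch versions, but the comparison itself is easy to set up.  A rough sketch with scikit-learn’s off-the-shelf implementations, using its built-in 8x8 digits as a stand-in for the USPS data:

```python
# Sketch: logistic regression vs linear SVM vs perceptron on 2's and 5's.
# Uses scikit-learn's 8x8 digits as a stand-in for the USPS database.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()
mask = np.isin(digits.target, [2, 5])
X, y = digits.data[mask], digits.target[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(max_iter=10000),
    "perceptron": Perceptron(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")
```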

Continue reading

Classic PageRank

I decided to go with a modern classic for this one: PageRank.  While Google certainly uses many different algorithms for returning search results, the PageRank algorithm was their first.  Where other search engines used the text content of webpages, URLs, or even human editors to rank the importance of pages, Larry Page and Sergey Brin co-founded Google and employed hyperlinks to rank importance.  There is a ton of history here, but the short version is that the ability of PageRank to return useful search results helped Google beat out dozens of other search engines from the ’90s and ’00s.  I thought it would be fun to create a webcrawler to collect links over a small collection of pages and then apply PageRank to see which ones are most important.  I’ll also explore using PageRank as a means to suggest other pages to read.

Starting from the Wikipedia article on Planet Money, edges are hyperlinks and vertices show all pages within 2 links of this root.  Left: all pages and links.  Right: pages with links exclusively external to this collection of vertices have been removed.  Which page is the most important?
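For the curious, the PageRank computation itself is only a few lines once the crawler has produced a link graph.  A sketch of the standard power iteration; the graph here is a made-up placeholder, not the actual Planet Money crawl:

```python
# Sketch: PageRank by power iteration on a crawled link graph.
# `links` maps each page to the pages it links to (placeholder data, not the actual crawl).
def pagerank(links, damping=0.85, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            targets = [q for q in outlinks if q in links] or pages  # dangling pages spread evenly
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


links = {
    "Planet Money": ["NPR", "Economics"],
    "NPR": ["Planet Money"],
    "Economics": ["NPR", "Adam Smith"],
    "Adam Smith": ["Economics"],
}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
```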

Continue reading