I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I started playing with Google’s TensorFlow. Specifically, TensorBoard is a visualization tool in TensorFlow that is downright beautiful, and t-SNE is one option for visualizing high-dimensional data. Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie. I stuck with some well-trodden ground by creating a neural net to classify MNIST characters. This is all secondary to the point here, however, because I really want to focus on t-SNE. Specifically, I want to know the math behind it while using MNIST and TensorBoard for some pretty visuals. For a purely practical/application perspective on t-SNE, check out this cool article. The original t-SNE paper is here.
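As a rough preview of that math: t-SNE converts pairwise distances into two probability distributions – a Gaussian-based one in the original space and a heavy-tailed Student-t one in the embedding – and minimizes the KL divergence between them. A minimal sketch on toy random data (a fixed bandwidth stands in for the per-point perplexity search, and there is no gradient descent here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))   # 6 points in 10 dimensions
Y = rng.normal(size=(6, 2))    # a candidate 2-D embedding

def sq_dists(A):
    # Pairwise squared Euclidean distances.
    s = (A * A).sum(axis=1)
    return s[:, None] + s[None, :] - 2 * A @ A.T

# High-dimensional affinities: Gaussian kernel, normalized to sum to 1.
# (Real t-SNE uses conditional probabilities with per-point sigmas chosen
# to match a target perplexity, then symmetrizes; sigma=1 is a toy stand-in.)
P = np.exp(-sq_dists(X) / 2.0)
np.fill_diagonal(P, 0.0)
P = P / P.sum()

# Low-dimensional affinities: Student-t kernel (1 degree of freedom),
# whose heavy tails are what relieve the "crowding problem".
Q = 1.0 / (1.0 + sq_dists(Y))
np.fill_diagonal(Q, 0.0)
Q = Q / Q.sum()

# t-SNE's objective: minimize KL(P || Q) over the embedding Y.
mask = P > 0
kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

A real implementation then follows the gradient of this KL divergence with respect to Y; since the embedding above is random, the divergence starts out well above zero.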
A colleague was recently discussing analyzing survey data and mentioned factor analysis (FA). As he described FA, it sounded much like the ubiquitous principal component analysis (PCA) approach, but this process sometimes goes by other names when applied in different contexts. I asked how FA differs from PCA. Apparently I opened a can of worms – PCA and FA are often (incorrectly) used interchangeably, especially in the soft sciences. Adding to the confusion, I’m told SPSS uses PCA as its default FA method. Even Wikipedia’s discussion of this specific misconception leaves something to be desired. While there are variations of FA, I want to take my own look at vanilla FA and PCA to get to the fundamental difference in their machinery.
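To make the distinction concrete, here is a hedged sketch of that machinery: PCA diagonalizes the full correlation matrix, while FA explains only the *shared* covariance, leaving per-variable "unique" variance in a diagonal matrix Psi. The data, loadings, and one-factor principal-axis iteration below are illustrative simplifications (real FA software often fits by maximum likelihood instead):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
f = rng.normal(size=n)                      # one common factor
load = np.array([0.9, 0.8, 0.7, 0.6])       # true loadings (made up)
X = np.outer(f, load) + rng.normal(scale=0.5, size=(n, p))
R = np.corrcoef(X, rowvar=False)

# PCA: eigenvectors of the correlation matrix itself.
evals, evecs = np.linalg.eigh(R)
pc1 = evecs[:, -1] * np.sqrt(evals[-1])      # first component "loadings"

# FA (principal-axis style): iterate on the *reduced* matrix R - Psi,
# whose diagonal holds communalities rather than total variances.
psi = np.full(p, 0.5)                        # initial unique variances
for _ in range(100):
    Rr = R - np.diag(psi)
    w, v = np.linalg.eigh(Rr)
    lam = v[:, -1] * np.sqrt(max(w[-1], 0))  # one-factor loadings
    psi = 1.0 - lam ** 2                     # update unique variances
```

The two loading vectors come out similar but not identical: PCA absorbs each variable's unique variance into its component, while FA partitions it out into Psi – which is exactly the fundamental difference in machinery.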
Suppose you have a single sample $x$ drawn from a multivariate normal distribution $N(\theta, I_p)$ in dimension $p \geq 3$. From this observation, you want to find a “good” estimate for the mean $\theta$. We will define a “good” estimate as one $\hat{\theta}$ such that the expected value of the squared Euclidean distance between $\hat{\theta}$ and $\theta$ is small. An obvious and reasonable choice would be to take $\hat{\theta} = x$. Surprisingly (at first), there are other estimators which are better. The most well-known estimator which provably dominates this estimate is the James-Stein estimator:

$$\hat{\theta}_{JS} = \left(1 - \frac{p - 2}{\|x\|^2}\right) x.$$
Notice this shrinks our naive estimate toward the origin. With this in mind, the surprise (somewhat) fades, as the pictures below give pretty clear general intuition.
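For anyone who wants to see the domination numerically as well as in pictures, here is a quick simulation (my own toy check, not part of the formal argument) comparing the average squared error of the naive estimate and the James-Stein estimate:

```python
import random

random.seed(42)
p = 10                                 # dimension (must be >= 3)
theta = [1.0] * p                      # true mean (arbitrary choice)
trials = 20000

err_naive = err_js = 0.0
for _ in range(trials):
    x = [t + random.gauss(0, 1) for t in theta]   # one N(theta, I_p) draw
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm2     # James-Stein shrinkage factor
    js = [shrink * v for v in x]       # (plain JS; a positive-part variant
                                       #  clips shrink at 0 and does even better)
    err_naive += sum((a - b) ** 2 for a, b in zip(x, theta))
    err_js += sum((a - b) ** 2 for a, b in zip(js, theta))

mse_naive = err_naive / trials         # should be close to p = 10
mse_js = err_js / trials               # provably smaller
```

With this setup the naive risk lands near $p = 10$ while the James-Stein risk comes out noticeably lower, matching the theory.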
One year ago I wrote this where I used the 1000 most popular baby names each year to find spikes in name popularity. Inspired by this here (or maybe by stealing her idea), I connected these spikes to real or fictional characters in history.
Somehow I missed that the Social Security Administration allows researchers to download more complete data, data which includes ALL names (almost – a name must have been given to at least 5 people to be included). Further, it’s conveniently accessible here, so let’s do more investigation. Rather than pull out names with extreme properties as before, let’s try to predict changes in a subset of names. There were some spikes in popularity for a couple presidential names in our last look (specifically for Woodrow and Grover), so do these spikes occur with most presidents’ names? If so, can we predict how big the swing in popularity will be? Turns out, yes and yes.
No interesting data. No scripts. No pictures. Instead I’m going to attack one of my pet peeves. I was recently working on a project, doing a simple regression, when I wanted to look up something about the variance of an estimated parameter. I found myself on the Wikipedia page for the variance inflation factor. There was a general description and an equation, but no intuition. I checked several other top hits in a Google search, and it was the same story. By intuition, I mean an explanation at a level where there is no need to remember an equation. Another example: we divide by one less than the sample size when estimating the variance of a population from a sample, but why? Answer: “because that makes it an unbiased estimator.” Okay, but why? Answer: “because you can calculate the expected value and see it works,” or “that’s how many degrees of freedom you have available for your estimate.” Those are the answers often given to students, but they still just describe the rule without a fundamental explanation. These stats concepts can all be expressed in terms of subspaces, orthogonal complements, projections…you know…linear algebra stuff. This is the language for visualization and intuition (in my opinion). So, returning to my pet peeve, I hate problems that are confusing only because they are presented in an unfamiliar language. Today I’m going to take ordinary least squares (OLS) from a stats perspective (degrees of freedom, sums of squares, $R^2$, prediction intervals, variance inflation factors, etc.) and make it all more intuitive in the language of linear algebra. OLS is as old as dirt, so I’m sure this all exists somewhere, but it’s certainly less common than the standard presentation, and I may as well put it here for my own reference.
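As a taste of that linear-algebra view, here is a small sketch on made-up data: the fitted values are the orthogonal projection of $y$ onto the column space of the design matrix, the residuals live in the orthogonal complement, and "degrees of freedom" are just the dimensions of those subspaces:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the projection ("hat") matrix
y_hat = H @ y                          # projection of y onto col(X)
resid = y - y_hat                      # lives in the orthogonal complement

# Residuals are orthogonal to every column of X (hence to y_hat too).
ortho = X.T @ resid                    # should be (numerically) zero

# The trace of a projection equals the dimension of the subspace it
# projects onto – this is where "degrees of freedom" come from.
df_model = np.trace(H)                 # = k, the number of parameters
df_resid = n - df_model                # dimensions left for the residual

# The familiar "divide by n minus something" rule: divide the residual
# sum of squares by the dimension of the subspace the residual lives in.
sigma2_hat = resid @ resid / df_resid
```

The $n-1$ in the sample variance is the same story with a one-column design matrix (just the intercept): projecting onto it costs one dimension, leaving $n-1$ for the residual.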
In my previous post, I created and trained a neural network with some toy data. The intent was to do something simple before hitting real-world data. Here my “real problem” is image classification. Specifically, I borrowed a problem from Kaggle where we are asked to train a classifier for distinguishing cats from dogs. I recognize that my best bet would be to tweak an existing network, as training larger networks tends to be difficult, but performance is not my goal. Rather, I want to start from the beginning; I plan to build my classifier from scratch (and hopefully learn something in the process).
Neural networks achieve state-of-the-art performance for a variety of machine learning tasks such as speech recognition and image classification. In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net. One especially interesting colloquial article about neural networks is here where, given an image, a trained network is used to emphasize features in the image to which the network responds. These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use these nets. Ignoring large convolutional networks for now (which are typical for these image problems), the first step is to just code up a toy neural net. Below I’ll talk details and mention some of the issues encountered, but the following image illustrates some of the results.
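For a sense of scale, a toy net like this can fit in a few dozen lines. The sketch below is my own minimal example – one hidden layer, sigmoid activations, full-batch gradient descent on XOR – and not necessarily the post's actual architecture or data:

```python
import numpy as np

rng = np.random.default_rng(0)
# XOR: the classic toy problem that a single linear layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8))   # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # hidden -> output weights
b2 = np.zeros(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
losses = []
for _ in range(10000):
    h = sig(X @ W1 + b1)                  # forward pass: hidden layer
    out = sig(h @ W2 + b2)                # forward pass: output
    losses.append(float(((out - y) ** 2).sum()))
    # Backpropagate the squared-error loss.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Even at this scale the usual training issues show up: the loss can plateau for long stretches, and a bad initialization or learning rate can stall it entirely.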
I would really like to see how well I could train a classifier to recognize song genre based on given lyrics. Just developing a large data set with the pertinent info seems to be an undertaking. The Million Song Dataset is a great resource, but it does not include genre or lyrics in the original data. Maybe this is a task for later, but for a Saturday project, getting this data set is too much work. Instead, I’ll scrape information for the top 100 most popular pop, country, and hip hop songs of all time according to songlyrics.com (relatively small data set, but easier to obtain the data) and simply explore the data. More specifically, I want to examine which lyrics are most common to songs in each genre. I don’t care how frequently a word appears in any one song, I just want to look at the total percentage of songs in each genre containing particular lyrics.
All I am going to do here is describe the data; I’ll analyze it more closely in a later post. Due to broken links, I ended up with 91 hip hop tracks, 87 country songs, and 99 pop tracks. I removed any articles (a, an, the, etc.) and “me” and “you” as these tend to be the most common words and are uninteresting. I also removed any lyrics appearing only in a single song. This produced 3,214 unique terms among 277 songs. Some basic info for each genre is shown below (the red bar represents +/- one standard deviation):
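The statistic described above – the percentage of songs in a genre containing a word at least once, ignoring per-song frequency – is simple to compute. A sketch on a tiny made-up stand-in corpus (the real data is the scraped set described above):

```python
from collections import defaultdict

# Each entry: (genre, lyrics). A made-up stand-in for the scraped songs.
songs = [
    ("pop", "baby love dance night"),
    ("pop", "love heart tonight"),
    ("country", "truck road love whiskey"),
    ("country", "road home night"),
    ("hiphop", "money love street"),
]

counts = defaultdict(lambda: defaultdict(int))  # genre -> word -> # songs with it
totals = defaultdict(int)                       # genre -> number of songs
for genre, lyrics in songs:
    totals[genre] += 1
    for word in set(lyrics.split()):            # set(): count each song once
        counts[genre][word] += 1

# Percentage of songs in each genre containing each word.
pct = {g: {w: 100.0 * c / totals[g] for w, c in ws.items()}
       for g, ws in counts.items()}
```

The `set()` call is what enforces "I don't care how frequently a word appears in any one song": a word repeated fifty times in a chorus still counts as one song.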
Logistic regression, support vector machines (SVMs), and perceptrons all serve a similar purpose. Namely, they all play a role in binary classification, but I’ve never taken the time to closely examine the similarities and differences. I’m going to use this post to force myself to examine these three approaches to classification in some detail, meaning I’m focusing more on theory than data. Throughout, I’ll assume $X = \{x_1, \dots, x_n\}$ is a collection of training vectors in $\mathbb{R}^d$, each given a label $y_i$ corresponding to one of two classes ($y_i \in \{0, 1\}$ or $y_i \in \{-1, 1\}$, depending on which classification method we are considering). So I don’t focus completely on theory, I’ll write some code to classify handwritten digits using each method. Specifically, I’ll be taking handwritten 2’s and 5’s (chosen for no particular reason) from the USPS digit database. Just for some context, the first 5 entries in the database for each digit look like the following:
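As a warm-up for the comparison, the simplest of the three is the perceptron: its update rule nudges the weight vector whenever a training point is misclassified. A sketch on toy separable 2-D data with $\pm 1$ labels (the convention shared with SVMs), not the USPS digits:

```python
import random

random.seed(0)

def make_point():
    # Toy linearly separable data; reject points too close to the true
    # separating line so the perceptron is guaranteed to converge.
    while True:
        x = [random.uniform(-1, 1), random.uniform(-1, 1)]
        if abs(x[0] + x[1]) > 0.2:
            return x, (1 if x[0] + x[1] > 0 else -1)

data = [make_point() for _ in range(200)]
w, b = [0.0, 0.0], 0.0
for _ in range(500):  # epochs (converges long before this for this data)
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:   # misclassified
            w[0] += y * x[0]                           # nudge toward the point
            w[1] += y * x[1]
            b += y

errors = sum(1 for x, y in data
             if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0)
```

The perceptron convergence theorem guarantees zero training errors here because the data has a margin; logistic regression and SVMs differ precisely in what they optimize beyond "no mistakes".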
I decided to go with a modern classic for this one: PageRank. While Google certainly uses many different algorithms for returning search results, the PageRank algorithm was their first. While other search engines used text content of webpages, URLs, or even human editors to rank the importance of pages, Larry Page and Sergey Brin co-founded Google and employed hyperlinks to rank importance. There is a ton of history here, but the short version is that the ability of PageRank to return useful search results helped Google beat out dozens of other search engines from the ’90s and ’00s. I thought it would be fun to create a webcrawler to collect links over a small collection of pages and then apply PageRank to see which ones are most important. I’ll also explore using PageRank as a means to suggest other pages to read.
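The core of PageRank is small enough to sketch directly: each page's rank is a damped sum of the ranks of the pages linking to it, computed by power iteration. A toy example on a made-up four-page link graph (a stand-in for the crawled one; dangling pages and other real-world wrinkles are ignored):

```python
links = {                       # page -> pages it links to (made up)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = sorted(links)
d, N = 0.85, len(pages)         # d is the usual damping factor
rank = {p: 1.0 / N for p in pages}

for _ in range(100):            # power iteration until (roughly) converged
    # Every page gets a baseline (1-d)/N from the "random jump".
    new = {p: (1 - d) / N for p in pages}
    # Each page q shares its damped rank equally among its outlinks.
    for q, outs in links.items():
        share = d * rank[q] / len(outs)
        for p in outs:
            new[p] += share
    rank = new
```

Here page "c" ends up with the highest rank, since three of the four pages link to it – which is exactly the intuition: importance flows along hyperlinks.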