Schwarzenegger + Hanks + Sandler + Scorsese = Comedy

Given a list of actors and the director for a movie, how well can you predict the genre of the film? I thought this would be an interesting question to answer.  Along the way, I’ll also explore which actors (or directors) are most important for this prediction.  In some sense, I’m asking who is the quintessential typecast action movie star?  What about for a comedy flick?  Drama? Take a guess, and see if you agree with the results below.

To scope the problem, I decided to consider just three genres: action, comedy, and drama.  The site IMDb contains all kinds of movie info, so I scraped this site for my data.  Specifically, I extracted the top 1000 most popular movies (according to IMDb at least) in each genre.  I’d like to emphasize “according to IMDb,” as Daredevil is currently number 7 on the list of action movies.  I also have no idea how they determine genre.  Are you in the mood for a drama?  How about Fast & Furious with Vin Diesel and Paul Walker?  To be fair, IMDb labeled this as action and drama, but such a classification still seems odd.  Anyway, for each film, I also collected its director and top three actors.  I then put my data into a matrix X=\{X_{m,n}\}_{m=1,n=1}^{M,N} where X_{m,n}=1 if actor/director n is in movie m and X_{m,n}=0 otherwise.  Similarly, for each genre I created a vector of indicator labels.  For example, in the case of action movies, Y=\{Y_m\}_{m=1}^M is such that Y_m=1 if movie m is an action film and Y_m=0 otherwise.  Of the original 3000 movies I pulled from IMDb, several fall into multiple categories (think of the action-drama Fast & Furious or the action-comedy Rush Hour if you want an example that makes more sense), which I combined into single entries with multiple genre labels.  I also removed any actors/directors who appeared in only one movie.  All this left me with M=2333 movies and N=1633 actors/directors. Continue reading
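The construction of X and Y can be sketched in a few lines of NumPy.  This is a minimal illustration on made-up records, not the actual scraping pipeline; the movie tuples below are toy data.

```python
import numpy as np

# Toy records: (title, genres, [director] + top three actors).
movies = [
    ("Rush Hour", {"action", "comedy"},
     ["Brett Ratner", "Jackie Chan", "Chris Tucker", "Tom Wilkinson"]),
    ("Fast & Furious", {"action", "drama"},
     ["Justin Lin", "Vin Diesel", "Paul Walker", "Jordana Brewster"]),
]

# Assign each actor/director a column index n.
people = sorted({p for _, _, cast in movies for p in cast})
col = {p: n for n, p in enumerate(people)}

# X[m, n] = 1 if person n worked on movie m, else 0.
M, N = len(movies), len(people)
X = np.zeros((M, N), dtype=int)
for m, (_, _, cast) in enumerate(movies):
    for p in cast:
        X[m, col[p]] = 1

# One indicator label vector per genre; here, action.
Y_action = np.array([1 if "action" in genres else 0
                     for _, genres, _ in movies])
```

With the real data, X would be 2333 by 1633 and there would be three such Y vectors, one per genre.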

Last Word Clustering

I recently discovered the Texas Department of Criminal Justice has a website containing links to the last statements of executed offenders dating from 1982 until the present.  Regardless of your view on capital punishment, I think it’s safe to say that this is a unique, publicly available data set.  I thought an interesting project would be to scrape the data and see how well the statements could be clustered (spoiler alert – not very well).  I didn’t want to read any of the statements ahead of time; rather I wanted to do simple unsupervised clustering and see if the clusters corresponded to any common themes.

After extracting all the data, I had 418 statements.  Removing common stop words (“a”, “as”, “the”, etc.), and ignoring any terms present in only a single statement, I was left with 1294 unique words.  I then created a 1294 by 418 matrix A where A_{i,j} records the number of occurrences of word i in statement j.  As some statements are very long (the longest was over 7000 words), these statements will bias most clustering methods.  At the same time, I don’t want to scale each column to be unit norm, as clustering on the sphere doesn’t make intuitive sense either.  Instead, I scale each column by one over the maximum value in that column.  In this way, we have a penalty for long statements with many repeated words while still not greatly penalizing all long statements.  Just as we accounted for differences in statements, some words are very common and tend to be unimportant when clustering.  We account for this by weighting each row in A.  A natural choice is \ln(418/n_i) where n_i is the number of statements containing term i.  This tends to 0 as a term appears in more and more of the 418 statements.  These types of weighting factors are all rolled up under the name “term frequency-inverse document frequency.”   Our task is now to cluster the columns of A. Continue reading
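The column scaling and row weighting described above can be sketched as follows.  This is a toy example on a small made-up count matrix, not the real 1294 by 418 data; the numbers are illustrative only.

```python
import numpy as np

# Toy term-count matrix: rows = words, columns = statements.
A = np.array([[2., 0., 1.],
              [1., 1., 1.],
              [0., 3., 0.]])
n_docs = A.shape[1]

# Scale each column by 1 / (its maximum count), so a long statement
# with many repeated words is penalized, but length alone is not.
A = A / A.max(axis=0, keepdims=True)

# Weight row i by ln(n_docs / n_i), where n_i counts the statements
# containing word i; a word in every statement gets weight zero.
n_i = (A > 0).sum(axis=1)
A = A * np.log(n_docs / n_i)[:, None]
```

Note that the middle row, a word appearing in all three toy statements, is zeroed out entirely, which is exactly the behavior the \ln(418/n_i) weight gives common words in the real data.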

Names – Popularity, Politics, and Entertainment

A while back I came across a blog post by Hilary Parker which examined how the popularity of the name “Hillary” has declined.  The popularity of this name doesn’t concern me, but the post still caught my attention for two reasons:

  1. The top 1000 most popular baby names for every year since 1880 are available on a site run by the Social Security Administration – an interesting data set.
  2. Part of Hilary’s work tied the popularity of names to events in history – this sounds fun.

I thought I’d scrape the data from the Social Security Administration’s website myself and look specifically for male names which show a sudden burst in popularity.  Then, I can play a game where I google the names and try to guess an event or person from history which influenced this popularity boom.

For each year in [1880, 2012] I extracted the top 1000 most popular names and the percentage of babies born with those names.  This ended up being about 3500 unique names.  An intuitive approach to find names with large swings in popularity would be to treat each name as a vector whose components are the percentage of babies born with that name each year, normalize, and then take finite differences to determine which names and years show the greatest change.  The problem here is that when a name no longer appears in the top 1000 names for a year, there is no information available.  Obviously, some percentage of babies are still given this name, but we don’t have this information.  If a name slips into the top 1000 just for a few years, the resulting vector is very sparse; normalizing makes these sparse entries relatively large.  This artificially inflates the finite differences.  Still, plotting a large chunk of the normalized data set gives some indication of what the data looks like.
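The normalize-then-difference idea, and the sparsity artifact it suffers from, can be demonstrated on fabricated series.  The two “names” below are synthetic, chosen only to exhibit the problem; this is not the actual SSA data.

```python
import numpy as np

years = np.arange(1880, 2013)           # the interval [1880, 2012]

# Fraction of babies given each (made-up) name per year.
pop = {
    "Steady": np.full(len(years), 0.002),   # always in the top 1000
    "Sparse": np.zeros(len(years)),         # almost never in the top 1000
}
pop["Sparse"][100:103] = 0.0005             # appears for only three years

scores = {}
for name, v in pop.items():
    v = v / np.linalg.norm(v)               # normalize the popularity vector
    scores[name] = np.abs(np.diff(v)).max() # largest year-to-year change
```

The sparse name wins by a huge margin even though nothing interesting happened to it: normalization inflates its few nonzero entries, and the finite differences follow.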

Normalized popularity for names appearing in the top 1000 most popular names in at least 20 of the last 132 years.  Subsequent plots show the average over all names alongside a name that shows no dramatic burst of popularity (“Jesse”), and the average compared with a name which did suddenly become popular (“Darren”).  Note the different scales.

Continue reading