Given a list of actors and the director for a movie, how well can you predict the genre of the film? I thought this would be an interesting question to answer. Along the way, I’ll also explore which actors (or directors) are most important for this prediction. In some sense, I’m asking who is the quintessential typecast action movie star? What about for a comedy flick? Drama? Take a guess, and see if you agree with the results below.
To scope the problem, I decided to consider just three genres: action, comedy, and drama. The site IMDb contains all kinds of movie info, so I scraped this site for my data. Specifically, I extracted the top 1000 most popular movies (according to IMDb, at least) in each genre. I'd like to emphasize "according to IMDb," as Daredevil is currently number 7 on the list of action movies. I also have no idea how they determine genre. Are you in the mood for a drama? How about Fast & Furious with Vin Diesel and Paul Walker? To be fair, IMDb labeled this as action and drama, but such a classification still seems odd. Anyway, with each film, I also collected its director and three top actors. I then put my data into a matrix $X$ where $X_{ij} = 1$ if actor/director $j$ is in movie $i$ and $X_{ij} = 0$ otherwise. Similarly, for each genre I created a vector $y$ of indicator labels. For example, in the case of action movies, $y$ is such that $y_i = 1$ if movie $i$ is an action film and $y_i = 0$ otherwise. Of my original 3000 movies pulled from IMDb, several fall into multiple categories (think of the action-drama Fast & Furious, or the action-comedy Rush Hour if you want an example that makes more sense), which I combined into a single entry with multiple genre labels. I also removed any actors/directors who appeared in only one movie. All this left me with 2,333 movies and the surviving actors/directors as features.
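The matrix construction above can be sketched in a few lines of NumPy. The record format and names here are hypothetical placeholders, not the actual scraped data:

```python
# Sketch: building the indicator matrix X and a genre label vector.
# The record layout below is made up for illustration.
import numpy as np

movies = [
    {"title": "Fast & Furious",
     "people": ["Vin Diesel", "Paul Walker", "Rob Cohen"],
     "genres": {"action", "drama"}},
    {"title": "Rush Hour",
     "people": ["Jackie Chan", "Chris Tucker", "Brett Ratner"],
     "genres": {"action", "comedy"}},
]

# The feature set: every actor/director appearing in the data.
people = sorted({p for m in movies for p in m["people"]})
col = {p: j for j, p in enumerate(people)}

# X[i, j] = 1 if actor/director j is in movie i, 0 otherwise.
X = np.zeros((len(movies), len(people)), dtype=int)
for i, m in enumerate(movies):
    for p in m["people"]:
        X[i, col[p]] = 1

# One binary label vector per genre (here, action vs. not action).
y_action = np.array([1 if "action" in m["genres"] else 0 for m in movies])
```

A movie with multiple genres simply gets a 1 in more than one label vector, which is what makes this a multi-label problem.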
As a side note, I initially collected the top 4000 movies in each genre. I later discovered this was a waste of time, as obscure movie titles just added noise to my problem, or I ended up removing all the actors/directors in these films anyway while cleaning the data. Also, analyzing this larger data set without cleaning led to memory and computational complexity issues. These problems may be interesting to address another day, but they don't really need to be considered for this problem.
This is a multi-label classification problem as a movie may fall into multiple genres. As you may have guessed from how I set up my data, I’m taking a one-vs-rest approach of training three individual classifiers where each classifier is a binary decision on genre (action vs. not action for example). I decided to create each classifier as a random forest (RF); RFs tend to perform remarkably well with minimal assumptions on the data and are thus often a good default choice. Indeed, I trained and validated several other models, and RF won out. I’ll briefly describe the idea behind RFs. The original paper on RFs is here for anyone interested.
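A minimal sketch of the one-vs-rest setup in scikit-learn: one binary random forest per genre, each trained and queried independently. The data here is random placeholder; in the post, $X$ is the actor/director indicator matrix and each $y$ is a genre indicator vector:

```python
# One-vs-rest: three independent binary random forests, one per genre.
# X and the labels are random stand-ins for the real IMDb data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))  # movies x actors/directors
labels = {g: rng.integers(0, 2, size=200)
          for g in ("action", "comedy", "drama")}

forests = {}
for genre, y in labels.items():
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X, y)
    forests[genre] = clf

# Each classifier votes separately, so a movie can get several genres.
new_movie = X[:1]
predicted = [g for g, clf in forests.items()
             if clf.predict(new_movie)[0] == 1]
```

Because each forest decides its genre independently, a single movie can come back labeled action and drama at once, matching how IMDb tags films.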
RFs combine results from individual tree classifiers. A tree classifier is a greedy algorithm which seeks to iteratively split the data according to the feature (actor/director) that best explains the variance in genre. For illustration, let's stick with action movies. For a fixed actor/director $j$, consider the index subsets $S_0 = \{i : X_{ij} = 0\}$ and $S_1 = \{i : X_{ij} = 1\}$. We then measure how well these index subsets partition the genre labels. One way to accomplish this is to measure the weighted sum $\frac{|S_0|}{|S_0|+|S_1|} G(S_0) + \frac{|S_1|}{|S_0|+|S_1|} G(S_1)$, where $G$ is the Gini impurity. Gini impurity measures how often a uniformly chosen random element from a set, assigned a label according to the distribution of labels in the set, is incorrectly classified. Most importantly, it's small when a set is mostly "pure" (most labels are the same) and achieves its minimum of $0$ precisely when every label in the set is identical. We thus find the actor/director $j$ minimizing this quantity and split the data accordingly. This leaves us with the two sets $S_0$ and $S_1$. We grow the decision tree by continuing this process on each subset until every subset of labels has Gini impurity $0$ or there are no more features on which to split. Once the tree is grown, given a new indicator vector of actors/directors, prediction is performed simply by following a path through the tree according to the rules described by each split. The labels in the terminal node assign the genre, or give a probability for the genre if the labels there are not all the same. Such predictions can be crappy as this approach leads to serious issues with over-fitting; this is offset by growing a forest.
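The Gini impurity and the greedy choice of a single split can be written out directly. This is an illustration of the idea, not the implementation actually used for the genre models:

```python
# Sketch: Gini impurity and one greedy split over binary features.
import numpy as np

def gini(labels):
    """Probability that a random element of the set is mislabeled when
    labels are drawn from the set's own label distribution."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the feature whose 0/1 split minimizes the size-weighted
    Gini impurity of the two resulting label subsets."""
    n = len(y)
    best_j, best_score = None, np.inf
    for j in range(X.shape[1]):
        mask = X[:, j] == 1                      # S_1; ~mask is S_0
        score = (mask.sum() / n) * gini(y[mask]) \
              + ((~mask).sum() / n) * gini(y[~mask])
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score

# Feature 0 separates the labels perfectly, so it should win with
# weighted impurity 0.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
j, score = best_split(X, y)   # j == 0, score == 0.0
```

A full tree grower would recurse on the two index subsets until each is pure or the features are exhausted, exactly as described above.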
A RF trains multiple decision trees as described above, but each tree is trained on a set drawn at random with replacement from the original training set; this is called bootstrap aggregating. Also, at each split, a different random subset of the features is considered. The ensemble of classifiers is then combined into a single classifier by averaging. Intuitively, by bootstrap aggregating, the noise each tree fits during training averages out. By considering random subsets of features, you avoid making any single feature important in every tree, leading to less correlation between trees (it also makes computation more efficient). The ultimate result is that the RF classifier performs better than any individual tree. Finally, RFs also provide several measures of the importance of each actor/director in the classification. For example, every time we choose an actor/director to make a split, the Gini impurity of the output at this node is less than that of the input. We can record the change in Gini impurity; big changes reflect important features. Also note that actors/directors used to make a split early while growing a tree contribute to the prediction decision for a larger fraction of inputs than features near the bottom of the tree. The fraction of samples for which each actor/director contributes to the prediction, multiplied by the change in Gini impurity, is an estimate of its relative importance. By averaging over all trees in the forest, we get a decent estimate of the importance of each actor/director in our prediction.
For each of my three classifiers, I fit a RF with 200 trees. I uniformly chose 2000 movies with which to train my RF, leaving the remaining 333 to test how well the resulting model predicts genre. For each node in the tree, a random subset of 64 actors/directors is considered on which to split. Using the resulting models to predict the genres of the 333 movies left out of the training set, the successful prediction rates are as follows:
- Action: 73% successful prediction rate
- Drama: 68% successful prediction rate
- Comedy: 73% successful prediction rate
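The training setup above maps directly onto scikit-learn's `RandomForestClassifier`. The data here is a random placeholder with the same shape constraints (2,333 "movies", a 2000/333 split); the real $X$ and $y$ come from the scraped IMDb data:

```python
# Sketch of the training/evaluation setup, with placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(2333, 100))   # movies x actors/directors
y = rng.integers(0, 2, size=2333)          # one genre's indicator labels

# 2000 movies to train, 333 held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=2000, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,   # 200 trees
    max_features=64,    # 64 candidate actors/directors per split
    random_state=0,
).fit(X_tr, y_tr)

# Fraction of held-out movies whose genre label is predicted correctly.
accuracy = clf.score(X_te, y_te)
```

On random labels the accuracy hovers near chance, of course; the 68–73% rates above come from the real actor/director features.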
Seems respectable, especially considering IMDb's genre classification is sometimes a stretch. So, how important are different actors/directors to these predictions? For each classifier, I took the feature importance scores as described above and reordered the actors/directors in decreasing order of importance. Plots of the sorted importance scores looked similar for all three classifiers. For example, the plot below shows the top 150 actors/directors in the action movie classifier.
There are clearly several quintessential action movie actors/directors (according to my RF), and then importance levels off quickly. The top 10 important actors/directors for each classifier are below.
Most Important Features:
Action: Arnold Schwarzenegger, Bruce Willis, Jason Statham, Roland Emmerich, Paul Walker, Sylvester Stallone, Sean Connery, Chris Hemsworth, Samuel L. Jackson, Val Kilmer
Drama: Tom Hanks, Martin Scorsese, Demi Moore, Nicole Kidman, Kristen Stewart, Sean Penn, Wes Anderson, Dermot Mulroney, Adrian Lyne, Jake Gyllenhaal
Comedy: Adam Sandler, Ben Stiller, Anna Faris, Seth Rogen, Jennifer Aniston, Cameron Diaz, Steve Carell, Will Ferrell, Jeff Daniels
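Rankings like those above fall out of a single argsort over the importance scores once they're mapped back to names. The names and scores here are made up for illustration, not the model's actual values:

```python
# Sketch: mapping importance scores back to actor/director names.
# Scores below are invented placeholders.
import numpy as np

people = ["Arnold Schwarzenegger", "Bruce Willis",
          "Jason Statham", "Someone Else"]
importances = np.array([0.30, 0.25, 0.20, 0.01])

order = np.argsort(importances)[::-1]     # indices, highest score first
top = [people[j] for j in order[:3]]
```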
What about a movie starring Arnold Schwarzenegger, Tom Hanks, Adam Sandler, and directed by Martin Scorsese? Classifiers predict “comedy.” I’d watch that movie.