I Agree with You About Adventure and Sex, but You Don't Know Anything About Humor or Animals

Predicting User Ratings Based on User Agreement with Third-Party Raters on Individual Attributes

Introduction

An unsolved problem to which machine learning seems ideally suited is predicting whether a person will like a particular work of art, or which particular works of art will appeal to a particular person. Existing recommendation systems are often based on categories that are far too general (such as genre), on overly specific attributes (e.g. “foreign romantic dystopian science fiction comedies about pastry chefs with a strong female lead”), or on the nebulous logic of “you liked A; this other person liked A; this other person liked B; therefore you will like B”. We have a huge amount of information about movies, about what expert raters (professional critics) think of those movies, and about which movies users like, and yet we still cannot make good machine recommendations. Our goal is to integrate information about movies, critics, and users to create a better predictor of whether a particular person will enjoy a particular movie.


A fair amount of work has already been done on this problem, but as far as we know our approach has not been tried. Our program recommends movies based on the extent to which a user’s taste aligns with critics’ tastes on particular attributes. Eventually, such a system could be expanded to other media, such as music or the visual arts, which can be classified by attribute and have a healthy amount of mainstream criticism attached to them.


Techniques

We compared three techniques. Our experimental technique uses k-nearest-neighbor matching to predict user ratings based on the user’s agreement with critics about individual attributes. We compared this approach to two baselines: matching users with critics based only on overall movie ratings, and averaging a user’s own attribute preferences to predict a rating for a new film (using no critic ratings at all).

In the first set of conditions, we ran nearest-neighbor matching on each individual movie attribute to pair the user with the critics that user most agrees with on that particular attribute; we generated a predicted rating for a given film by averaging the ratings of the critics matched to the user on the attributes present in that film. In the second set of conditions, we took only overall agreement with critics into account, predicting that a user would rate a film the same as the critic most similar to that user across the films of the training dataset. In the third condition, we used only the user’s own ratings of attributes, not the extent to which the user agreed with critics about those attributes: the per-attribute matching in the first condition already required us to estimate how much each user likes movies with each particular attribute, so for a given film we simply averaged those preference scores across the attributes present in that film. In the first two cases we tried 1-nearest, 3-nearest, and 5-nearest neighbor, for a total of seven conditions. A sketch of the three prediction strategies appears below.
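To make the three conditions concrete, here is a minimal Python sketch. The data shapes (dictionaries of ratings and attribute sets), the 0–100 rating scale, and the use of mean absolute rating difference as the measure of “agreement” are our illustrative assumptions, not necessarily the exact implementation described in the detailed report.

```python
"""Sketch of the three prediction strategies (hypothetical data shapes).

Assumed representations:
  critic_ratings: {critic_name: {film: rating_0_to_100}}
  user_ratings:   {film: rating_0_to_100} for one user (training films only)
  film_attrs:     {film: set of attribute names}  # binary attribute markers
"""
from statistics import mean


def per_attribute_agreement(user_ratings, critic_films, attr, film_attrs):
    """Mean absolute rating difference on training films carrying `attr`
    (one plausible way to operationalize 'agreement on an attribute')."""
    diffs = [abs(user_ratings[f] - critic_films[f])
             for f in user_ratings
             if f in critic_films and attr in film_attrs.get(f, set())]
    return mean(diffs) if diffs else float("inf")


def predict_attribute_knn(film, user_ratings, critic_ratings, film_attrs, k=5):
    """Condition 1: match the user to the k most-agreeing critics per attribute,
    then average those critics' ratings of `film` across its attributes."""
    per_attr_predictions = []
    for attr in film_attrs[film]:
        ranked = sorted(
            critic_ratings,
            key=lambda c: per_attribute_agreement(
                user_ratings, critic_ratings[c], attr, film_attrs))
        neighbors = [c for c in ranked if film in critic_ratings[c]][:k]
        if neighbors:
            per_attr_predictions.append(
                mean(critic_ratings[c][film] for c in neighbors))
    # Assumes at least one matched critic has rated the target film.
    return mean(per_attr_predictions)


def predict_overall_knn(film, user_ratings, critic_ratings, k=1):
    """Condition 2: predict from the critic(s) most similar to the user overall."""
    def overall_distance(critic):
        shared = [f for f in user_ratings if f in critic_ratings[critic]]
        if not shared:
            return float("inf")
        return mean(abs(user_ratings[f] - critic_ratings[critic][f]) for f in shared)

    ranked = sorted(critic_ratings, key=overall_distance)
    neighbors = [c for c in ranked if film in critic_ratings[c]][:k]
    return mean(critic_ratings[c][film] for c in neighbors)


def predict_attribute_average(film, user_ratings, film_attrs):
    """Condition 3: average the user's own per-attribute preferences (no critics)."""
    prefs = []
    for attr in film_attrs[film]:
        liked = [user_ratings[f] for f in user_ratings
                 if attr in film_attrs.get(f, set())]
        if liked:
            prefs.append(mean(liked))
    return mean(prefs)
```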


Results

Not having access to an existing dataset with the proper level of detail, we built our own dataset of approximately 140 films, using binary markers to indicate whether each film exhibited each of 40 attributes. Attributes include when the film takes place; whether it contains sex or gore; whether a family is at the center of the story; whether it prominently features sports; general genre; and so on. We collected ratings for those films from a dozen critics, scraped from Metacritic.com. We then had several users rate all of the movies they had seen. Our validation data consisted of 10% of the user-rated films, randomly withheld from training. We measured error as the Manhattan distance (absolute difference) between our predicted rating and the user’s actual rating on each film; by averaging the error over all predicted ratings for all users, we obtained an average error for each technique.
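The evaluation loop can be sketched as follows, assuming the 10% holdout is drawn per user and that `predict(film, train_ratings)` wraps any one of the seven conditions (for example, a lambda or `functools.partial` closing over the critic ratings and attribute table from the sketch above). The function name and data shape are our illustration.

```python
"""Sketch of the holdout evaluation: withhold ~10% of each user's rated films,
predict them from the rest, and average the absolute (Manhattan) error."""
import random
from statistics import mean


def evaluate(users, predict, holdout_frac=0.10, seed=0):
    """users: {user_name: {film: rating}}; predict: callable(film, train_ratings)."""
    rng = random.Random(seed)
    errors = []
    for user, ratings in users.items():
        films = list(ratings)
        rng.shuffle(films)
        n_holdout = max(1, int(len(films) * holdout_frac))
        held_out, training = films[:n_holdout], films[n_holdout:]
        train_ratings = {f: ratings[f] for f in training}
        for film in held_out:
            predicted = predict(film, train_ratings)
            errors.append(abs(predicted - ratings[film]))
    return mean(errors)  # average error for the technique being evaluated
```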

Figure 1: Average Error by Experimental Technique (lower error = better performance)


Figure 2: Average Error by Experimental Technique, by User


Discussion

None of our techniques significantly outperformed the others; all averaged approximately 43 points of discrepancy between actual and predicted ratings. There was a slight general trend of nearest-neighbor matching performing better with more neighbors, and our experimental condition with 5-nearest neighbor outperformed all other conditions (though the differences were not significant). Figure 2 illustrates an interesting trend: users are arranged, left to right, in decreasing order of how much data they provided, and users who provided more training data received more accurate recommendations. We therefore believe that expanding our training dataset, as well as adding more movie attributes and many more critics, would significantly improve our prediction accuracy.





Link to Detailed Report


Team


Joe Blass

Vijay Murganoor



joeblass@u.northwestern.edu

vijaym123@u.northwestern.edu




Special Thanks to

  • Prof. Doug Downey
  • Chandra Sekhar Bhagavatula
  • Shengxin Zha
  • Kathy Lee