There’s more art available now than ever before. In 2010, Google Books estimated that there were around 130 million published books. As of March 2022, IMDb’s databases contained 605,284 movies and 222,655 TV series. As of January 2020, MyAnimeList tracked 23,744 anime and 62,056 manga.
The sheer volume makes it hard to find works you’ll actually enjoy. You could:
Each of these approaches forces you—or someone else—to do a lot of work before you find the art you like. I built a transparent recommendation system that avoids these problems. Let’s see how it works.
Rather than collect traditional ratings (e.g. 5-star or 10-point scales), I collect pairwise comparisons and score them with a Bradley-Terry based model (via OpenSkill). Traditional rating systems have problems such as rating inflation and discretization (which inherently discards information), both of which Bradley-Terry based models avoid.
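A minimal sketch of how one comparison updates ratings, assuming openskill.py’s v5 API (BradleyTerryFull is one of its Bradley-Terry based models; the item names are placeholders):

```python
from openskill.models import BradleyTerryFull

model = BradleyTerryFull()
a = model.rating(name="Item A")
b = model.rating(name="Item B")

# The preferred item's team is listed first; a tie would pass ranks=[1, 1].
[[a], [b]] = model.rate([[a], [b]])
print(a.mu, a.sigma)  # the winner's mean rises and its uncertainty shrinks
```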
To select item pairs for comparison:
The user chooses which item they prefer or declares a tie. After each comparison, I rescale all ratings to a mean of 50 and a standard deviation of 50/3 to prevent variance drift.
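The rescaling is a simple affine transform of the rating means (a sketch; whether the uncertainties are rescaled as well is omitted here):

```python
import numpy as np

def rescale_ratings(mus, target_mean=50.0, target_std=50.0 / 3):
    """Affinely rescale all rating means to the target mean and std."""
    mus = np.asarray(mus, dtype=float)
    return target_mean + target_std * (mus - mus.mean()) / mus.std()
```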
I obtained the data for my movie embeddings from the MovieLens 25M Dataset. To ensure sufficient popularity, I only considered movies with at least 100 ratings. A useful feature of the MovieLens dataset is that they’ve created tag genomes—a set of 1128 (continuous!) relevance scores (0 to 1) per movie, where each score represents how strongly a tag—like “sci-fi,” “dark humor,” or “nonlinear plot”—applies to that film. I compressed these tags into 159 dimensions via PCA with varimax rotation. Originally, I normalized the embeddings, but I now regard this as a mistake. Normalizing embeddings to unit length (L2 norm) distorts relative feature importance; standardizing (mean=0, std=1) per dimension preserves information.
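A sketch of that pipeline (scikit-learn’s PCA has no built-in rotation, so this uses the standard varimax algorithm; variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Standard varimax rotation of a loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        if s.sum() < var * (1 + tol):
            break
        var = s.sum()
    return loadings @ R

# genome: (n_movies, 1128) tag-relevance matrix (illustrative variable name)
X = genome - genome.mean(axis=0)
pca = PCA(n_components=159).fit(X)
rotated = varimax(pca.components_.T)       # rotate the loadings
emb = X @ rotated                          # project movies onto rotated axes
emb = (emb - emb.mean(0)) / emb.std(0)     # standardize per dimension, not L2-normalize
```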
I obtained the data for my book embeddings from Goodreads Book Graph Datasets. A peculiar feature of Goodreads is their user-generated “shelves”, which users can use to organize/classify books. In practice, they function much like tags, so that is how I will refer to them. To ensure sufficient popularity, I only considered books with at least one 5-star rating that had also been added at least 100 times to their 2nd most popular tag. For the 4000 most popular tags, I created five relevance metrics per book-tag pair, based on what MovieLens used to create tag genomes:
After assigning these scores to each tag-book pair, I performed PCA tag-wise and used the first principal component’s score as the tag value for each book. I then compressed these tag values into 387 dimensions via PCA with varimax rotation. As with the movie embeddings, I originally normalized these embeddings, which I now regard as a mistake for the same reason: unit-length (L2) normalization distorts relative feature importance, while per-dimension standardization (mean=0, std=1) preserves information.
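A sketch of the per-tag step (the helper name is mine, and standardizing the five metrics before PCA is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def tag_value(metrics):
    """metrics: (n_books, 5) matrix of the five metrics for one tag.
    Returns the first principal component score for each book."""
    X = StandardScaler().fit_transform(metrics)
    return PCA(n_components=1).fit_transform(X).ravel()
```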
I obtained the data for my anime embeddings from Anime Recommendation Database 2020. This dataset had no fine-grained tag data—there are only 42 categories that either apply to an anime or don’t—so I had to use a matrix factorization model to create the embeddings. The model predicts $\hat{r}_{ui}$ as $\mu + b_i + b_u + q_i^Tp_u$, where $\mu$ is the global average, $b_i$ is the item bias, $b_u$ is the user bias, $q_i$ represents the item factors, and $p_u$ represents the user factors. The model minimizes the following objective:
$ \sum_{(u,i)\in\kappa} (r_{ui} - \mu - b_u - b_i - p_u^Tq_i)^2 + \lambda_{p_u}\lVert p_u \rVert^2 + \lambda_{q_i}\lVert q_i \rVert^2 + \lambda_{b_u}(b_u^2) + \lambda_{b_i}(b_i^2)$
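A sketch of fitting this objective with plain SGD (the optimizer, learning rate, and initialization here are illustrative; the text doesn’t pin them down):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=64, lr=0.005, epochs=20,
             lam_p=0.05, lam_q=0.05, lam_bu=0.02, lam_bi=0.02):
    """ratings: list of (user, item, rating) triples."""
    mu = np.mean([r for _, _, r in ratings])   # global average
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    rng = np.random.default_rng(0)
    P = rng.normal(0, 0.1, (n_users, k))       # user factors
    Q = rng.normal(0, 0.1, (n_items, k))       # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - lam_bu * b_u[u])
            b_i[i] += lr * (err - lam_bi * b_i[i])
            # Update both factor vectors from their pre-update values.
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - lam_p * P[u]),
                          Q[i] + lr * (err * P[u] - lam_q * Q[i]))
    return mu, b_u, b_i, P, Q
```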
Hyperparameters were tuned via random search². After training the model, I concatenated the item biases with the item factors. I then standardized the resulting embeddings column-wise, applied a square-root transform to the factors to reduce skew, and restandardized.
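That post-processing, roughly (the signed square root is an assumption, since the standardized columns contain negative values):

```python
import numpy as np

def postprocess(b_i, Q):
    """Concatenate item biases with item factors, then fix column scale and skew."""
    E = np.column_stack([b_i, Q])
    E = (E - E.mean(0)) / E.std(0)        # standardize column-wise
    E = np.sign(E) * np.sqrt(np.abs(E))   # signed square root to reduce skew
    return (E - E.mean(0)) / E.std(0)     # restandardize
```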
For predictions, I use the scikit-learn Python library: ElasticNetCV selects the important (nonzero coefficient) features, and RidgeCV then predicts ratings from the selected features.
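Roughly, the two-stage fit looks like this (the CV grids shown are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, RidgeCV

def fit_user_model(X, y):
    """X: item embeddings (n_items, n_dims); y: the user's ratings."""
    # Stage 1: ElasticNet picks the features with nonzero coefficients.
    enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
    selected = np.flatnonzero(enet.coef_)
    # Stage 2: ridge regression on the selected features only
    # (assumes at least one feature survives selection).
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[:, selected], y)
    return selected, ridge
```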
To solve the cold start problem—the difficulty of providing accurate recommendations when a user has limited data—we can incorporate prior knowledge about what the coefficient vectors (which represent user preferences) typically look like.
In a sense, ridge regression already does this: it assumes the coefficient vector follows a multivariate normal prior with zero mean, independent components, and the same standard deviation in every dimension. However, we can improve upon this by using a generalization of ridge regression that allows our prior to be any multivariate normal distribution.
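Concretely, with a prior $\beta \sim \mathcal{N}(m, \Sigma)$ the MAP estimate has a closed form. A minimal sketch (the noise variance is treated as a tunable scalar):

```python
import numpy as np

def generalized_ridge(X, y, m, Sigma, noise_var=1.0):
    """MAP estimate of beta under beta ~ N(m, Sigma), y ~ N(X @ beta, noise_var * I).
    Ordinary ridge is the special case m = 0, Sigma = (1 / alpha) * I."""
    S_inv = np.linalg.inv(Sigma)
    A = X.T @ X / noise_var + S_inv
    b = X.T @ y / noise_var + S_inv @ m
    return np.linalg.solve(A, b)
```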
For each user with enough ratings (more than the dimensionality of the embeddings), we perform the following steps:
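These steps produce an empirical prior over coefficient vectors. A minimal sketch of that pooling, under the assumption that the prior mean and covariance are estimated directly from the per-user coefficient vectors:

```python
import numpy as np

def fit_prior(coefs):
    """coefs: (n_users, n_dims) coefficient vectors from data-rich users.
    Pools them into an empirical Gaussian prior."""
    coefs = np.asarray(coefs)
    return coefs.mean(axis=0), np.cov(coefs, rowvar=False)
```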
For the movies dataset, my approach worked well. For example, the “Award-Winning” dimension had a large positive prior mean, while the “Awful” dimension had a large negative one. They also had a strong negative prior correlation with each other, whereas dimensions like “Science Fiction” and “Based on Comic” showed a moderate positive correlation. The prior gives good recommendations, with top choices like Parasite (2019), Won’t You Be My Neighbor? (2018), and The Shawshank Redemption (1994).
For the anime dataset, my approach also performed well, though the results were less interesting. Aside from the first dimension—which corresponds to item biases (essentially how “good” an anime is) and had a positive prior mean—the other prior means were close to zero, as expected given the regularization applied to the item factors. The correlations were also near zero, which was unsurprising. The prior gives good recommendations, with top choices like Fullmetal Alchemist: Brotherhood, Steins;Gate, and Your Name.
For the books dataset, however, my approach failed completely. All prior means were extremely close to zero—even the “Favorites” dimension, which I expected to be strongly positive, was slightly negative. The correlations were also a mess, with none of the expected patterns emerging. The prior gives odd recommendations, with top choices like Deception Point, The Time Traveler’s Wife, and Requiem for the American Dream: The 10 Principles of Concentration of Wealth & Power. I suspect this reflects a deeper issue with Goodreads’s rating system.
1. I measure expected uncertainty reduction as the expected decrease in the probability of a draw before vs. after the comparison.
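A sketch of one term of that expectation, assuming openskill.py’s predict_draw:

```python
from openskill.models import BradleyTerryFull

model = BradleyTerryFull()
a, b = model.rating(), model.rating()

p_before = model.predict_draw([[a], [b]])
[[a2], [b2]] = model.rate([[a], [b]])        # hypothetical outcome: first item wins
p_after = model.predict_draw([[a2], [b2]])
reduction = p_before - p_after               # weight by outcome probability and sum
```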
2. Matrix Factorization Hyperparameters: