Matrix Factorization with Side Info

David Cortes

This vignette illustrates the usage of the cmfrec library for building recommender systems based on collaborative filtering models for explicit-feedback data, with or without side information about the users and items. Note that the library offers also content-based models and implicit-feedback models, but they are not showcased in this vignette.

This example will use the MovieLens100k data, as bundled in the recommenderlab package, which contains around ~ 100k movie ratings from 943 users about 1664 movies, in a scale from 1 to 5.

In addition to the ratings, it also contains side information about the movies (genre, year of release) and about the users (age, occupation), which will be used here to construct a better recommendation model.

For a more comprehensive introduction see also the cmfrec Python Notebook, which uses the more richer MovieLens1M instead (not provided by R packages).

Matrix Factorization

One of the most popular techniques for building recommender systems is to frame the problem as matrix completion, in which a large sparse matrix is built containing the ratings that users give to products (in this case, movies), with rows representing users, columns representing items, and entries corresponding to the ratings that they’ve given (e.g. “5 stars”). Most of these entries will be missing, as each users is likely to consume only a handful of the available products (thus, the matrix is sparse), and the goal is to construct a model which would be able to predict the value of the known interactions (i.e. predict which rating would each user give to each movie), which is compared against the observed values. The items to recommend to each user are then the ones with highest predicted values among those which the user has not yet consumed.

Typically, the problem is approached by trying to approximate the interactions matrix as the product of two lower-dimension matrices (a.k.a. latent factor matrices), which when multiplied by each other would produce something that resembles the original matrix, having the nice property that it will produce predictions for all user-item combinations - i.e.

\[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T \] Where:

For a better and more stable model, the \(\mathbf{X}\) matrix is typically centered by substracting its mean, a bias/intercept is added for each user and item, and a regularization penalty is applied to the model matrices and biases (typically on the L2 norm) - i.e.:

\[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B \] Where:

The matrices are typically fitted by initializing them to random numbers and then iteratively updating them in a way that decreases the reconstruction error with respect to the observed entries in \(\mathbf{X}\), using either gradient-based procedures (e.g. stochastic gradient descent) or the ALS (alternating least-squares) method, which optimizes one matrix at a time while leaving the other fixed, performing a few sweeps until convergence.

This library (cmfrec) will by default use the ALS method with L2 regularization, and will use user/item biases which are model parameters (updated at each iteration) rather than being pre-estimated.

Loading the data

The MovieLens100k data is taken from the recommenderlab package. As the data is sparse, it is represented as sparse matrices from the Matrix package. The data comes in CSC format, whereas cmfrec requires COO/triplets format - the conversion is handled by the MatrixExtra package for convenience, which also provides extra slicing functionality that will be used later.

library(cmfrec)
library(Matrix)
library(MatrixExtra)
library(recommenderlab)

data("MovieLense")
X <- as.coo.matrix(MovieLense@data)
str(X)
#> Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#>   ..@ i       : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
#>   ..@ j       : int [1:99392] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..@ Dim     : int [1:2] 943 1664
#>   ..@ Dimnames:List of 2
#>   .. ..$ : chr [1:943] "1" "2" "3" "4" ...
#>   .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
#>   ..@ x       : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
#>   ..@ factors : list()
Creating a train-test split

In order to evaluate models, 25% of the data will be set as a test set, while the model will be built with the remainder 75%. The split done here is random, but usually time-based splits tend to reflect more realistic scenarios for recommendation.

Typically, these splits are done in such a way that the test set contains only users and items which are in the train set, but such a rule is not necessary and perhaps not even desirable for cmfrec, since it can accomodate global/user/item biases and thus it can make predictions based on them alone.

Classical model

Now fitting the classical matrix factorization model, with global mean centering, user/item biases, L2 regularization which scales with the number of ratings for each user/item, and no side information. This is the model explained in the earlier section: \[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B \]

model.classic <- CMF(X_train, k=25, lambda=0.1, scale_lam=TRUE, verbose=FALSE)

How good is it?

The most typical way of evaluating the quality of these models is by evaluating the error that they have at predicting known entries, which here will be evaluated against the test data that was set apart earlier. The evaluation here will be done in terms of mean squared error (RMSE).

Note that, while widely used in the early literature for recommender systems, RMSE might not provide a good overview of the ranking of items (which is what matters for recommendations), and it’s recommended to also evaluate other metrics such as NDCG@K, P@K, correlations, etc.

i.e. it means that the ratings are off by about one star. This is better than a non-personalized model that would always predict the same rating for each user, which can also be simulated through cmfrec:

(Note: it’s not recommended to use scaled/dynamic regularization in a most-popular model, as it will tend to recommend items with only one user giving the maximum rating.)

Improving the classical model

By default, ALS-based models are broken down to small problems involving linear systems, which are in turned solved through the Conjugate Gradient method, but cmfrec can also use a Cholesky solver for them, which is slower but tends to result in better-quality solutions for explicit-feedback.

As well, the default number of iterations is 10, but can be increased for better models at the expense of longer fitting times.

But more importantly, cmfrec offers the option of adding “implicit-features” or co-factoring, which will additionally factorize binarized versions of \(\mathbf{X}\) (telling whether each entry is missing or not), sharing the same latent components with the factorization of \(\mathbf{X}\) - that is: \[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B \] \[ \mathbf{I}_x \approx \mathbf{A} \mathbf{B}^T_i \:\:\:\: \mathbf{I}^T_x \approx \mathbf{B} \mathbf{A}^T_i \] Where:

#> RMSE for improved classic model is: 0.9128

Adding side information

Collective matrix factorization extends the classical model by incorporating side information about users/items into the formula, which is done by also factorizing the side information matrices, sharing the same latent components that are used for factorizing the \(\mathbf{X}\) matrix: \[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T + \mu + \mathbf{b}_A + \mathbf{b}_B \] \[ \mathbf{U} \approx \mathbf{A} \mathbf{C}^T + \mu_U \] \[ \mathbf{I} \approx \mathbf{B} \mathbf{D}^T + \mu_I \] \[ \mathbf{I}_x \approx \mathbf{A} \mathbf{B}^T_i \:\:\:\: \mathbf{I}^T_x \approx \mathbf{B} \mathbf{A}^T_i \] Where:

Informally, the latent factors now need to explain both the interactions data as well as the side information, thereby making them generalize better to unseen data. This library in addition allows controlling aspects such as the weight that each factorization has in the optimization objective, different regularization for each matrix, having factors that are not shared, among others.


Fetching the side information from recommenderlab:

U <- MovieLenseUser
U$id      <- NULL
U$zipcode <- NULL
U$age2    <- U$age^2
### Note that `cmfrec` does not standardize features beyond mean centering
U$age     <- (U$age - mean(U$age)) / sd(U$age)
U$age2    <- (U$age2 - mean(U$age2)) / sd(U$age2)
U <- model.matrix(~.-1, data=U)

I <- MovieLenseMeta
I$title <- NULL
I$url   <- NULL
I$year  <- ifelse(is.na(I$year), median(I$year, na.rm=TRUE), I$year)
I$year2 <- I$year^2
I$year  <- (I$year - mean(I$year)) / sd(I$year)
I$year2 <- (I$year2 - mean(I$year2)) / sd(I$year2)
I <- as.coo.matrix(I)

cat(dim(U), "\n")
#> 943 24
cat(dim(I), "\n")
#> 1664 21

Now fitting the model:

model.w.sideinfo <- CMF(X_train, U=U, I=I, NA_as_zero_item=TRUE,
                        k=25, lambda=0.1, scale_lam=TRUE,
                        niter=30, use_cg=FALSE, include_all_X=FALSE,
                        w_main=0.75, w_user=0.5, w_item=0.5, w_implicit=0.5,
                        verbose=FALSE)
pred_side_info <- predict(model.w.sideinfo, X_test)
print_rmse(X_test, pred_side_info, "model with side info")
#> RMSE for model with side info is: 0.9099

Summary:

library(kableExtra)

calc_rmse <- function(X_test, X_hat) {
    return(sqrt(mean( (X_test@x - X_hat@x)^2 )))
}
results <- data.frame(
    NonPersonalized = calc_rmse(X_test, pred_baseline),
    ClassicalModel = calc_rmse(X_test, pred_classic),
    ClassicPlusImplicit = calc_rmse(X_test, pred_improved),
    CollectiveModel = calc_rmse(X_test, pred_side_info)
)
results <- as.data.frame(t(results))
names(results) <- "RMSE"
results %>%
    kable() %>%
    kable_styling()
RMSE
NonPersonalized 0.9460112
ClassicalModel 0.9242721
ClassicPlusImplicit 0.9128036
CollectiveModel 0.9098834

Important to keep in mind:

Generating Top-N recommendations

The goal behind building a collaborative filtering model is typically to be able to make top-N recommended lists for users or to obtain latent factors for an unseen user given its current data. cmfrec has many prediction functions for these purposes depending on what specifically one wants to do, supporting both warm-start and cold-start recommendations.

### Re-fitting the earlier model to all the data,
### this time *without* scaled regularization
model.classic <- CMF(X, k=20, lambda=10, scale_lam=FALSE, verbose=FALSE)
model.w.sideinfo <- CMF(X, U=U, I=I, k=20, lambda=10, scale_lam=FALSE,
                        w_main=0.75, w_user=0.125, w_item=0.125,
                        verbose=FALSE)

Recommendations for existing users

When fitting a model, all the necessary fitted matrices are saved inside the object itself, which allows making predictions for existing users based just on the ID. The specific items consumed by each user are however not saved, so in order to avoid recommending already-seen items, these have to be explicitly passed for exclusion.

Recommendations for new users

The fitted model, as it is, can only provide recommendations for the specific users and items to which it was fit. Typically, one wants to produce recommendations for new users as they go, or update the recommended lists for existing users once they consume more items. cmfrec allows obtaining latent factors and top-N recommended lists for new users without having to refit the whole model.

This is how it would be if user 10 were to come as a new visitor:

It is not mandatory to provide all the side information, as the ratings alone can also be used to generate a recommendation, even if the model was fit with side information (this would not be the case if passing NA_as_zero_user=TRUE):

(In this case, the top-5 recommendations did not change, as the side information has little effect in this particular model, but that might not always be the case - that is, the top-N recommended items for a different user might be different if side information is absent.)

Cold-start recommendations

Conversely, it is also possible to make a recommendation based on the side information without having any rated movies/items. The quality of these recommendations is however highly dependant on the influence that the attributes have in the model, and in this case, the user attributes have very little associated information and thus little leverage.

Nevertheless, they might still provide an improvement over a completely non-personalized recommendation (see Cold-start recommendations in Collective Matrix Factorization):