Advanced Recommender Systems

Contents: 2-Norm & F-norm · Missing as negative & Missing as random · Stochastic Gradient Descent · Rank · BPR: Bayesian Personalized Ranking · Matrix Factorization · SVD++ · Adapting SVD++ · Matrix formulation · Asymmetric SVD · Hybrid RS · S-SLIM: Side Information · Context Aware RS · Factorization Machines

SLIM (Sparse Linear Methods) techniques: a machine learning approach to ITEM-BASED COLLABORATIVE FILTERING.

Starting from IBCF

The estimation formula can be generalized as the dot product between the URM (user rating matrix, numUsers×numItems) and the ISM (item similarity matrix, numItems×numItems), which returns the matrix of estimated ratings for all users and all items.

Minimize the estimation error

KEY IDEA: machine learning method — write a formula that estimates the error, and minimize the error of the model by adjusting the unknown parameters.

In RS, the error is how different our estimation is from the actual ratings of the users, and the unknown parameters are the elements of the similarity matrix.

Here we use the MSE (Mean Squared Error). We look for the item similarity matrix that minimizes this MSE, thus giving the best performance when estimating ratings.

Optimization problem setting

Consider the MSE. If we have

  1. R: URM
  2. S: Item similarity matrix

And we would like to minimize ‖R − R·S‖. If S is the identity matrix, this norm would be 0, but the identity predicts each item only from itself and is useless for recommendation. To force the model to learn something meaningful, we add the constraint that all diagonal elements of S are 0.
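
The setting above can be sketched in a few lines of numpy on a toy URM (the matrix values here are made up for illustration): the identity trivially zeroes the error, which is why the zero-diagonal constraint is needed.

```python
import numpy as np

# Toy URM: 3 users x 4 items (0 = missing rating).
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# With S equal to the identity, R @ S reproduces R exactly, so the
# error ||R - R @ S|| is trivially zero -- a useless "solution".
S_identity = np.eye(4)
error_identity = np.linalg.norm(R - R @ S_identity)

# A learned similarity matrix must therefore have a zero diagonal,
# so each rating is estimated only from the *other* items.
rng = np.random.default_rng(0)
S = rng.random((4, 4))
np.fill_diagonal(S, 0.0)
R_hat = R @ S  # estimated ratings for all users and all items
```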

2-Norm & F-norm

F-norm

If you measure the quality of your recommendations with an accuracy metric, experience tells us that the F-norm leads to better quality recommendations. If, on the contrary, you measure quality with an error metric, such as the Mean Squared Error, it is better to use the 2-norm.

My understanding here: the 2-norm computes the MSE over the non-zero elements of the URM, while the F-norm computes the MSE over all elements of the URM.
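
A small numpy sketch of that interpretation (the toy matrices are made up): one error counts only observed entries, the other counts every cell.

```python
import numpy as np

R = np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
])
R_hat = np.array([
    [4.0, 1.0, 3.0],
    [1.0, 4.0, 2.0],
])

# "2-norm" style error: only the observed (non-zero) entries count.
mask = R != 0
mse_observed = np.mean((R[mask] - R_hat[mask]) ** 2)

# "F-norm" style error: every entry counts, so missing ratings are
# implicitly treated as zeros (the missing-as-negative assumption).
mse_all = np.mean((R - R_hat) ** 2)
```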

Missing as negative & Missing as random

F-norm

The F-norm assumes that, if an entry of the URM is missing, the rating is zero. The F-norm thus pushes the model to estimate the zeros in the URM as zeros, which is the missing-as-negative assumption.

Quality Metrics for ML

Typically in ML we minimize an error function, but we evaluate the recommendations with accuracy metrics like precision, recall, fallout and so on.

Overfitting and Regularization of the model

One way to limit overfitting is to reduce the effective number of unknown parameters, and we can add regularization terms to the error function to achieve this result.

The regularization terms: generally we minimize a function containing the sum of three terms: 1) the real error of our model, 2) the 1-norm of S and 3) the 2-norm of S. By this we can control the sparsity of the matrix.

Stochastic Gradient Descent

SGD is a general technique for solving optimization problems.

SGD problem & learning rate: SGD may get stuck in local minima; we can increase the learning rate to escape them, but a large learning rate brings other problems, such as failing to converge.
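
A minimal sketch of the learning-rate trade-off (the function and rates are made up for illustration; the stochastic sampling part is omitted): on a simple quadratic, a small step size converges while a too-large one overshoots and diverges.

```python
def gd_minimize(grad, x0, lr, steps=100):
    """Plain gradient descent: repeatedly apply x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
grad = lambda x: 2.0 * (x - 3.0)

x_small_lr = gd_minimize(grad, x0=0.0, lr=0.1)  # converges close to 3
x_large_lr = gd_minimize(grad, x0=0.0, lr=1.1)  # overshoots: each step amplifies the error
```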

Rank

The learning to rank approach

Choosing an error function turns into a more general question: can we find an error function, i.e. solve a machine learning problem, that optimizes the ranking instead of the ratings?

Suppose there is a list of items ranked by the users and a list of items ranked by the RS. Our goal is to produce a list of items whose order matches the one defined by the users.

Ranking error metrics: list-wise metric

We could compare the two lists as a whole, over all positions at once. (This is complex because it is difficult to compute the gradient.)

Ranking error metrics: point-wise metric

We measure exactly the position of an item in the users' list and in the RS's list, and how much these differ. (The gradient is also hard to compute here.)

Ranking error metrics: pair-wise metric

Take two items and compare whether their relative order is the same in the two lists. With this error function, optimization becomes easy.
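
A small sketch of a pair-wise metric (function name and toy lists are my own): count the fraction of item pairs whose relative order agrees between the users' list and the RS's list.

```python
from itertools import combinations

def pairwise_agreement(user_ranking, system_ranking):
    """Fraction of item pairs whose relative order matches in both lists."""
    pos_user = {item: i for i, item in enumerate(user_ranking)}
    pos_sys = {item: i for i, item in enumerate(system_ranking)}
    pairs = list(combinations(user_ranking, 2))
    concordant = sum(
        1 for a, b in pairs
        if (pos_user[a] < pos_user[b]) == (pos_sys[a] < pos_sys[b])
    )
    return concordant / len(pairs)

user_list = ["A", "B", "C", "D"]
system_list = ["A", "C", "B", "D"]   # B and C are swapped
score = pairwise_agreement(user_list, system_list)  # 5 of 6 pairs agree
```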

BPR: Bayesian Personalized Ranking

Definition

BPR is the most famous technique based on pair-wise error metrics; it usually works with SGD.

SGD: normally, when you want to compute the gradient of your error function, you compute it for all the users and all the items in your dataset — a big matrix — and then use it to perform the descent step (or the ascent step, depending on whether you are minimizing or maximizing). Stochastic Gradient Descent is a trick to reduce the computational complexity of each single step and also the risk of overfitting. The idea: randomly select one user and one item, compute the gradient only for them, update the parameters, then randomly pick another user and another item, and so on.

BPR with implicit ratings

We randomly take one user and two items of this user. The two items are item i, which the user likes for sure, and item j, which the user does not like.

This is easy with explicit ratings, but with implicit ratings we are not so sure: we just select unseen items. This raises the question of why we can safely bet that the user does not like a randomly selected item they never clicked. It works because, if you randomly pick an item the user didn't click, the probability that the user actually likes it is very low. So, on an implicit dataset, item i is an item the user interacted with and item j is an item the user didn't.

BPR error function

We want the estimated rating of i (liked item) to be much bigger than the estimated rating of j (uninteresting item).

We would like to maximize this difference. The absolute value of the difference does not work well; instead we take the signed difference, which can be positive or negative, and pass it through a sigmoid function to restrict the error to [0, 1].
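
A minimal BPR-style sketch on random latent factors (dimensions, learning rate and update rule are illustrative assumptions, using the common matrix-factorization scoring): each SGD step increases sigma(x_ui − x_uj), pushing the liked item's score above the other's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 8, 3
X = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Y = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors

def bpr_step(u, i, j, lr=0.05):
    """One SGD ascent step on ln sigma(x_ui - x_uj)."""
    x_uij = X[u] @ (Y[i] - Y[j])        # score difference between i and j
    g = 1.0 - sigmoid(x_uij)            # gradient weight: small once i >> j
    xu, yi, yj = X[u].copy(), Y[i].copy(), Y[j].copy()
    X[u] += lr * g * (yi - yj)
    Y[i] += lr * g * xu
    Y[j] -= lr * g * xu

# Repeated updates push the score of the preferred item i above item j.
for _ in range(200):
    bpr_step(u=0, i=1, j=2)

score_i = X[0] @ Y[1]
score_j = X[0] @ Y[2]
```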

Explicit rating & Implicit rating

BPR works well with both, but better with implicit ratings. The problem of explicit ratings is the user bias.

In explicit ratings, we select item i among the high-rated items and item j among the low-rated ones; we can also select item j among the un-rated items. This lets BPR work with explicit ratings as well.

Matrix Factorization

MF is a family of collaborative filtering techniques. These techniques are useful when the URM is sparse, and MF usually works better than item-based or user-based collaborative filtering.

KEY IDEA: for each user we have a vector x which describes their preference for each attribute k of an item (the dimension of this vector is smaller than the number of items!), and for each item we have a vector y that describes it.

Funk SVD

We can estimate a rating as the dot product of the vectors x and y. From a matrix point of view, we are decomposing the URM into the matrices X and Y.

Latent factors

The dimensions of X and Y depend on the designer of the system: X is a numUsers×N matrix and Y is N×numItems, where N is the number of latent factors.

We usually take N between 10 and 200. If N is too small, the model is not useful at all; if N is too big, X and Y have too many parameters and the model overfits.

Avoid Overfitting: add regularization terms for X and Y on the error function.
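
The whole Funk SVD recipe can be sketched on a toy URM (matrix values, learning rate and regularization weight are illustrative assumptions): train X and Y by SGD on the non-zero entries only, with L2 regularization on both factor matrices.

```python
import numpy as np

rng = np.random.default_rng(42)
R = np.array([            # toy URM, 0 = missing rating
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
])
n_users, n_items = R.shape
N = 2                     # number of latent factors
lam, lr = 0.02, 0.01      # regularization weight and learning rate

X = rng.normal(scale=0.1, size=(n_users, N))  # user factors (numUsers x N)
Y = rng.normal(scale=0.1, size=(N, n_items))  # item factors (N x numItems)

users, items = np.nonzero(R)  # train only on the observed ratings
for epoch in range(1000):
    for u, i in zip(users, items):
        err = R[u, i] - X[u] @ Y[:, i]
        xu = X[u].copy()
        # Gradient step with L2 regularization on both factor matrices.
        X[u] += lr * (err * Y[:, i] - lam * xu)
        Y[:, i] += lr * (err * xu - lam * Y[:, i])

rmse = np.sqrt(np.mean(
    [(R[u, i] - X[u] @ Y[:, i]) ** 2 for u, i in zip(users, items)]
))
```

After training, `X @ Y` also fills in the missing cells, which is what makes the factorization useful for recommendation.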

SVD++

We now describe an improved version of Funk SVD that works for URMs with explicit ratings. We need to normalize the URM; the model contains two parts, the global effects and Funk SVD.

It takes user bias and item bias into consideration.

Assumption on zero elements

The classical formulation of SVD++ minimizes the error function on the non-zero elements.

If the goal is to estimate the real value of the users' ratings, this is a good approach. If instead the goal is a top-N recommendation with good precision, missing as negative is better.

Adapting SVD++

In order to avoid recomputing the estimated ratings from scratch for each new user, we need to improve SVD++.

For a new user, we would like to calculate the new latent factors, which tell us how much this user likes latent attribute k.

We want to estimate how much user u likes latent feature k based on the ratings he gave to other items. We need something that tells us whether attribute k is present in item j: Y_{k,j} carries exactly this information. Therefore, we can approximate X_{u,k} as the dot product of the user's rating vector and the latent factors of the corresponding items.

Matrix formulation

We obtain the matrix of estimated ratings R as the product of 2 terms.

  1. The URM, R
  2. The transpose of matrix Y times matrix Y (Y has shape numLatentFactors×numItems), so this product has shape numItems×numItems

This is quite similar to IBCF, and the parameters in Y limit the overfitting problem.
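
Both views can be checked numerically on random toy data (shapes and values are made up): the per-user approximation x_u ≈ r_u·Yᵀ, and the matrix form R̂ = R·(YᵀY), where YᵀY plays the role of the item similarity matrix in IBCF.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, N = 4, 6, 3
R = rng.integers(0, 6, size=(n_users, n_items)).astype(float)  # toy URM
Y = rng.normal(size=(N, n_items))   # item latent factors (N x numItems)

# Per-user view: x_u ~ r_u . Y^T says how much user u likes each latent factor.
x_u = R[0] @ Y.T                    # shape (N,)
r_hat_u = x_u @ Y                   # estimated ratings for user 0

# Matrix view: R_hat = R (Y^T Y); Y^T Y is a numItems x numItems
# "similarity" matrix, just like in item-based CF.
S = Y.T @ Y
R_hat = R @ S
```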

Asymmetric SVD

Asymmetric SVD's main advantage is that it overcomes the main shortcoming of SVD++: that algorithm is not model-based for new users, so it needs recomputing every time a new user is added to the system.

With A-SVD, when new users are added into the system, we can start giving them relevant recommendations just by asking for a few of their favorite items.

Instead of using vector x to represent user preferences directly, we introduce another matrix Q. Q multiplied by the vector r of recorded user ratings approximates the vector x. We can learn Q with ML approaches.

KEY IDEA: we first solve how much a user likes a feature, then how much a user likes an item.

Hybrid RS

Hybrid RS combine the results of two or more algorithms together.

Linear Combination

The simplest hybridization: compute the estimated rating for user u and item i as the linear combination of the results of different RS. We can also combine lists.
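
A one-line sketch of the linear combination (the two estimate vectors and the weight are made up; in practice the weight is tuned on a validation set):

```python
import numpy as np

r_cf = np.array([4.2, 1.0, 3.5])  # estimates from a collaborative algorithm
r_cb = np.array([3.8, 2.0, 3.0])  # estimates from a content-based algorithm

alpha = 0.7  # combination weight; the optimal value is dataset-dependent
r_hybrid = alpha * r_cf + (1 - alpha) * r_cb
```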

Disadvantages: the optimal weights depend on the dataset.

Pipeline Algorithms

We can use the result of one RS as the input of a second algorithm. It works better if we use a content-based technique to enrich a collaborative technique.

Disadvantages: memory and computation problem.

Merging models

Merge two algorithms, A and B, into one. For example, use a content-based method to calculate an item similarity matrix and IBCF to calculate another item similarity matrix; by merging the two matrices we can reach the very same conclusion.

S-SLIM: Side Information

S-SLIM merges collaborative filtering and content based approach.

SLIM is a CF technique, and both CF and CBF aim at calculating an item similarity matrix.

KEY IDEA: combine the two item similarity matrices, and use an error function to find the best weights. We must also constrain the resulting item similarity matrix to have a zero diagonal.
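
A tiny sketch of that combination (the two similarity matrices and the weight are made up; in S-SLIM the weight would be learned by minimizing the estimation error):

```python
import numpy as np

S_cf = np.array([[0.0, 0.8], [0.8, 0.0]])   # collaborative item similarities
S_cbf = np.array([[0.0, 0.3], [0.3, 0.0]])  # content-based item similarities

alpha = 0.6  # weight to be learned by minimizing the estimation error
S = alpha * S_cf + (1 - alpha) * S_cbf
np.fill_diagonal(S, 0.0)  # enforce the zero-diagonal constraint
```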

Context Aware RS

In some applications, the recommendations depend on external factors like the weather or the hour of the day. This information is called context.

To represent context, we augment the URM to include it as a new type of variable. The URM is not 2-D any more, it is 3-D! For each user-item pair, we record the rating in the different contexts.

Tensor Factorization

To use the tensor, we extend matrix factorization to the tensor case. With MF we estimate the URM as the product of two matrices, X and Y; with a tensor, we decompose the 3-D tensor into a product of X, Y and Z.

  1. X: how much the user likes a latent feature.
  2. Y: how much the latent feature is important to describe the item.
  3. Z: how much the latent feature is relevant in the context.
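
The three-way product above can be sketched with a single einsum (dimensions and random factors are illustrative): each estimated rating is r[u, i, c] = Σ_k X[u,k]·Y[i,k]·Z[c,k].

```python
import numpy as np

rng = np.random.default_rng(7)
n_users, n_items, n_contexts, N = 3, 4, 2, 2

X = rng.normal(size=(n_users, N))     # user x latent-feature affinity
Y = rng.normal(size=(n_items, N))     # item x latent-feature importance
Z = rng.normal(size=(n_contexts, N))  # context x latent-feature relevance

# Estimated 3-D URM: r[u, i, c] = sum_k X[u, k] * Y[i, k] * Z[c, k]
R_hat = np.einsum("uk,ik,ck->uic", X, Y, Z)
```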

Factorization Machines

FM is a newer collaborative filtering technique that handles side information. It looks like Matrix Factorization, but the technique is quite different.

LINK 1
LINK 2

The idea of FM is to treat all the users and items as inputs and the rating as the output; in this way we turn the original RS problem into a regression problem.

FM vs CF

FM creates a table, approximates each rating in the table with a formula, and estimates the unknown parameters by minimizing some error — just like collaborative filtering.
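
That formula can be sketched as the standard degree-2 FM score (feature layout, sizes and values are made up): a global bias, a linear term per column, and a pairwise term ⟨v_i, v_j⟩·x_i·x_j computed efficiently with the squared-sum trick.

```python
import numpy as np

rng = np.random.default_rng(3)
n_features, k = 7, 3      # e.g. one-hot user ids + one-hot item ids + side info
w0 = 0.1                              # global bias
w = rng.normal(size=n_features)       # linear weight per input column
V = rng.normal(size=(n_features, k))  # one latent vector per input column

def fm_predict(x):
    """Degree-2 FM score in O(k * n) via the squared-sum trick."""
    linear = w0 + w @ x
    s = V.T @ x                    # sum_i v_i * x_i, shape (k,)
    s_sq = (V.T ** 2) @ (x ** 2)   # sum_i (v_i * x_i)^2, shape (k,)
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + pairwise

# One row: user 1 rated item 2 (one-hot user block, one-hot item
# block, plus one numeric side-information column).
x = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.5])
rating_estimate = fm_predict(x)
```

Adding side information is just adding columns to x (and rows to w and V), which is what makes FM flexible.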

Advantages

FM is more flexible with side information: just add new columns!

Disadvantages

The computational cost is high, and we need regularization to control the overfitting problem.