3
votes

I was following this blog post, http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ (I am also attaching the matrix here), on rating prediction using matrix factorization. Initially we have a sparse user-movie matrix R.

[Image: the sparse user-movie rating matrix R]

We then apply the MF algorithm to build a new matrix R' as the product of two matrices, P (U x K) and Q (D x K), i.e. R' = P Q^T. We then "minimize" the error between the known values in R and the corresponding values in R'. So far so good. But in the final step, when the matrix is filled in, I am not convinced that these are the ratings the user would actually give. Here is the final matrix:

[Image: the filled-in matrix R' of predicted ratings]

What is the justification for treating these as the "predicted" ratings? Also, I am planning to use the P matrix (U x K) as the users' latent features. Can we somehow "justify" that these are in fact the users' latent features?

2 Answers

0
votes

The justification for using the vectors obtained for each user as latent trait vectors is that these values of the latent traits minimize the error between the predicted ratings and the actual known ratings.

If you compare the predicted ratings with the known ratings in the two diagrams you posted, you can see that the difference between the two matrices is very small in the cells that are common to both. Example: U1D4 is 1 in the first diagram and 0.98 in the second.

Since the user latent trait vectors reproduce the known ratings well, we expect them to do a good job of predicting the unknown ratings too. Of course, we use regularisation to avoid overfitting the training data, but that is the general idea.
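To make this concrete, here is a minimal sketch of the idea in Python with NumPy: fit P and Q by stochastic gradient descent on the known cells only, with an L2 penalty, and then compare R with P Q^T on those cells. The toy matrix (0 marks a missing rating), the function name `factorize`, and the settings `K`, `steps`, `lr` and `reg` are my own assumptions for illustration, not necessarily what the tutorial uses.

```python
import numpy as np

def factorize(R, K=2, steps=5000, lr=0.002, reg=0.02, seed=0):
    """Fit R ~ P @ Q.T on the observed (non-zero) cells with regularised SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.random((n_users, K))   # one latent vector per user
    Q = rng.random((n_items, K))   # one latent vector per item
    observed = np.argwhere(R > 0)  # (user, item) index pairs of known ratings
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()
            # Gradient step on err**2 plus an L2 penalty on both vectors.
            P[u] += lr * (2 * err * Q[i] - reg * P[u])
            Q[i] += lr * (2 * err * p_old - reg * Q[i])
    return P, Q

# Assumed toy data: 5 users x 4 items, 0 = rating unknown.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

P, Q = factorize(R)
R_hat = P @ Q.T                    # filled-in matrix of predictions
known = R > 0
rmse_known = np.sqrt(np.mean((R[known] - R_hat[known]) ** 2))
print(np.round(R_hat, 2))
print("RMSE on known cells:", rmse_known)
```

A small error on the known cells only shows that the factors fit the training data. Whether the values in the unknown cells are good predictions has to be checked on ratings that were held out of training.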

0
votes

To evaluate how good your latent feature vectors are, you should split your data into training, validation and test sets.

The training set consists of the observed ratings that you use to learn your latent features. The validation set is used during learning to tune your model's hyperparameters (for example K or the regularisation strength), and the test set is used to evaluate your latent features once they have been learnt. You can simply set aside a percentage of the observed samples for validation and test. If your ratings are timestamped, a natural way to select them is to use the most recent samples as validation and test.

More details on splitting your data are here: https://link.medium.com/mPpwhdhjknb
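The random split described above could be sketched like this in Python with NumPy. The toy matrix (0 marks a missing rating), the helper names `split_observed` and `mask_to`, and the 20%/20% fractions are assumptions for illustration only.

```python
import numpy as np

def split_observed(R, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split the observed (non-zero) cells into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(R > 0)
    observed = observed[rng.permutation(len(observed))]  # shuffle the cells
    n_val = int(round(val_frac * len(observed)))
    n_test = int(round(test_frac * len(observed)))
    val = observed[:n_val]
    test = observed[n_val:n_val + n_test]
    train = observed[n_val + n_test:]
    return train, val, test

def mask_to(R, cells):
    """Copy of R that keeps only the given cells; everything else is set to 0."""
    out = np.zeros_like(R)
    out[cells[:, 0], cells[:, 1]] = R[cells[:, 0], cells[:, 1]]
    return out

def rmse(R, R_hat, cells):
    """Root mean squared error of the predictions R_hat on the given cells."""
    diff = R[cells[:, 0], cells[:, 1]] - R_hat[cells[:, 0], cells[:, 1]]
    return float(np.sqrt(np.mean(diff ** 2)))

# Assumed toy data: 5 users x 4 items, 0 = rating unknown.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

train, val, test = split_observed(R)
R_train = mask_to(R, train)  # learn the latent features from this matrix only
print(len(train), len(val), len(test))
```

You would learn P and Q from `R_train`, use `rmse(R, P @ Q.T, val)` to choose settings such as K, and report `rmse(R, P @ Q.T, test)` once at the end. With timestamped ratings, you could sort the observed cells by time instead of shuffling them, so that the most recent ones end up in validation and test.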