LightFM train_interactions shared among train and test sets: This will cause incorrect evaluation, check your data split

Question

tl;dr: I'm building a recommendation system on the Yelp dataset, but I run into the error "Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split." when running the following LightFM code.

test_auc = auc_score(model,
                    test,
                    #train_interactions=train, #Unable to run with this line uncommented
                    item_features=sparse_features_matrix,
                    num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)

Full story: I'm working with the Yelp dataset to build a recommendation system.

I'm following the code in the example documentation (https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html) for hybrid collaborative filtering.

I ran my code the following way:

from sklearn.model_selection import train_test_split
from lightfm import LightFM
from scipy import sparse
from lightfm.evaluation import auc_score

train, test = train_test_split(sparse_Rating_Matrix, test_size=0.25,random_state=4)
# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 100
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                item_features=sparse_features_matrix,
                epochs=NUM_EPOCHS,
                num_threads=NUM_THREADS)

# Don't forget to pass in the item features again!
train_auc = auc_score(model,
                      train,
                      item_features=sparse_features_matrix,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training set AUC: %s' % train_auc)

test_auc = auc_score(model,
                    test,
                    #train_interactions=train, # Unable to run with this line uncommented
                    item_features=sparse_features_matrix,
                    num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)

I had 2 problems:

1) Running with the line in question uncommented (train_interactions=train) originally raised an "Inconsistent Shape" error.

I resolved this by appending a block of zero rows to the "test" matrix until its dimensions matched those of my train matrix (per this recommendation: https://github.com/lyst/lightfm/issues/369):

import numpy as np

# Add rows of zeros (blank users) to test so its row count matches train
N = train.shape[0]  # Rows in train set
n, m = test.shape  # Rows and columns in test set

z = np.zeros([(N - n), m])  # Create the necessary rows of zeros with m columns
test = test.todense()  # Temporarily convert test into a dense matrix
test = np.vstack((test, z))  # Vertically stack test on top of the blank users
test = sparse.csr_matrix(test)  # Convert back to sparse
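
The padding step above can also be done without converting to a dense matrix, which matters for a matrix as large as a Yelp ratings table. The following is only a minimal sketch: `pad_rows` is a hypothetical helper name, not part of LightFM or scipy. Note also that sklearn's train_test_split splits the matrix by rows (users), so after padding, row i of test does not refer to the same user as row i of train; padding fixes the shape but not the alignment.

```python
import numpy as np
from scipy import sparse


def pad_rows(matrix, n_rows):
    """Append all-zero rows to a sparse matrix until it has n_rows rows.

    Hypothetical helper for illustration; it only fixes the shape and does
    not make the rows of two differently split matrices refer to the same users.
    """
    matrix = sparse.csr_matrix(matrix)
    missing = n_rows - matrix.shape[0]
    if missing < 0:
        raise ValueError("matrix already has more rows than n_rows")
    if missing == 0:
        return matrix
    # An empty sparse block costs almost no memory, unlike np.zeros
    zeros = sparse.csr_matrix((missing, matrix.shape[1]), dtype=matrix.dtype)
    return sparse.vstack([matrix, zeros], format="csr")


# Tiny stand-in for the real test matrix (2 users x 3 items)
small_test = sparse.csr_matrix(np.array([[0, 5, 0], [3, 0, 0]]))
padded = pad_rows(small_test, 5)
print(padded.shape)
```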

2) After the shape issue was resolved, I tried to pass "train_interactions=train" again.

This time I ran into: Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split.

I'm not sure how to resolve this second issue. Any ideas?
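
To see which entries trigger this message, one option is to list the (user, item) positions that are non-zero in both matrices. This is a minimal sketch using scipy only; `shared_interactions` is a hypothetical helper, and it is an assumption that this mirrors the check LightFM performs internally.

```python
import numpy as np
from scipy import sparse


def shared_interactions(train, test):
    """Return sorted (row, col) pairs that are non-zero in both matrices.

    Hypothetical diagnostic helper; both matrices must have the same shape.
    """
    train = sparse.csr_matrix(train)
    test = sparse.csr_matrix(test)
    # Element-wise product is non-zero only where both matrices are non-zero
    overlap = sparse.csr_matrix(train.multiply(test))
    overlap.eliminate_zeros()
    coo = overlap.tocoo()
    return sorted(zip(coo.row.tolist(), coo.col.tolist()))


# Toy matrices: users x items, values are ratings
toy_train = sparse.csr_matrix(np.array([[5, 0, 3], [0, 4, 0]]))
toy_test = sparse.csr_matrix(np.array([[1, 0, 0], [0, 4, 2]]))

shared = shared_interactions(toy_train, toy_test)
print(len(shared), shared)
```

Running this on the real train and test matrices should show whether the shared entries come from misaligned rows or from the same user-item pair landing on both sides of the split.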

Details:
- "sparse_features_matrix" is a sparse matrix of {items x categories}: if an item is "Italian" and "Pizza", then the "Italian" and "Pizza" columns have the value 1 in that item's row, and 0 elsewhere.
- "sparse_Rating_Matrix" is a sparse matrix of {users x items} containing each user's rating of the restaurant (item).

04/08/2020 Update:
LightFM has a whole Dataset class (lightfm.data.Dataset) that you should use to prepare your data set prior to model evaluation. I found a great GitHub post (https://github.com/lyst/lightfm/issues/494) where user Med-ELOMARI provides an amazing walkthrough on a small test data set.

When I prepped my data through this method, I was able to add the user_features that I wanted to model (e.g. User_1592 likes "Thai", "Mexican", "Sushi" cuisines).

Per Turbo's comment, I switched to LightFM's random_train_test_split method (I had originally split my data with sklearn's train_test_split). I ran auc_score with the new train/test sets and the correctly (as far as I'm aware) prepared model, but I still run into the same error:

Input:

%%time
from lightfm.cross_validation import random_train_test_split

(train, test) = random_train_test_split(lightfm_interactions, test_percentage=0.25)  # LightFM's method to split
# Don't forget to pass in the user features again!
train_auc = auc_score(model_users,
                      train,
                      user_features=lightfm_user_features_list,
                      num_threads=NUM_THREADS).mean()
print('User_feature training set AUC: %s' % train_auc)

test_auc = auc_score(model_users,
                    test,
                    #train_interactions=train, #Still can't get this to function
                    user_features=lightfm_user_features_list,
                    num_threads=NUM_THREADS).mean()
print('User_feature test set AUC: %s' % test_auc)

Output if "train_interactions=train" is used:

ValueError: Test interactions matrix and train interactions matrix share 435 interactions. This will cause incorrect evaluation, check your data split.

The good news, however, is that by switching from sklearn's train_test_split to LightFM's random_train_test_split, my model's training AUC went from 0.49 to 0.96. So I guess it's important to stick with LightFM's methods where available!
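
One possible cause of the remaining overlap, which is an assumption and not something confirmed in this thread, is that the interactions matrix stores the same (user, item) pair more than once, for example when a user reviewed the same restaurant several times. A split that assigns stored entries one by one can then put one copy in train and another in test. The sketch below illustrates this with scipy only; `split_by_entry` is a hypothetical stand-in for a per-entry random split, not LightFM's implementation.

```python
import numpy as np
from scipy import sparse


def split_by_entry(interactions, test_fraction, seed):
    """Hypothetical per-entry split: shuffle stored entries, then cut in two."""
    coo = sparse.coo_matrix(interactions)
    order = np.random.RandomState(seed).permutation(coo.nnz)
    cut = int(round((1.0 - test_fraction) * coo.nnz))

    def build(idx):
        return sparse.coo_matrix(
            (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape
        )

    return build(order[:cut]), build(order[cut:])


def count_overlap(train, test):
    """Number of positions that are non-zero in both matrices."""
    product = sparse.csr_matrix(train.tocsr().multiply(test.tocsr()))
    product.eliminate_zeros()
    return product.nnz


# Three user-item pairs, each stored twice (6 stored entries in COO format)
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([1, 1, 2, 2, 0, 0])
vals = np.ones(6)
raw = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# With 3 of the 6 entries in train, at least one pair must be cut in half
train_raw, test_raw = split_by_entry(raw, test_fraction=0.5, seed=0)
overlap_before = count_overlap(train_raw, test_raw)

# Merge duplicates first; note that sum_duplicates adds the stored values
deduped = raw.copy()
deduped.sum_duplicates()
train_clean, test_clean = split_by_entry(deduped, test_fraction=0.5, seed=0)
overlap_after = count_overlap(train_clean, test_clean)

print(overlap_before > 0, overlap_after)
```

If duplicates do turn out to be the cause, deduplicating the interactions matrix before passing it to random_train_test_split may be enough. Because sum_duplicates adds the values of repeated entries, you may prefer to deduplicate the raw (user, item) pairs before building the matrix if the values are ratings.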

Hi. Did you ever end up figuring this out? I'm having the same issue and am using the random_train_test_split method. Thanks in advance — user2382321
Hi, unfortunately I decided to just forego the "train_interactions=train" line command that was giving me a headache. As noted in the update on 08Apr20 -- It seems LightFM has its own datatype for data objects and you might benefit from building your dataset within LightFM's structure instead of something like SciKitLearn. — Shyu
OK thanks for the info. I was using the lightfm methods to format and input the data and was still getting the same error. — user2382321

Turbo · Accepted Answer · 2020-04-07T15:35:54

LightFM provides its own way of splitting a dataset; have you looked at it? With it, the evaluation might work. https://making.lyst.com/lightfm/docs/cross_validation.html
