Split dataset per user according to timestamp in training and test set in python

Question

I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into a training set and a test set. With the pandas DataFrame.sample function I can divide the data into random splits. For example:

train = df.sample(frac=0.8,random_state=200)

test = df.drop(train.index)

Now I am trying to sort the data by user_id and then by timestamp, and I need to split it 80%/20% per user into a training set and a test set, respectively.

So, for example, if user1 rated 10 movies, then the entries for this user should be sorted from oldest to latest according to timestamp:

ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)

sorted_df = ratings.sort_values(['user_id', 'Timestamp'], ascending=[True, True])

and the split should put the first 8 entries (the oldest timestamps) in the training set and the latest 2 entries in the test set.

I have no idea how I could do that. Any suggestions?

Thanks

Data:

           user_id   item_id   rating   Timestamp 
15              1      539        5  838984068
16              1      586        5  838984068
5               1      355        5  838984474
9               1      370        5  838984596
12              1      466        5  838984679
14              1      520        5  838984679
19              1      594        5  838984679
7               1      362        5  838984885
20              1      616        5  838984941
23              2      260        5  868244562
29              2      733        3  868244562
32              2      786        3  868244562
36              2     1073        3  868244562
33              2      802        2  868244603
38              2     1356        3  868244603
30              2      736        3  868244698
31              2      780        3  868244698
27              2      648        2  868244699

Alessandro Mariani · Accepted Answer · 2017-02-22T15:22:39

It requires multiple steps, but it can be achieved as follows.

The intuition is to generate a rank according to the timestamp and constrain it between 0 and 1. Then every row whose normalized rank is at most 0.8 goes into your train set, and the rest into your test set.

How do we do this? Creating the rank is as easy as this:

df.groupby('user_id')['Timestamp'].rank(method='first')
Out[51]: 
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9     1.0
10    2.0
11    3.0
12    4.0
13    5.0
14    6.0
15    7.0
16    8.0
17    9.0
Name: Timestamp, dtype: float64
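
Several rows in the sample data share a timestamp, which is presumably why method='first' is used here. The sketch below, on a small made-up frame (the values are hypothetical, not taken from the question), compares it with the default method='average':

```python
import pandas as pd

# Hypothetical toy data: user 1 has two ratings with the same timestamp.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "Timestamp": [100, 100, 200, 50, 40],
})

# method='first' gives tied timestamps distinct ranks, in order of appearance.
first_ranks = toy.groupby("user_id")["Timestamp"].rank(method="first")

# The default method='average' gives tied rows the same fractional rank.
average_ranks = toy.groupby("user_id")["Timestamp"].rank()

print(first_ranks.tolist())
print(average_ranks.tolist())
```

With distinct ranks every user gets exactly one row per rank position, so the later threshold cuts the group at a predictable place even when timestamps repeat.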

Then you need to map each row to the number of values in its group. You can find additional information here: Inplace transformation pandas with groupby.

df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
Out[52]: 
0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
14    9
15    9
16    9
17    9
Name: user_id, dtype: int64
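
As a possible alternative to the map/apply(len) combination, groupby(...).transform('size') should give the same per-row group counts in one call. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data: user 1 has three ratings, user 2 has two.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "Timestamp": [100, 100, 200, 50, 40],
})

# Approach from the answer: count per group, then map the counts back by user_id.
counts_via_map = toy["user_id"].map(
    toy.groupby("user_id")["Timestamp"].apply(len)
)

# Alternative: transform broadcasts the group size back onto every row.
counts_via_transform = toy.groupby("user_id")["Timestamp"].transform("size")

print(counts_via_map.tolist())
print(counts_via_transform.tolist())
```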

Now you can put everything together

ranks = df.groupby('user_id')['Timestamp'].rank(method='first')
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
(ranks / counts) > 0.8
Out[55]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17     True
dtype: bool
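
The boolean mask above can then be used to select the two sets. The following is a minimal sketch of that last step; the function name split_per_user, its parameters, and the toy data are assumptions made for illustration, not part of the original answer:

```python
import pandas as pd


def split_per_user(df, user_col="user_id", time_col="Timestamp", train_frac=0.8):
    """Put each user's oldest rows in train and the newest rows in test."""
    # Rank each rating within its user, oldest first; ties keep row order.
    ranks = df.groupby(user_col)[time_col].rank(method="first")
    # Number of ratings of the user that each row belongs to.
    counts = df.groupby(user_col)[time_col].transform("size")
    # Rows whose normalized rank exceeds train_frac go to the test set.
    is_test = (ranks / counts) > train_frac
    return df[~is_test], df[is_test]


# Hypothetical toy data: user 1 has 10 ratings, user 2 has 5, unsorted.
ratings = pd.DataFrame({
    "user_id": [1] * 10 + [2] * 5,
    "item_id": list(range(100, 115)),
    "Timestamp": list(range(10, 0, -1)) + [5, 3, 4, 1, 2],
})

train, test = split_per_user(ratings)
print(len(train), len(test))
```

Because the mask is built from ranks, the frame does not need to be sorted beforehand. Users with very few ratings may end up with no test rows at all (for example, a user with four ratings only has a row above 0.8 at rank 4), so it may be worth checking the per-user sizes afterwards.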
