I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into a training set and a test set. Using the pandas DataFrame.sample function, I can divide the data into random splits. For example:
train = df.sample(frac=0.8,random_state=200)
test = df.drop(train.index)
Now I want to sort the data by user_id and then by timestamp, and split each user's ratings 80%/20% into the training set and test set respectively.
So, for example, if user1 rated 10 movies, then the entries for this user should be sorted from oldest to latest according to timestamp:
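import pandas as pd
ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)
# DataFrame.sort() was removed from pandas; sort_values() is the replacement
sorted_df = ratings.sort_values(['user_id', 'timestamp'], ascending=[True, True])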
and the splitting should be in such a way that the first 8 entries with oldest timestamp will be in training set and the latest 2 entries will be in the test set.
I have no idea how I could do that. Any suggestions?
Thanks
Data:
user_id item_id rating Timestamp
15 1 539 5 838984068
16 1 586 5 838984068
5 1 355 5 838984474
9 1 370 5 838984596
12 1 466 5 838984679
14 1 520 5 838984679
19 1 594 5 838984679
7 1 362 5 838984885
20 1 616 5 838984941
23 2 260 5 868244562
29 2 733 3 868244562
32 2 786 3 868244562
36 2 1073 3 868244562
33 2 802 2 868244603
38 2 1356 3 868244603
30 2 736 3 868244698
31 2 780 3 868244698
27 2 648 2 868244699
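
To make the desired result concrete, here is a minimal sketch of the per-user chronological split described above. It is only one possible approach, not necessarily the best one. The helper name split_per_user, its train_frac parameter, and the small example frame are all made up for illustration; the column names follow the code earlier in the post (user_id, timestamp), and rounding the per-user training count to the nearest integer is an assumption about how to handle users whose rating count is not a multiple of 5.

```python
import pandas as pd

# Made-up example data using the column names from the code above
# (user_id, item_id, rating, timestamp); the values are illustrative only.
ratings = pd.DataFrame({
    'user_id':   [1] * 10 + [2] * 5,
    'item_id':   list(range(100, 110)) + list(range(200, 205)),
    'rating':    [5, 4, 3, 5, 4, 3, 5, 4, 3, 5, 2, 3, 4, 5, 1],
    'timestamp': [10, 3, 7, 1, 9, 2, 8, 4, 6, 5, 50, 20, 40, 10, 30],
})


def split_per_user(df, train_frac=0.8):
    """Hypothetical helper: the oldest train_frac of each user's rows go to train."""
    # Order each user's ratings from oldest to latest.
    df = df.sort_values(['user_id', 'timestamp'])
    grouped = df.groupby('user_id')
    # 0-based position of each row within its user's sorted history.
    position = grouped.cumcount()
    # Number of ratings the row's user has, repeated on every row of that user.
    n_ratings = grouped['user_id'].transform('size')
    # Assumption: the per-user training count is rounded to the nearest integer.
    n_train = (n_ratings * train_frac).round().astype(int)
    in_train = position < n_train
    return df[in_train], df[~in_train]


train, test = split_per_user(ratings)
```

With this example frame, user 1 has 10 ratings and user 2 has 5, so the intent is that the 8 oldest and 4 oldest rows respectively land in train and the remaining rows land in test. Because the mask is built from cumcount rather than a Python loop over users, it should scale reasonably to the full ratings file.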