How to vectorize searching function in Matlab?

Question

Here is a MATLAB coding problem (a slightly different version, using intersect instead of setdiff, is here):

I have a rating matrix A with 3 columns: the 1st column is the user ID (which may be duplicated), the 2nd column is the item ID (which may also be duplicated), and the 3rd column is the rating the user gave the item, ranging from 1 to 5.

Now, I have a subset of user IDs, smallUserIDList, and a subset of item IDs, smallItemIDList. For each user in smallUserIDList, I want to find the rows of A rated by that user, collect the items that user rated, and do some calculation on them, such as taking the setdiff with smallItemIDList and counting the result, as the following code does:

userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
    A2= A(A(:,1) == smallUserIDList(i), :);
    itemIDList_each = unique(A2(:,2));

    setDiff = setdiff(itemIDList_each , smallItemIDList);
    userStat(i) = length(setDiff);
end
userStat

The profiler shows that the loop above is inefficient. The question is: how can I improve this piece of code with vectorization, without the help of a for loop?

For example:

Input:

A = [
1 11 1
2 22 2
2 66 4
4 44 5
6 66 5
7 11 5
7 77 5
8 11 2
8 22 3
8 44 3
8 66 4
8 77 5
]

smallUserIDList = [1 2 7 8]
smallItemIDList = [11 22 33 55 77]

Output:

userStat =

 0
 1
 0
 2

It would be good if you added sample data and expected output so that people have something to compare their answers against. — kkuilla
I wonder if it would help if you put the calculation inside the loop in a function - that way the optimization routine will recognize you only care about userStat and won't copy the other variables into the workspace. — bdecaf
Is it possible that there will be two entries with the same userID and the same itemID but with different ratings? If not, simply build a sparse matrix. — knedlsepp
@kkuilla Hi! Good idea, I have added example data and output to make the question more explicit. — archenoo

knedlsepp · Accepted Answer · 2015-04-07T19:02:34

Vanilla MATLAB:

As far as I can tell your code is equivalent to:

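%// Create matrix such that: user_item_rating(user,item)==rating
%// Size it explicitly so IDs in the small lists that never occur in A stay in range
nUsers = max([A(:,1); smallUserIDList(:)]);
nItems = max([A(:,2); smallItemIDList(:)]);
user_item_rating = sparse(A(:,1),A(:,2),A(:,3),nUsers,nItems);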

%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];

%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);

%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);

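This will work if the IDs are positive integers and there is at most one rating per (user, item) combination; sparse adds up the values of repeated index pairs, so with duplicates the stored entry is no longer a single rating. It should also be quite efficient.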
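
As a quick check, the steps above can be run on the sample data from the question. This is only a sketch: the short name R and the final full() call (used to get an ordinary column vector out of the sparse result) are choices made here, not part of the answer. On this data every ID in the two small lists is within the bounds of the matrix.

```matlab
%// Sample data from the question
A = [1 11 1; 2 22 2; 2 66 4; 4 44 5; 6 66 5; 7 11 5; ...
     7 77 5; 8 11 2; 8 22 3; 8 44 3; 8 66 4; 8 77 5];
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];

%// User-by-item matrix of ratings (8 x 77 for this data)
R = sparse(A(:,1), A(:,2), A(:,3));

%// Drop the columns of the listed items, then pick the listed users in order
R(:, smallItemIDList) = [];
R = R(smallUserIDList, :);

%// Count the remaining rated items per user and convert to a full vector
userStat = full(sum(R ~= 0, 2));
disp(userStat)
```

The result is the column 0, 1, 0, 2, which matches the expected output given in the question.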

Clean approach without reinventing the wheel:

Check out grpstats from the Statistics Toolbox! An implementation could look similar to this:

%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});

%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);

%// Keep only users we care about (smallUserIDList) 
ratings = ratings(ismember(ratings.user, smallUserIDList),:);

%// Compute the statistics grouped by 'user'.
%// The per-user row count is in userStat.GroupCount; users with no rows left are omitted.
userStat = grpstats(ratings, 'user');
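
If the Statistics Toolbox is not available, the same filter-then-count idea can be sketched with base functions only. This is not the answerer's code: it swaps grpstats for ismember, unique and accumarray, and the names isUser, pos, keep and pairs are illustrative. It assumes smallUserIDList contains no repeated IDs. Unlike the sparse approach, it does not need the IDs to be valid matrix indices, and repeated (user, item) rows are counted once.

```matlab
%// Sample data from the question
A = [1 11 1; 2 22 2; 2 66 4; 4 44 5; 6 66 5; 7 11 5; ...
     7 77 5; 8 11 2; 8 22 3; 8 44 3; 8 66 4; 8 77 5];
smallUserIDList = [1 2 7 8];
smallItemIDList = [11 22 33 55 77];

%// pos(k) is the position of row k's user in smallUserIDList (0 if not listed)
[isUser, pos] = ismember(A(:,1), smallUserIDList);

%// Keep rows whose user is listed and whose item is NOT listed
keep = isUser & ~ismember(A(:,2), smallItemIDList);

%// Collapse repeated (user position, item) pairs so each item counts once
pairs = unique([pos(keep), A(keep,2)], 'rows');

%// Count pairs per position in smallUserIDList; users with none get 0
userStat = accumarray(pairs(:,1), 1, [numel(smallUserIDList), 1]);
disp(userStat)
```

Because the counts are accumulated by position in smallUserIDList, the result comes out in the order of that list and includes the zero entries, which is the shape of the expected output in the question.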
