
I have purchase data in a Pandas DataFrame (df_temp). I switched from the DataFrame to a sparse csr_matrix because I have many products (89,000) for which I need the user-item information (purchased or not purchased), and I then want to compute the similarities between products.

First, I converted the Pandas DataFrame to a NumPy record array:

    import numpy as np

    df_user_product = df_temp[['user_id','product_id']].copy()
    ar1 = np.array(df_user_product.to_records(index=False))

Second, I created a coo_matrix, because it is known for fast sparse matrix construction.

    from scipy import sparse

    rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
    cols, c_pos = np.unique(ar1['user_id'], return_inverse=True)
    s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)))
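
To make the construction above concrete, here is a minimal sketch on toy data. The ids below are made up for illustration and are not from my dataset; the explicit `shape` argument is my addition.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy purchases; one (user, product) pair appears twice.
user_ids = np.array([10, 10, 20, 30, 30])
product_ids = np.array([500, 700, 500, 900, 900])

# np.unique returns the sorted unique ids and, for every input entry,
# its position in that sorted array (used as the matrix row/column index).
rows, r_pos = np.unique(product_ids, return_inverse=True)
cols, c_pos = np.unique(user_ids, return_inverse=True)

# Rows are products, columns are users, one stored 1 per purchase record.
s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)),
                      shape=(len(rows), len(cols)))

# Duplicate (row, col) entries are summed on conversion, hence the 2.
dense = s.toarray()
```

The repeated purchase shows up as a 2, which is why the matrix has to be clipped back to 0/1 after conversion.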

Third, since csr_matrix or csc_matrix is better suited to matrix calculations, I converted to csr_matrix: my product_id values are in the rows, and CSR gives more efficient row slicing than CSC.

    sparse_csr_mat = s.tocsr()
    # tocsr() sums duplicate (product, user) entries, so clip counts back to 1
    sparse_csr_mat[sparse_csr_mat > 1] = 1

Then, I calculated the cosine similarity between products and put the result in similarities:

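    import sklearn.preprocessing as pp
    # axis=1 scales each row (one product) to unit L2 norm, despite the variable name
    col_normed_mat = pp.normalize(sparse_csr_mat, axis=1)
    # for scipy sparse matrices, * is a matrix product, not element-wise
    similarities = col_normed_mat * col_normed_mat.T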
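
As a sanity check of this step, here is a small sketch on a toy matrix. It does the row normalization with SciPy only instead of sklearn's `normalize`, and it assumes every row has at least one nonzero entry (true for my data, since each product comes from a purchase record). The toy values are hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical binary product x user matrix (3 products, 3 users).
m = sparse.csr_matrix(np.array([[1.0, 1.0, 0.0],
                                [1.0, 0.0, 0.0],
                                [0.0, 0.0, 1.0]]))

# L2 norm of each row; assumes no all-zero rows (would divide by zero).
row_norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())

# Scale each row to unit length by left-multiplying with a diagonal matrix.
normed = sparse.diags(1.0 / row_norms).dot(m).tocsr()

# Dot products of unit-length rows are cosine similarities; stays sparse.
sim = normed.dot(normed.T)
```

Note that the diagonal of `sim` is always 1 (each product is perfectly similar to itself), so a top-5 lookup should skip the product's own entry.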

Which is:

<89447x89447 sparse matrix of type '<type 'numpy.float64'>'
    with 1332945 stored elements in Compressed Sparse Row format>

Now I want to end up with a dictionary that maps each product to its 5 most similar products. How can I do that? I don't want to convert the sparse matrix to a dense array because of memory constraints. I also don't know whether a csr_matrix can be accessed the way an array can, for example by selecting the row for a given product_id. That would give me all products similar to that product_id, which I could then sort by cosine similarity to keep the top 5.

For example, a row in similarities matrix:

(product_id1, product_id2) 0.45

How can I keep only the X (= 5 in my case) most similar products to product_id1, without converting the matrix to a dense array?

From what I have read on Stack Overflow, I think lil_matrix could be used for this. If so, how?

Thanks for the help!

A minor point: sparse.csr_matrix(...) takes the same inputs as coo_matrix. The coo format is, in effect, a raw data format, which is converted to csr format for almost all calculations. Often that conversion is automatic. But to do things like indexing, you do need the csr format. - hpaulj
It might help if you demonstrated what you want to do with a small dense array. - hpaulj
You could do M = ...tolil() and look at M.rows. That will be an array of lists of the column indices. Or get the lil of the transpose, to get lists of row indices. - hpaulj
There have been questions about getting the top N values per row of sparse matrices, e.g. stackoverflow.com/questions/49207275/…. The same approach applied to a csc can get the top values per column. - hpaulj
@hpaulj I don't understand what you mean by automatic conversion of coo_matrix to csr_matrix during calculations. The documentation says that coo_matrix is less efficient for calculations and that csr is the better choice. - SarahData

1 Answer


I finally worked out how to get the 5 most similar items for each product: convert the matrix with .tolil(), turn each row into a NumPy array, and use argsort to pick the 5 most similar items. I used the solution @hpaulj suggested in the linked question.

    def max_n(row_data, row_indices, n):
        # positions of the n largest values, in ascending order of value
        i = row_data.argsort()[-n:]
        # i = row_data.argpartition(-n)[-n:]
        top_values = row_data[i]
        top_indices = row_indices[i]  # do the sparse indices matter?

        return top_values, top_indices, i

I then applied it to one row as a test:

    arr_ll = similarities.tolil()  # assumption: arr_ll is the LIL form of similarities
    top_v, top_ind, ind = max_n(np.array(arr_ll.data[0]), np.array(arr_ll.rows[0]), 5)
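
The same idea can probably be applied to every row straight from the CSR arrays, without going through LIL. The following is only a sketch: `top_n_per_row` is a hypothetical helper name, the toy similarity values are made up, and skipping the diagonal (a product's similarity to itself) is my own assumption about what is wanted.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(sim_csr, n):
    """Map each row index to its n largest stored off-diagonal entries,
    as a list of (column index, value) pairs, best first."""
    result = {}
    for r in range(sim_csr.shape[0]):
        # indptr delimits the slice of indices/data that belongs to row r
        start, end = sim_csr.indptr[r], sim_csr.indptr[r + 1]
        cols = sim_csr.indices[start:end]
        vals = sim_csr.data[start:end]
        keep = cols != r                      # drop the self-similarity
        cols, vals = cols[keep], vals[keep]
        order = np.argsort(vals)[::-1][:n]    # descending by similarity
        result[r] = list(zip(cols[order].tolist(), vals[order].tolist()))
    return result

# Hypothetical 4 x 4 similarity matrix; zeros are not stored.
sim = sparse.csr_matrix(np.array([[1.0, 0.7, 0.0, 0.2],
                                  [0.7, 1.0, 0.5, 0.0],
                                  [0.0, 0.5, 1.0, 0.9],
                                  [0.2, 0.0, 0.9, 1.0]]))
top = top_n_per_row(sim, 2)
```

Only the stored entries of each row are touched, so nothing is ever densified. Rows with fewer than n stored neighbours simply get a shorter list.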

What I need is top_indices, the indices of the 5 most similar products, but those indices are not the real product_id values. I mapped product IDs to indices when I constructed the coo_matrix:

    rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)

But how do I get the real product_id back from an index?

For example, suppose I have:

    top_ind = [2 1 34 9 123]

How do I find out which product_id 2 corresponds to, which one 1 corresponds to, and so on?
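
One thing that looks relevant here: np.unique with return_inverse=True returns the sorted unique values as its first output, and the inverse positions index into exactly that array. So indexing `rows` with a matrix row index should give the original product_id back. A minimal sketch with made-up ids:

```python
import numpy as np

# Hypothetical product_id column, standing in for ar1['product_id'].
product_ids = np.array([700, 500, 900, 500, 700, 300])

# rows holds the sorted unique ids; r_pos[k] is the matrix row of record k.
rows, r_pos = np.unique(product_ids, return_inverse=True)

# Matrix row indices, e.g. as returned by a top-n lookup (made up here).
top_ind = np.array([2, 0, 3])

# Fancy indexing translates row indices back into real product ids.
top_product_ids = rows[top_ind]

# Round trip: indexing rows with r_pos reproduces the original column.
round_trip = rows[r_pos]
```

The same lookup works for users with `cols` if the column indices ever need translating back to user_id values.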