1
votes

I have a .txt file from the Epinions data set in a sparse (triplet) representation: for example, the line 23 387 5 means "user 23 rated item 387 as 5". I want to convert this sparse format to a dense representation with scipy so that I can do matrix factorization on it.

I loaded the file with numpy's loadtxt(), which gives a [664824, 3] array. I then passed it to scipy.sparse.csr_matrix and called todense(), hoping to get the dense format, but I always get the same [664824, 3] matrix back. How can I turn it into the original [40163, 139738] dense representation?

import numpy as np
from scipy.sparse import csr_matrix

d = np.loadtxt("MFCode/Epinions_dataset.txt")
S = csr_matrix(d)
D = S.todense()

I expected a dense matrix with the shape of [40163,139738]

1
1) You will need ~21 GB of memory using int32. 2) You would do this using coo_matrix's constructor, which is very natural here. 3) All matrix-factorization techniques I know and have implemented in the collaborative-filtering setting (your use case looks like that) would never build this matrix, but work online on these observations (= rows of user-id, item-id, rating). The term matrix factorization might be misleading there. - sascha
Could you send me a link? My goal is to implement my own version of UV decomposition on the .txt dataset. - homa taha
Have you read the sparse documentation for coo or csr formats? csr_matrix(M) makes a sparse matrix from M, assuming M is itself a 2d dense array. The csr_matrix((data, (row, col))) version could use columns from your d matrix. Review the examples in the sparse docs. - hpaulj
Welcome to SO; question has actually nothing to do with machine-learning - kindly do not spam irrelevant tags (removed). - desertnaut

1 Answer

0
votes

A small sample of CSV-like text:

In [218]: import numpy as np
In [219]: txt = """0 1 3 
     ...: 1 0 4 
     ...: 2 2 5 
     ...: 0 3 6""".splitlines()                                                 
In [220]: data = np.loadtxt(txt)                                                
In [221]: data                                                                  
Out[221]: 
array([[0., 1., 3.],
       [1., 0., 4.],
       [2., 2., 5.],
       [0., 3., 6.]])

Using scipy.sparse with the (data, (row, col)) style of input:

In [222]: from scipy import sparse                                              
In [223]: M = sparse.coo_matrix((data[:,2], (data[:,0], data[:,1])), shape=(5,4))                                                                     
In [224]: M                                                                     
Out[224]: 
<5x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in COOrdinate format>
In [225]: M.toarray()
Out[225]: 
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
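
For the real file the shape is usually not typed in by hand. The following is a minimal sketch of the same (data, (row, col)) construction with the shape inferred from the largest ids; the `triples` array is a hypothetical stand-in for the loaded file, and the assumption that ids are 0-based is mine, not something stated about the Epinions data.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the loaded ratings file: user, item, rating per row
triples = np.array([[0, 1, 3],
                    [1, 0, 4],
                    [2, 2, 5],
                    [0, 3, 6]], dtype=np.int64)

rows = triples[:, 0]
cols = triples[:, 1]
vals = triples[:, 2].astype(np.float64)

# Assumption: ids are 0-based. For 1-based ids, subtract 1 from rows and cols first.
n_users = int(rows.max()) + 1
n_items = int(cols.max()) + 1

# Build in COO format, then convert to CSR for fast row slicing and products
ratings = sparse.coo_matrix((vals, (rows, cols)), shape=(n_users, n_items)).tocsr()

print(ratings.shape)  # (3, 4)
print(ratings.nnz)    # 4
print(ratings[0, 3])  # 6.0
```

Keeping the result in CSR form avoids allocating the full users-by-items array at all.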

Alternatively fill in a zeros array directly:

In [226]: arr = np.zeros((5,4))                                                 
In [227]: arr[data[:,0].astype(int), data[:,1].astype(int)]=data[:,2]           
In [228]: arr                                                                   
Out[228]: 
array([[0., 3., 0., 6.],
       [4., 0., 0., 0.],
       [0., 0., 5., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
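
The two approaches agree on the sample above, but they can differ if the same (user, item) pair appears more than once. A small sketch with made-up triples (not taken from the data set) that shows the difference:

```python
import numpy as np
from scipy import sparse

# Hypothetical triples in which the pair (0, 1) is rated twice
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
vals = np.array([3.0, 2.0, 4.0])

# coo_matrix sums duplicate entries when it is converted
summed = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 2)).toarray()

# Fancy-index assignment does not accumulate: only one of the duplicate values survives
assigned = np.zeros((2, 2))
assigned[rows, cols] = vals

# np.add.at accumulates repeated indices, matching the sparse result
accumulated = np.zeros((2, 2))
np.add.at(accumulated, (rows, cols), vals)

print(summed[0, 1])                          # 5.0
print(np.array_equal(summed, accumulated))   # True
```

If duplicates are possible in the file, it is worth deciding up front whether they should be summed, averaged, or overwritten.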

But beware that np.zeros([40163,139738]) could raise a memory error. M.toarray() (M.A is its shorthand in older SciPy versions) could also do that.
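
To see why, here is a rough back-of-the-envelope estimate using the sizes quoted in the question. The helper `dense_bytes` is a hypothetical name, and the sparse figure assumes one float64 value plus two int32 indices per stored rating, which is only an approximation of what scipy actually allocates.

```python
import numpy as np

# Sizes quoted in the question
n_users, n_items, n_ratings = 40163, 139738, 664824

def dense_bytes(n_rows, n_cols, dtype):
    """Bytes needed for a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize

dense_f64 = dense_bytes(n_users, n_items, np.float64)
dense_i32 = dense_bytes(n_users, n_items, np.int32)

# Assumption: COO storage with float64 values and int32 row/col indices
coo_approx = n_ratings * (8 + 4 + 4)

print(f"dense float64: {dense_f64 / 2**30:.1f} GiB")  # dense float64: 41.8 GiB
print(f"dense int32:   {dense_i32 / 2**30:.1f} GiB")  # dense int32:   20.9 GiB
print(f"sparse COO:    {coo_approx / 2**20:.1f} MiB") # sparse COO:    10.1 MiB
```

The dense array is several thousand times larger than the sparse one, which is why working directly on the (user, item, rating) observations is the usual choice.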