5
votes

I have a sparse matrix that I obtained by using Sklearn's TfidfVectorizer object:

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', vocabulary=my_vocab, stop_words='english')
tfidf = vect.fit_transform([my_docs])

The sparse matrix is (taking out the numbers for generality):

<sparse matrix of type '<type 'numpy.float64'>'
with stored elements in Compressed Sparse Row format>

I am trying to get a numeric value for each row that tells me how strongly a document matched the terms I am looking for. I don't really care which words it contained; I just want to know how many of them it contained. So I want the norm of each row, or row * row.T. However, I am having a very hard time getting numpy to do this.

My first approach was to just simply do:

tfidf[i] * numpy.transpose(tfidf[i])

However, numpy will apparently not transpose an array with fewer than two dimensions, so that will just square the vector. So I tried doing:

tfidf[i] * numpy.transpose(numpy.atleast_2d(tfidf[0]))

But numpy.transpose(numpy.atleast_2d(tfidf[0])) still would not transpose the row.
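(For reference, the plain-ndarray behaviour looks like this; the tiny vector below is just a made-up illustration, not my actual data:)

import numpy as np

v = np.arange(3.)                  # made-up 1-D example vector
print(v.T.shape)                   # (3,)   -- transposing a 1-D array is a no-op
print(np.atleast_2d(v).T.shape)    # (3, 1) -- this works for dense arrays, but not for my sparse rows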

I moved on to trying to get the norm of the row (that approach is probably better anyway). My initial approach was to use numpy.linalg:

numpy.linalg.norm(tfidf[0])

But that gave me a "dimension mismatch" error, so I tried to calculate the norm manually. I started by setting a variable to a numpy array version of the sparse matrix and printing the len of its first row:

my_array = numpy.array(tfidf)
print my_array
print len(my_array[0])

It prints out my_array correctly, but when I try to take the len of the first row it tells me:

IndexError: 0-d arrays can't be indexed

I simply want to get a numeric value for each row in the sparse matrix returned by fit_transform; getting the norm would be best. Any help here is very much appreciated.

What type of sparse matrix is it? – askewchan
Hi, thanks for responding. I edited my question to show the type. – Sterling
I can explain the last error: calling np.array(tfidf) makes a 0-d array whose single element is the entire sparse matrix, stored as an object. To get an ordinary array, use tfidf.toarray(); to get a matrix, use tfidf.todense(). – askewchan
Wonderful! That works, and I can use linalg.norm() on the rows. Now I just need to figure out how to keep each row's norm from being 1. I thought the entire tfidf matrix was being normalized, not each individual row... – Sterling
I figured out a way to get the norm from the sparse form using tfidf.multiply(tfidf).sum(1). See my answer for more details. – askewchan
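Putting the comments together, here is a minimal sketch of the fix they describe (assuming my_docs and my_vocab are defined as in the question; TfidfVectorizer's norm parameter defaults to 'l2', so passing norm=None keeps the raw row magnitudes):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# norm=None switches off the default per-row L2 normalization,
# so the row norms are no longer forced to 1
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       vocabulary=my_vocab, stop_words='english', norm=None)
tfidf = vect.fit_transform(my_docs)          # my_docs: the document collection from above

dense = tfidf.toarray()                      # ordinary 2-D ndarray, not a 0-d object array
row_norms = np.apply_along_axis(np.linalg.norm, 1, dense)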

3 Answers

10
votes

Some simple fake data:

import numpy as np
from scipy import sparse
a = np.arange(9.).reshape(3, 3)
s = sparse.csr_matrix(a)

To get the norm of each row of the sparse matrix, you can use:

np.sqrt(s.multiply(s).sum(1))

And the renormalized s would be

s.multiply(1/np.sqrt(s.multiply(s).sum(1)))

or, to keep the result sparse, convert the normalizing factor to a sparse matrix first:

s.multiply(sparse.csr_matrix(1/np.sqrt(s.multiply(s).sum(1))))

To get an ordinary matrix or array from it, use:

m = s.todense()
a = s.toarray()

If you have enough memory for the dense version, you can get the norm of each row with:

n = np.sqrt(np.einsum('ij,ij->i',a,a))

or

n = np.apply_along_axis(np.linalg.norm, 1, a)

To normalize, you can do

an = a / n[:, None]

or, to normalize the original array in place:

a /= n[:, None]

The [:, None] indexing basically turns n into a vertical (column) array, so that the division broadcasts across the rows.
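(As a quick sanity check, here is a sketch verifying that the sparse and dense routes above agree on the fake data, and that the normalized rows come out with unit norm:)

import numpy as np
from scipy import sparse

a = np.arange(9.).reshape(3, 3)
s = sparse.csr_matrix(a)

sparse_norms = np.sqrt(np.asarray(s.multiply(s).sum(1)).ravel())  # sparse route
dense_norms = np.sqrt(np.einsum('ij,ij->i', a, a))                # dense route
print(np.allclose(sparse_norms, dense_norms))                     # True

an = a / dense_norms[:, None]                                     # normalized copy
print(np.apply_along_axis(np.linalg.norm, 1, an))                 # [ 1.  1.  1.]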

6
votes

scipy.sparse is a great package, and it keeps getting better with every release, but a lot of things are still only half-cooked, and you can get big performance improvements if you implement some of the algorithms yourself. For instance, here is roughly a 7x improvement over @askewchan's implementation, which uses scipy's built-in functions:

In [18]: a = sps.rand(1000, 1000, format='csr')

In [19]: a
Out[19]: 
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

In [20]: %timeit a.multiply(a).sum(1)
1000 loops, best of 3: 288 us per loop

In [21]: %timeit np.add.reduceat(a.data * a.data, a.indptr[:-1])
10000 loops, best of 3: 36.8 us per loop

In [24]: np.allclose(a.multiply(a).sum(1).ravel(),
    ...:             np.add.reduceat(a.data * a.data, a.indptr[:-1]))
Out[24]: True

You can similarly normalize the sparse matrix in place by doing the following:

norm_rows = np.sqrt(np.add.reduceat(a.data * a.data, a.indptr[:-1]))  # L2 norm of each row
nnz_per_row = np.diff(a.indptr)                                        # stored entries per row
a.data /= np.repeat(norm_rows, nnz_per_row)                            # scale each stored value by its row's norm

If you are going to be using sparse matrices often, read the Wikipedia page on compressed sparse formats; you will often find better ways to do things than the defaults.
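(A self-contained sketch of the same idea, wrapped in a small helper; the name row_norms_csr is just chosen here for illustration:)

import numpy as np
import scipy.sparse as sps

def row_norms_csr(m):
    # L2 norm of each row of a CSR matrix, read straight off its .data and .indptr.
    # Assumes every row has at least one stored element, as in the timings above.
    return np.sqrt(np.add.reduceat(m.data * m.data, m.indptr[:-1]))

s = sps.csr_matrix(np.arange(9.).reshape(3, 3))           # same fake data as the first answer
print(row_norms_csr(s))                                   # [  2.23606798   7.07106781  12.20655562]
print(np.sqrt(np.asarray(s.multiply(s).sum(1)).ravel()))  # same values via the scipy route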

1
vote