Finding N random zero elements from a scipy sparse matrix

Question

I have a large sparse matrix, in the scipy lil_matrix format the size is 281903x281903, it is an adjacency matrix https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

I need a reliable way to get N indexes that are zero. I can't just draw all zero indexes and then choose random ones, since that makes my computer run out of memory. Are there a way of identifying N random indexes without having to trawl through the entire data-structure?

I currently get 10% of the non zero indices the following way (Y is my sparse matrix):

percent = 0.1

oneIdx = Y.nonzero()
numberOfOnes = len(oneIdx[0])
maskLength = int(math.floor(numberOfOnes * percent))
idxOne = np.array(random.sample(range(0,numberOfOnes), maskLength))

maskOne = tuple(np.asarray(oneIdx)[:,idxOne])

I am looking for way to get a mask with the same length as the non zero mask, but with zeros...

There is a nonzero() method that returns all non-zero indices. Maybe you can sample the complement? — hilberts_drinking_problem
That is what I am doing right now, it results in a memory error! — NicolaiF
Maybe I've misread your question, I thought that you were drawing all of the zero indices, not the non-zero indices. — hilberts_drinking_problem
Yes, I am trying to draw N random zero indices, but if I take the compliment of all the non zero indices (2312497) to get all the indices that contains zeros 281903 * 281903 = 79466988912 - 2312497 = 79.466.988.912. python just throws an out of memory error — NicolaiF
I added an answer. I did not mean explicitly constructing a complement of non-zero values, since it is likely to be inefficient and consume too much memory. But you can sample the non-zero elements implicitly. Given that about 0.003% of all your values are non-zero, simple rejection sampling is likely to work well. — hilberts_drinking_problem

hilberts_drinking_problem hilberts_drinking_problem · Accepted Answer · 2018-06-03T11:32:08

Here is an approach based on rejection sampling. Based on the numbers in your example, an index chosen uniformly at random is likely to be zero, so this will be a relatively efficient approach.

from scipy import sparse

dims = (281903, 281903)

mat = sparse.lil_matrix(dims, dtype=np.int)

for _ in range(1000):
    x, y = np.random.randint(0, dims[0], 2)
    mat[x, y] = 1


def sample_zero_forever(mat):
    nonzero_or_sampled = set(zip(*mat.nonzero()))
    while True:
        t = tuple(np.random.randint(0, mat.shape[0], 2))
        if t not in nonzero_or_sampled:
            yield t
            nonzero_or_sampled.add(t)


def sample_zero_n(mat, n=100):
    itr = sample_zero_forever(mat)
    return [next(itr) for _ in range(n)]

Finding N random zero elements from a scipy sparse matrix

1 Answers