0
votes

I have a large sparse matrix, in the scipy lil_matrix format the size is 281903x281903, it is an adjacency matrix https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

I need a reliable way to get N indexes that are zero. I can't just draw all zero indexes and then choose random ones, since that makes my computer run out of memory. Are there a way of identifying N random indexes without having to trawl through the entire data-structure?

I currently get 10% of the non zero indices the following way (Y is my sparse matrix):

percent = 0.1

oneIdx = Y.nonzero()
numberOfOnes = len(oneIdx[0])
maskLength = int(math.floor(numberOfOnes * percent))
idxOne = np.array(random.sample(range(0,numberOfOnes), maskLength))

maskOne = tuple(np.asarray(oneIdx)[:,idxOne])

I am looking for way to get a mask with the same length as the non zero mask, but with zeros...

1
There is a nonzero() method that returns all non-zero indices. Maybe you can sample the complement?hilberts_drinking_problem
That is what I am doing right now, it results in a memory error!NicolaiF
Maybe I've misread your question, I thought that you were drawing all of the zero indices, not the non-zero indices.hilberts_drinking_problem
Yes, I am trying to draw N random zero indices, but if I take the compliment of all the non zero indices (2312497) to get all the indices that contains zeros 281903 * 281903 = 79466988912 - 2312497 = 79.466.988.912. python just throws an out of memory errorNicolaiF
I added an answer. I did not mean explicitly constructing a complement of non-zero values, since it is likely to be inefficient and consume too much memory. But you can sample the non-zero elements implicitly. Given that about 0.003% of all your values are non-zero, simple rejection sampling is likely to work well.hilberts_drinking_problem

1 Answers

2
votes

Here is an approach based on rejection sampling. Based on the numbers in your example, an index chosen uniformly at random is likely to be zero, so this will be a relatively efficient approach.

from scipy import sparse

dims = (281903, 281903)

mat = sparse.lil_matrix(dims, dtype=np.int)

for _ in range(1000):
    x, y = np.random.randint(0, dims[0], 2)
    mat[x, y] = 1


def sample_zero_forever(mat):
    nonzero_or_sampled = set(zip(*mat.nonzero()))
    while True:
        t = tuple(np.random.randint(0, mat.shape[0], 2))
        if t not in nonzero_or_sampled:
            yield t
            nonzero_or_sampled.add(t)


def sample_zero_n(mat, n=100):
    itr = sample_zero_forever(mat)
    return [next(itr) for _ in range(n)]