1
votes

Does anyone have experience creating a sparse matrix whose non-zero values follow a uniform distribution on [-0.5, 0.5] and have zero mean (zero-centered) in Python (e.g. using scipy.sparse)?

I am aware that the scipy.sparse package provides a few methods for creating random sparse matrices, like 'rand' and 'random'. However, I could not achieve what I want with those methods. For example, I tried:

import numpy as np
import scipy.sparse as sp

s = np.random.uniform(-0.5,0.5)
W=sp.random(1024, 1024, density=0.01, format='csc', data_rvs=s)

To specify my idea: if I wanted the above-mentioned matrix to be non-sparse (dense), I would create it with:

dense=np.random.rand(1024,1024)-0.5

'np.random.rand(1024,1024)' creates a dense matrix with values uniform in [0, 1). To make it zero mean, I centre the matrix by subtracting 0.5.

However, if I create a sparse matrix, say:

sparse=sp.rand(1024,1024,density=0.01, format='csc')

The matrix will have non-zero values uniformly distributed in [0, 1]. However, to centre the matrix I cannot simply do 'sparse-=0.5', because that would turn all the originally zero entries into non-zero values.

So, how can I achieve for a sparse matrix what the dense example above does?

Thank you for all of your help!

3
Try to be more precise. Uniform in -0.5, 0.5 will not be sparse. - sascha
@sascha I want a sparse matrix where the non-zero values follow a -0.5, 0.5 uniform distribution. - Dong Dong
Then just sample #nnz/2 values, position them; then do the same with those values multiplied by -1? - sascha
@sascha Sorry, I am new to the field. What does #nnz/2 mean? Thank you. - Dong Dong
Half of the expected number of nonzeros. It was a simple construction (which might not be what you want; but I'm still trying to infer what you really want, because it still sounds vague). - sascha

3 Answers

1
votes

The data_rvs parameter is expecting a "callable" that takes a size. This isn't exactly obvious from the documentation. This can be done with a lambda as follows:

import numpy as np
import scipy.sparse as sp

W = sp.random(1024, 1024, density=0.01, format='csc', 
              data_rvs=lambda s: np.random.uniform(-0.5, 0.5, size=s))

Then print(W) gives:

  (243, 0)  -0.171300809713
  (315, 0)  0.0739590145626
  (400, 0)  0.188151369316
  (440, 0)  -0.187384896218
    :   :
  (1016, 0) 0.29262088084
  (156, 1)  -0.149881296136
  (166, 1)  -0.490405135834
  (191, 1)  0.188167190147
  (212, 1)  0.0334533020488
  : :
  (411, 1)  0.122330200832
  (431, 1)  -0.0494334160833
  (813, 1)  -0.0076379249885
  (828, 1)  0.462807265425
  : :
  (840, 1021)   0.456423017883
  (12, 1022)    -0.47313075329
   :    :
  (563, 1022)   -0.477190349161
  (655, 1022)   -0.460942546313
  (673, 1022)   0.0930207181126
  (676, 1022)   0.253643616387
   :    :
  (843, 1023)   0.463793903168
  (860, 1023)   0.454427252782

For the newbie, the lambda may look odd - this is just an unnamed function. The sp.random function takes an optional argument data_rvs that defaults to None. When specified, it is expected to be a function that takes a size argument and returns that number of random numbers. A simple function to do this would be:

def generate_n_uniform_randoms(n):
    # Return n samples drawn uniformly from [-0.5, 0.5)
    return np.random.uniform(-0.5, 0.5, n)

I don't know the origin of the API, but the shape is not needed as sp.random presumably first figures out which indices will be non-zero, and then it just needs to compute random values for those indices, which is a set of a known size.
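
One way to check this reading of the API is to pass a sampler that records the size it is asked for. This is only a sketch: recording_sampler, requested_sizes and the seeded rng are illustrative names, not part of SciPy.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
requested_sizes = []            # hypothetical helper list: sizes sp.random asks for

def recording_sampler(size):
    # Record the requested count, then return that many values from U(-0.5, 0.5)
    requested_sizes.append(size)
    return rng.uniform(-0.5, 0.5, size=size)

W = sp.random(1024, 1024, density=0.01, format='csc',
              data_rvs=recording_sampler)

# Compare what the sampler was asked for with the number of stored entries
print(requested_sizes, W.nnz)
```

If the two printed numbers agree, the sampler only ever sees a count and never the matrix shape.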

The lambda is just syntactic sugar that allows us to define that function inline in terms of some other function call. We could instead write

W = sp.random(1024, 1024, density=0.01, format='csc', 
              data_rvs=generate_n_uniform_randoms)

Actually, this can be a "callable" - some object f for which f(n) returns n random variables. This can be a function, but it can also be an object of a class that implements the __call__(self, n) function. For example:

class ufoo(object):

    def __call__(self, n):
        import numpy
        return numpy.random.uniform(-0.5, 0.5, n)

W = sp.random(1024, 1024, density=0.01, format='csc', 
              data_rvs=ufoo())

If you need the mean to be exactly zero (within roundoff of course), this can be done by subtracting the mean from the non-zero values, as I mentioned above:

W.data -= np.mean(W.data)

Then:

idx = W.nonzero()  # assumed definition: (row, col) indices of the non-zero entries
W[idx].mean()

-2.3718641632430623e-18

1
votes

sparse.random does 2 things - distributes nonzeros randomly, and generates random uniform values.

In [62]: M = sparse.random(10,10,density=.2, format='csr')
In [63]: M
Out[63]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [64]: M.data
Out[64]: 
array([ 0.42825407,  0.51858978,  0.8084335 ,  0.08691635,  0.13210409,
        0.61288928,  0.39675205,  0.58242891,  0.5174367 ,  0.57859824,
        0.48812484,  0.13472883,  0.82992478,  0.70568697,  0.45001632,
        0.52147305,  0.72943809,  0.55801913,  0.97018861,  0.83236235])

You can modify the data values cheaply without changing the sparsity distribution:

In [65]: M.data -= 0.5
In [66]: M.toarray()
Out[66]: 
array([[ 0.        ,  0.        ,  0.        , -0.07174593,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.01858978,  0.        ,  0.        ,  0.3084335 , -0.41308365,
         0.        ,  0.        ,  0.        ,  0.        , -0.36789591],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.11288928,
        -0.10324795,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.08242891,  0.0174367 ,  0.        ],
       [ 0.        ,  0.        ,  0.07859824,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -0.01187516,  0.        ,  0.        , -0.36527117],
       [ 0.        ,  0.        ,  0.32992478,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.20568697,
         0.        ,  0.        , -0.04998368,  0.        ,  0.        ],
       [ 0.02147305,  0.        ,  0.22943809,  0.05801913,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.47018861,  0.33236235,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
In [67]: np.mean(M.data)
Out[67]: 0.044118297661574338

Or replace the nonzero values with a new set of values:

In [69]: M.data = np.random.randint(-5,5,20)
In [70]: M
Out[70]: 
<10x10 sparse matrix of type '<class 'numpy.int32'>'
    with 20 stored elements in Compressed Sparse Row format>
In [71]: M.toarray()
Out[71]: 
array([[ 0,  0,  0,  4,  0,  0,  0,  0,  0,  0],
       [-1,  0,  0,  1,  2,  0,  0,  0,  0, -4],
       [ 0,  0,  0,  0,  0,  4,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, -5, -5,  0],
       [ 0,  0,  2,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, -3,  0,  0,  3],
       [ 0,  0, -1,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0, -4,  0,  0, -1,  0,  0],
       [-1,  0, -5, -2,  0,  0,  0,  0,  0,  0],
       [ 0,  3,  1,  0,  0,  0,  0,  0,  0,  0]])
In [72]: M.data
Out[72]: 
array([ 4, -1,  1,  2, -4,  0,  4, -5, -5,  2, -3,  3, -1, -4, -1, -1, -5,
       -2,  3,  1])
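
The same replacement idea can be applied directly to the question's target distribution. The following is a minimal sketch, assuming a CSR matrix whose data attribute can be reassigned as in the session above; the variable names and the seeded rng are illustrative.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)  # seeded only for reproducibility
M = sparse.random(10, 10, density=0.2, format='csr')

# Remember the sparsity pattern before touching the values
indices_before = M.indices.copy()
indptr_before = M.indptr.copy()

# Overwrite the stored values; the pattern arrays are left alone
M.data = rng.uniform(-0.5, 0.5, size=M.nnz)

print(np.array_equal(M.indices, indices_before))
print(np.array_equal(M.indptr, indptr_before))
print(M.data.min(), M.data.max())
```

Only the stored values change, so the zeros of the matrix stay zeros.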
0
votes

In my opinion, your requirements are still incomplete (see disadvantage mentioned below).

Here is an implementation of the simple construction outlined in my comment above:

import numpy as np
import scipy.sparse as sp

M, N, NNZ = 5, 5, 10
assert NNZ % 2 == 0
flat_dim = M*N

valuesA = np.random.uniform(-0.5, 0.5, size=NNZ // 2)
valuesB = valuesA * -1
values = np.hstack((valuesA, valuesB))
positions_flat = np.random.choice(flat_dim, size=NNZ, replace=False)
positions_2d = np.unravel_index(positions_flat, (M, N))  # shape passed positionally; the old dims= keyword was removed from NumPy
mat = sp.coo_matrix((values, (positions_2d[0], positions_2d[1])), shape=(M, N))
print(mat.todense())
print(mat.data.mean())

Output:

[[ 0.          0.          0.          0.0273862   0.        ]
 [-0.3943963   0.          0.         -0.04134932  0.        ]
 [-0.10121743  0.         -0.0273862   0.          0.04134932]
 [ 0.3943963   0.          0.          0.          0.        ]
 [-0.24680983  0.          0.24680983  0.10121743  0.        ]]
0.0

Advantages

  • sparse
  • zero mean
  • entries from uniform distribution

Potential disadvantage:

  • for each value x in the matrix, -x also appears somewhere!
    • meaning: it's not uniform in a broader joint-distribution sense
    • whether that hurts, only you can tell
    • if yes: the above construction could easily be modified to use any centered values from some distribution, so your problem collapses into this somewhat smaller (but not necessarily much easier) problem

Now, regarding that linked problem: I'm guessing here, but I would not be surprised if sampling x values uniformly under the constraint mean(x)=0 turned out to be NP-hard.

Keep in mind that a-posteriori centering of the nonzeros, as recommended in the other answer, changes the underlying distribution (even for simple distributions). In some cases it even invalidates the bounds (values leave the interval -0.5, 0.5).
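
A tiny illustration of the bounds issue, using hand-picked values rather than a random sample so that the effect is guaranteed to show (the numbers are chosen purely for illustration):

```python
import numpy as np

# Hand-picked values inside [-0.5, 0.5], chosen for illustration only
data = np.array([-0.49, 0.1, 0.2, 0.3])

centered = data - data.mean()  # a-posteriori centering of the non-zeros

print(centered.mean())  # zero up to round-off
print(centered.min())   # below -0.5, so the original bound no longer holds
```

With a large random sample the shift is typically much smaller, but nothing prevents it from pushing values across the bound.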

This means: this question is all about formalizing how important each objective is and balancing them in some way.