5 votes

I have a very large, sparse matrix (531K x 315K); the total number of cells is ~167 billion. The non-zero values are all 1s, and there are only about 45K of them. Is there an efficient NMF package that can handle a problem of this size? I know of a couple of packages, but they only work well on small data matrices. Any idea helps. Thanks in advance.

1
Which lib did you try? What's your model? And how many components are you looking for? – sascha
I tried the scikit-learn lib and it failed. I also tried nipunbatra.github.io/blog/2017/nnmf-tensorflow.html, which failed as well due to an out-of-memory issue. The latent dimension is 50. – mgokhanbakal

1 Answer

8 votes

scikit-learn will handle this easily!

Code:

from time import perf_counter as pc
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import NMF

""" Create sparse data """
nnz_i, nnz_j, nnz_val = np.random.choice(531000, size=45000), \
                        np.random.choice(315000, size=45000), \
                        np.random.random(size=45000)
X = sps.csr_matrix((nnz_val, (nnz_i, nnz_j)), shape=(531000, 315000))
print('X-shape: ', X.shape, ' X nnzs: ', X.nnz)
print('type(X): ', type(X))
# <class 'scipy.sparse.csr.csr_matrix'>   <- sparse input is crucial!

""" NMF """
model = NMF(n_components=50, init='random', random_state=0, verbose=True)

start_time = pc()
W = model.fit_transform(X)
end_time = pc()

print('Used (secs): ', end_time - start_time)
print(model.reconstruction_err_)
print(model.n_iter_)

Output:

X-shape:  (531000, 315000)  X nnzs:  45000
type(X):  <class 'scipy.sparse.csr.csr_matrix'>
violation: 1.0
violation: 0.2318929397542804
violation: 0.11045394409727402
violation: 0.08104138988253409
...
violation: 9.659665625799714e-05
Converged at iteration 71
Used (secs):  247.94092973091756
122.27109041
70
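
One small addition to the code above (using only documented sklearn attributes): fit_transform returns just W, while the second factor H sits in model.components_; a single entry of the approximation can then be read off without ever forming the dense product W @ H:

H = model.components_       # shape (50, 315000); W has shape (531000, 50)
print('W-shape: ', W.shape, ' H-shape: ', H.shape)

# approximate one entry as a dot product of a row of W and a column of H
i, j = 1000, 2000           # arbitrary example indices
print('X[i, j] ~ ', W[i, :] @ H[:, j])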

Remarks:

  • Make sure you use sparse matrices as input, or you can't exploit sparsity (the quick footprint calculation after this list shows why)
  • I'm using version 0.19.1, so the multiplicative-update solver is used (>= 0.19)
    • But the older CD-based solver should handle this too!
  • The above run uses < 800 MB of memory
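
To put the first remark in numbers, a quick back-of-the-envelope footprint comparison (assuming float64 data and the 32-bit index arrays scipy picks by default for a matrix of this size):

n_rows, n_cols, nnz = 531000, 315000, 45000

# dense float64: 8 bytes for each of the ~167 billion cells
print('dense: ', n_rows * n_cols * 8 / 1e12, ' TB')                    # ~1.34 TB

# CSR: only the data, the column indices and the row pointers are stored
print('csr:   ', (nnz * 8 + nnz * 4 + (n_rows + 1) * 4) / 1e6, ' MB')  # ~2.7 MB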

Additional Constraints

As mentioned in the comments, the OP wants to add additional constraints, but has not yet specified them formally.

That will need a whole new implementation of some optimization procedure, including some theory footwork (depending on the constraints).

As an alternative, this could be solved by general-purpose convex-programming solvers, e.g. formulated with cvxpy and solved by SCS. Of course the alternating-minimization procedure is still needed (as the joint problem is non-convex), and it will scale worse than the specialized sklearn implementation, but it might work for the OP's data.
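
A minimal sketch of that alternating scheme on a toy problem (assuming cvxpy and SCS are installed; the sizes, seed and iteration count are placeholders, and any extra constraints would be added to the two subproblems):

import numpy as np
import cvxpy as cp

np.random.seed(0)
m, n, k = 100, 80, 5                        # toy dimensions, not OP's sizes
X = np.random.rand(m, n)

W = np.random.rand(m, k)                    # random nonnegative init
H = np.random.rand(k, n)

for it in range(10):                        # alternating minimization
    # fix W, solve the convex subproblem in H (extra constraints go here)
    H_var = cp.Variable((k, n), nonneg=True)
    cp.Problem(cp.Minimize(cp.norm(X - W @ H_var, 'fro'))).solve(solver=cp.SCS)
    H = H_var.value

    # fix H, solve the convex subproblem in W
    W_var = cp.Variable((m, k), nonneg=True)
    cp.Problem(cp.Minimize(cp.norm(X - W_var @ H, 'fro'))).solve(solver=cp.SCS)
    W = W_var.value

print('residual: ', np.linalg.norm(X - W @ H, 'fro'))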