I'm using scikit-learn (v0.15.2) for non-negative matrix factorization on a large sparse matrix (less than 1% of the values are non-zero). I want to find factors by minimizing error only on the non-zero values of the matrix (i.e., not computing error for entries that are zero), while also favoring sparsity. I'm not sure whether something is wrong with my approach. scikit-learn's NMF and ProjectedGradientNMF have worked well for me before, but as the matrix size grows the factorization becomes terribly slow.
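To make the "error only on non-zero values" part concrete, here is a small illustrative sketch of the masked error I'd like the factorization to minimize. As far as I know this is not sklearn's actual objective (its NMF also penalizes reconstruction of the zero entries); the mask below is only used for evaluation after the fit, and the matrix sizes are toy placeholders:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for the real sparse matrix (sizes are placeholders).
rng = np.random.RandomState(0)
X = rng.rand(20, 15)
X[X < 0.8] = 0.0  # zero out most entries to mimic sparsity

model = NMF(n_components=5, init='nndsvd', random_state=0, tol=0.01)
W = model.fit_transform(X)
H = model.components_

# RMSE computed only over the non-zero entries of X -- the error I care
# about, evaluated after the fact (sklearn itself fits on all entries).
mask = X > 0
masked_rmse = np.sqrt(np.mean((X[mask] - (W @ H)[mask]) ** 2))
```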
I'm talking about matrices with > 10^10 cells. For matrices with ~10^7 cells, I find the execution time to be good.
The parameters I've used are as follows: nmf_model = NMF(n_components=100, init='nndsvd', random_state=0, tol=0.01, sparseness='data').
When I tried slightly different parameters (changing to init='random'), I got the following warning, and after the warning the execution of the script halts:
/lib/python2.7/site-packages/sklearn/decomposition/nmf.py:252: UserWarning: Iteration limit reached in nls subproblem.
warnings.warn("Iteration limit reached in nls subproblem.")
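If it helps to see what I'd try next: I assume the inner NLS solver is hitting its iteration cap, so raising max_iter (and/or loosening tol) seems like the relevant knob, though I haven't confirmed that this avoids the halt. A toy-sized sketch of that change:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy data; the real matrix is far larger.
rng = np.random.RandomState(0)
X = rng.rand(30, 20)

# Assumption: max_iter is the knob behind the iteration-limit warning.
model = NMF(n_components=4, init='random', random_state=0,
            tol=0.01, max_iter=1000)
W = model.fit_transform(X)
```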
Is there a way to make this faster and solve the above problem? I've tried scipy sparse matrices (column- and row-sparse), but surprisingly, it was slower on the test I did with a smaller matrix (~10^7 cells).
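For reference, the smaller-matrix sparse test I ran looked roughly like this (a sketch; the sizes and density are placeholders, and I'm assuming NMF accepts scipy sparse input directly, which it appears to):

```python
import scipy.sparse as sp
from sklearn.decomposition import NMF

# Sparse non-negative matrix; scipy.sparse.random draws values in [0, 1).
X = sp.random(200, 100, density=0.01, format='csr', random_state=0)

model = NMF(n_components=10, init='random', random_state=0, tol=0.01)
W = model.fit_transform(X)  # fit directly on the CSR matrix
H = model.components_
```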
Considering that one would have to run many such factorizations (to choose an ideal number of factors and for k-fold cross-validation), a faster way to solve this problem is highly desirable.
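Concretely, the selection loop I have in mind is something like the following (illustrative only: the candidate ranks and matrix are placeholders, and reconstruction_err_ here is the full Frobenius error, not the masked error I ultimately want):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(50, 30)  # placeholder for the real matrix

errors = {}
for k in (2, 5, 10):  # candidate numbers of factors (placeholders)
    model = NMF(n_components=k, init='nndsvd', random_state=0, tol=0.01)
    model.fit(X)
    errors[k] = model.reconstruction_err_  # Frobenius reconstruction error
```

Every one of these fits is on the full matrix, which is exactly why the per-fit cost matters so much.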
I'm also open to suggestions for packages/tools that aren't based on scikit-learn or Python. I understand that questions about package/tool choices are discouraged, but for such a specific use case, knowing what techniques others in the field use would be very helpful.