machine learning - Python Non negative Matrix Factorization that handles both zeros and missing data?

Question

I look for a NMF implementation that has a python interface, and handles both missing data and zeros.

I don't want to impute my missing values before starting the factorization, I want them to be ignored in the minimized function.

It seems that neither scikit-learn, nor nimfa, nor graphlab, nor mahout propose such an option.

Thanks!

Have you tried the implementation in scikit learn already? What problems does it give you? scikit-learn.org/stable/modules/generated/… — emiguevara
I mean, do you have problems because of imputing the missing values? why you would not want to do it is beyond my understanding. In general, if you do not impute missing values, then the vector is not valid and must be discarded from the computation. — emiguevara
Let's take the classic example of user x movies ratings matrix. For sure, the users will have rated only a small percentage of the movies, so there is a lot of missing values in the input matrix X. So what you want to do, is to guess the matrix factors (WH = X) by factorizing the matrix only from the available ratings, and then estimate the missing ones with the W and H you obtained. But I found a way of adding this to the current projected gradient implementation of scikit-learn, I will propose a pull request soon. — Théo T
Here is a very good explanation of this for general matrix factorization (without the non negativity constraint): quuxlabs.com/blog/2010/09/… — Théo T
Very nice write up, thanks. It's not python, but there is a toolbox for Matlab with all the bells and whistles: sites.google.com/site/nmftool — emiguevara

Tautvydas Tautvydas · Accepted Answer · 2014-07-06T14:04:59

Using this Matlab to python code conversion sheet I was able to rewrite NMF from Matlab toolbox library.
I had to decompose a 40k X 1k matrix with sparsity of 0.7%. Using 500 latent features my machine took 20 minutes for 100 iteration.

Here is the method:

import numpy as np
from scipy import linalg
from numpy import dot

def nmf(X, latent_features, max_iter=100, error_limit=1e-6, fit_error_limit=1e-6):
    """
    Decompose X to A*Y
    """
    eps = 1e-5
    print 'Starting NMF decomposition with {} latent features and {} iterations.'.format(latent_features, max_iter)
    X = X.toarray()  # I am passing in a scipy sparse matrix

    # mask
    mask = np.sign(X)

    # initial matrices. A is random [0,1] and Y is A\X.
    rows, columns = X.shape
    A = np.random.rand(rows, latent_features)
    A = np.maximum(A, eps)

    Y = linalg.lstsq(A, X)[0]
    Y = np.maximum(Y, eps)

    masked_X = mask * X
    X_est_prev = dot(A, Y)
    for i in range(1, max_iter + 1):
        # ===== updates =====
        # Matlab: A=A.*(((W.*X)*Y')./((W.*(A*Y))*Y'));
        top = dot(masked_X, Y.T)
        bottom = (dot((mask * dot(A, Y)), Y.T)) + eps
        A *= top / bottom

        A = np.maximum(A, eps)
        # print 'A',  np.round(A, 2)

        # Matlab: Y=Y.*((A'*(W.*X))./(A'*(W.*(A*Y))));
        top = dot(A.T, masked_X)
        bottom = dot(A.T, mask * dot(A, Y)) + eps
        Y *= top / bottom
        Y = np.maximum(Y, eps)
        # print 'Y', np.round(Y, 2)


        # ==== evaluation ====
        if i % 5 == 0 or i == 1 or i == max_iter:
            print 'Iteration {}:'.format(i),
            X_est = dot(A, Y)
            err = mask * (X_est_prev - X_est)
            fit_residual = np.sqrt(np.sum(err ** 2))
            X_est_prev = X_est

            curRes = linalg.norm(mask * (X - X_est), ord='fro')
            print 'fit residual', np.round(fit_residual, 4),
            print 'total residual', np.round(curRes, 4)
            if curRes < error_limit or fit_residual < fit_error_limit:
                break

return A, Y

Here I was using Scipy sparse matrix as input and missing values were converted to 0 using toarray() method. Therefore, the mask was created using numpy.sign() function. However, if you have nan values you could get same results by using numpy.isnan() function.

machine learning - Python Non negative Matrix Factorization that handles both zeros and missing data?

2 Answers

Goal: Our goal is given a matrix A, decompose it into two non-negative factors, as follows:

Overview

Step 1: Learning H, given A and W

Handling missing entries in A

Step 2: Learning W, given A and H

Code example

Imports

Creating matrix to be factorised

Masking a few entries

Defining matrices W and H

Defining the cost that we want to minimise

Alternating NNLS procedure