
SVC appears to treat kernels that can take sparse matrices differently from those that can't. However, if a user-provided kernel is written to accept sparse matrices and a sparse matrix is passed to fit, SVC still converts the sparse matrix to dense and treats the kernel as dense, because the kernel is not one of the sparse kernels predefined in scikit-learn.

Is there a way to force SVC to recognize the kernel as sparse and not convert the sparse matrix to dense before passing it to the kernel?

Edit 1: minimal working example

As an example, if SVC is created with the string "linear" as the kernel, the built-in linear kernel is used: sparse matrices passed to fit go directly to the kernel, and the support vectors are stored as sparse matrices. If the linear_kernel function itself is passed to SVC instead, the sparse matrices are converted to ndarray before being passed to the kernel, and the support vectors are stored as ndarray.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC


def make_random_sparsemat(m, n=1024, p=.94):
    """Make mxn sparse matrix with 1-p probability of 1."""
    return csr_matrix(np.random.uniform(size=(m, n)) > p, dtype=np.float64)


X = make_random_sparsemat(100)
Y = np.asarray(np.random.uniform(size=(100)) > .5, dtype=np.float64)
model1 = SVC(kernel="linear")
model1.fit(X, Y)
print("Built-in kernel:")
print("Kernel treated as sparse: {}".format(model1._sparse))
print("Type of dual coefficients: {}".format(type(model1.dual_coef_)))
print("Type of support vectors: {}".format(type(model1.support_vectors_)))

model2 = SVC(kernel=linear_kernel)
model2.fit(X, Y)
print("User-provided kernel:")
print("Kernel treated as sparse: {}".format(model2._sparse))
print("Type of dual coefficients: {}".format(type(model2.dual_coef_)))
print("Type of support vectors: {}".format(type(model2.support_vectors_)))

Output:

Built-in kernel:
Kernel treated as sparse: True
Type of dual coefficients: <class 'scipy.sparse.csr.csr_matrix'>
Type of support vectors: <class 'scipy.sparse.csr.csr_matrix'>
User-provided kernel:
Kernel treated as sparse: False
Type of dual coefficients: <type 'numpy.ndarray'>
Type of support vectors: <type 'numpy.ndarray'>
Can you put some example code in for us to play with? - maxymoo
What's an example of a pre-defined sparse-capable kernel? I know about sparse matrices, but not much about SVC. - hpaulj
@maxymoo, I just added some code. - saxen
@hpaulj, the predefined sparse kernels are specified in sklearn.svm.SVC._sparse_kernels. The linear kernel, for example, is a sparse kernel, in that it can take sparse matrices as inputs and handle them directly without converting them to arrays, which would be less efficient. - saxen

1 Answer


I'm fishing around in the dark, working mainly from the scikit-learn code that I find on GitHub.

A lot of the SVC linear code appears to be in a C library. There is talk about its internal representation being sparse.

Your linear_kernel function just does:

X, Y = check_pairwise_arrays(X, Y)
return safe_sparse_dot(X, Y.T, dense_output=True)

If I make your X and Y:

In [119]: X
Out[119]: 
<100x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 6108 stored elements in Compressed Sparse Row format>
In [120]: Y = np.asarray(np.random.uniform(size=(100)) > .5, dtype=np.float64)

and call safe_sparse_dot directly:

In [122]: safe_sparse_dot(Y,X,dense_output=True)
Out[122]: array([ 3.,  5.,  3., ...,  4.,  2.,  4.])

So applying that to Y and X (in the only order that makes sense), I get a dense array. Changing the dense_output parameter doesn't change things, because X is sparse but Y is a plain ndarray. Basically, Y*X, a dense times a sparse, returns a dense.
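
To make that concrete, here is a rough sketch of the logic as I understand it. The function sketch_sparse_dot is my own hypothetical stand-in, not scikit-learn's source, and the names X, W and the seed are just illustrative. The assumption is that dense_output only has an effect when both operands are sparse; a mixed product is already dense.

```python
import numpy as np
from scipy import sparse


def sketch_sparse_dot(a, b, dense_output=False):
    """Hypothetical stand-in that mimics the behavior described above.

    Assumption: the result is only converted to dense explicitly when
    BOTH inputs are sparse and dense_output is True.
    """
    product = a @ b
    if sparse.issparse(a) and sparse.issparse(b) and dense_output:
        return product.toarray()
    return product


rng = np.random.default_rng(0)
# Sparse 100x1024 matrix, built the same way as in the question.
X = sparse.csr_matrix(rng.uniform(size=(100, 1024)) > 0.94, dtype=np.float64)
# A dense operand with a compatible shape.
W = rng.uniform(size=(1024, 3))

mixed = sketch_sparse_dot(X, W, dense_output=False)          # sparse @ dense
both_sparse = sketch_sparse_dot(X, X.T, dense_output=False)  # sparse @ sparse
densified = sketch_sparse_dot(X, X.T, dense_output=True)     # forced dense

print(type(mixed), type(both_sparse), type(densified))
```

With one dense operand the product comes back as an ndarray regardless of the flag; only the sparse-times-sparse case stays sparse unless dense_output=True asks for an array.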

If I make Y sparse, then I can get a sparse product:

In [125]: Ym=sparse.csr_matrix(Y)
In [126]: Ym*X
Out[126]: 
<1x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>
In [127]: safe_sparse_dot(Ym,X,dense_output=False)
Out[127]: 
<1x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>
In [128]: safe_sparse_dot(Ym,X,dense_output=True)
Out[128]: array([[ 3.,  5.,  3., ...,  4.,  2.,  4.]])

I don't know the workings of SVC and fit, but just from working with sparse matrices, I know that you have to be careful when mixing sparse and dense matrices. It is easy to get a dense result, whether you want it or not.
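
One possible workaround, which I have not verified against the SVC internals, is to skip the callable kernel entirely and use kernel="precomputed". You then compute the Gram matrix yourself from the sparse inputs, so X is never converted to dense; only the (n_samples x n_samples) Gram matrix is dense. This is a sketch under that assumption. The helper sparse_linear_gram and the variable names are hypothetical, and with a precomputed kernel the model does not store the support vectors themselves, so the sparse rows are recovered through the support_ indices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import SVC

rng = np.random.default_rng(0)


def make_random_sparsemat(m, n=1024, p=0.94):
    """Make an m x n sparse matrix with probability 1 - p of a 1 per entry."""
    return csr_matrix(rng.uniform(size=(m, n)) > p, dtype=np.float64)


def sparse_linear_gram(A, B):
    """Hypothetical user kernel: sparse inputs in, dense Gram matrix out."""
    return (A @ B.T).toarray()


X_train = make_random_sparsemat(100)
y_train = (rng.uniform(size=100) > 0.5).astype(int)
X_test = make_random_sparsemat(20)

model = SVC(kernel="precomputed")
# Fit on the (n_train, n_train) Gram matrix instead of on X itself.
model.fit(sparse_linear_gram(X_train, X_train), y_train)

# Prediction needs the kernel between test rows and ALL training rows.
pred = model.predict(sparse_linear_gram(X_test, X_train))

# Recover the support vectors as sparse rows via their indices.
support_rows = X_train[model.support_]
print(type(support_rows), support_rows.shape)
```

The trade-off is that you have to keep X_train around and build the test-versus-train kernel matrix yourself at prediction time.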