8
votes

I need to perform kernel PCA on a dataset of shape (5000, 26421) to get a lower-dimensional representation. To choose the number of components (say k), I reduce the data, reconstruct it in the original space, and compute the mean squared error between the reconstructed and original data for different values of k.

I came across sklearn's grid search functionality (GridSearchCV) and want to use it for this parameter estimation. Since kernel PCA has no score function, I have implemented a custom scoring function and am passing it to GridSearchCV.

from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
import numpy as np
import math

def scorer(clf, X):
    Y1 = clf.inverse_transform(X)
    error = math.sqrt(np.mean((X - Y1)**2))
    return error

param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30)
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer)
clf.fit(X)

However, it results in the below error:

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2.,  2.,  1., ...,  0.,  0.,  0.],
    ....,  0.,  1., ...,  0.,  0.,  0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719,  0.00919842, ....        0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
    117                              "for %d indexed." %
    118                              (X.shape[0], X.shape[1], Y.shape[0]))
    119     elif X.shape[1] != Y.shape[1]:
    120         raise ValueError("Incompatible dimension for X and Y matrices: "
    121                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122                              X.shape[1], Y.shape[1]))
        X.shape = (1667, 26421)
        Y.shape = (112, 100)
    123 
    124     return X, Y
    125 
    126 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100

Can someone point out what exactly I am doing wrong?

1
First, PCA has a score() function. Second, use make_scorer() to pass the custom score function to GridSearchCV. - Vivek Kumar
I am not using PCA in this case but rather kernel PCA, which has no score function. I also tried using the make_scorer function, but that approach doesn't work. - user1683894
I am facing this exact challenge. Did you figure it out? - MikeB2019x

1 Answer

10
votes

The signature of the scoring function is incorrect. It only needs to take the true and predicted values. This is how you declare your custom scoring function:

import math
import numpy as np

def my_scorer(y_true, y_predicted):
    # Root mean squared error between the true and predicted values.
    error = math.sqrt(np.mean((y_true - y_predicted)**2))
    return error

Then you can use the make_scorer function in sklearn to pass it to GridSearchCV. Be sure to set the greater_is_better parameter accordingly:

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

I am assuming you are calculating an error, so this parameter should be set to False, since the lower the error, the better:

from sklearn.metrics import make_scorer
my_func = make_scorer(my_scorer, greater_is_better=False)

Then you pass it to GridSearchCV:

GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)

Where my_clf is your classifier.
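
To make the sign flip concrete, here is a minimal sketch. The toy data, the LinearRegression model, and the names X_toy, y_toy, raw_error and scored are assumptions chosen for illustration; they are not from the question. Note that a scorer built with make_scorer calls the estimator's predict method by default, which is why the sketch uses a supervised estimator.

```python
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer


def my_scorer(y_true, y_predicted):
    # Root mean squared error: lower is better.
    return math.sqrt(np.mean((y_true - y_predicted) ** 2))


# greater_is_better=False tells sklearn this is a loss, so the scorer negates it.
my_func = make_scorer(my_scorer, greater_is_better=False)

# Hypothetical toy data: a quadratic target that a straight line cannot fit exactly.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 4.0, 9.0])
model = LinearRegression().fit(X_toy, y_toy)

raw_error = my_scorer(y_toy, model.predict(X_toy))  # positive RMSE
scored = my_func(model, X_toy, y_toy)               # same magnitude, negated

print(raw_error, scored)
```

The scorer object is called as my_func(estimator, X, y), and the value it returns is the negated error, so a grid search that maximizes the score ends up minimizing the error.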

One more thing: I don't think GridSearchCV is exactly what you are looking for. It evaluates each parameter setting on train and test splits, but here you only want to transform your input data. You need to use Pipeline in sklearn. Look at the scikit-learn example that combines PCA and GridSearchCV in a Pipeline.
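
If you still want to grid search the reconstruction error directly, one possible alternative is sketched below. It is not the approach described above, only an illustration under stated assumptions. KernelPCA has no predict method, so instead of make_scorer the sketch passes a plain callable with the signature (estimator, X, y), which GridSearchCV also accepts for scoring. The callable transforms the held-out data before calling inverse_transform, so the shapes match, and returns the negative RMSE because higher scores are treated as better. The name reconstruction_scorer, the synthetic X_demo, and the small parameter grid are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV


def reconstruction_scorer(estimator, X, y=None):
    """Return the negative RMSE between X and its kernel PCA reconstruction."""
    # Project into the reduced space first, then map back to the input space.
    X_reduced = estimator.transform(X)
    X_restored = estimator.inverse_transform(X_reduced)
    rmse = np.sqrt(np.mean((X - X_restored) ** 2))
    # GridSearchCV maximizes the score, so negate the error.
    return -rmse


# Hypothetical small stand-in for the real (5000, 26421) matrix.
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 8)

# Illustrative grid only; the values are not tuned for any real dataset.
param_grid = [
    {"kernel": ["poly"], "degree": [1, 2], "n_components": [2, 4]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0], "n_components": [2, 4]},
]

search = GridSearchCV(
    estimator=KernelPCA(fit_inverse_transform=True),
    param_grid=param_grid,
    scoring=reconstruction_scorer,
    cv=3,
)
search.fit(X_demo)  # no y is needed for this unsupervised search

print(search.best_params_, search.best_score_)
```

Keep in mind that reconstruction error generally favors larger n_components, so the score on its own may not settle the choice of k; it is probably best read together with the cost of the extra components.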