I am trying to find the 'best' value of 'k' for k-means clustering with a pipeline that applies a standard scaler, then a custom k-means transformer, and finally a Decision Tree classifier. I then want to run a Grid Search over this pipeline to get the best value of 'k'. I am using Python 3.7 and scikit-learn.

The code I have is as follows:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline

import numpy as np
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


class KMeansTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        # The purpose of 'self.model' is to contain the
        # underlying cluster model-
        self.model = KMeans(**kwargs)


    def fit(self, X):
        self.X = X
        self.model.fit(X)


    def transform(self, X):
        pred = self.model.predict(X)
        return np.hstack([self.X, pred.reshape(-1, 1)])


    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)


# Create features and target-
X, y = make_blobs(n_samples=100, n_features=2, centers=3)

# Get shape/dimension-
X.shape, y.shape
# ((100, 2), (100,))


# Create another pipeline using Decision Tree as classifier-
pipe_dt = Pipeline(
    [
        ('sc', StandardScaler()),
        ('kmt', KMeansTransformer()),
        ('dt_clf', DecisionTreeClassifier())
    ]
)

# Train defined pipeline-
pipe_dt.fit(X, y)

# Get accuracy score of pipeline-
pipe_dt.score(X, y)
# 1.0

# Make predictions using pipeline defined above-
y_pred_dt = pipe_dt.predict(X)


# Perform hyperparameter search/optimization using 'GridSearchCV'-
# Specify parameters to be hyper-tuned-
params = {
            'n_clusters': [2, 3, 5, 7]
            }

# Initialize GridSearchCV() object using 3-fold CV-
grid_kmt = GridSearchCV(param_grid=params, estimator=pipe_dt, cv = 3)

# Perform GridSearchCV on training data-
grid_kmt.fit(X, y)

When I use 'grid_kmt.fit(X, y)' it gives me the following error:

ValueError: Invalid parameter n_clusters for estimator Pipeline(memory=None, steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kmt', KMeansTransformer()), ('dt_clf', DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best'))], verbose=False). Check the list of available parameters with estimator.get_params().keys().

However, when I initialize an object for custom kmeans-

# Initialize a new clustering object-
km = KMeansTransformer(n_clusters=3, init = 'k-means++')

# Get the list of available parameters-
km.get_params().keys()
# dict_keys([])

Then why am I getting a 'ValueError'? 'n_clusters' happens to be in the list of available parameters for custom clustering object.

Thanks!

1 Answer

Looking closely at the error message:

ValueError: Invalid parameter n_clusters for estimator Pipeline [...]

it's clear that GridSearchCV looks for a parameter named n_clusters on the pipeline itself (not on its components), cannot find one, and raises an error. To reach the n_clusters parameter of your ('kmt', KMeansTransformer()) step, prefix the parameter name with the step name and two underscores, that is, use

params = {
            'kmt__n_clusters': [2, 3, 5, 7]  # two underscores
            }

provided of course that your own KMeansTransformer does accept a parameter n_clusters.
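
On that last point: the km.get_params().keys() call in the question returns dict_keys([]), which suggests the transformer as written does not expose n_clusters. scikit-learn reads parameter names from the explicit keyword arguments of __init__ and ignores **kwargs. Below is a minimal sketch of one way to restructure the class, not the only one. It assumes you only need to tune n_clusters (random_state is added just for reproducibility), it builds the KMeans model inside fit() rather than in __init__, and it stacks the cluster label onto the X passed to transform() instead of the stored training data, since the stored array would have the wrong number of rows during cross-validation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class KMeansTransformer(TransformerMixin, BaseEstimator):
    """Append the k-means cluster label to the features as an extra column."""

    def __init__(self, n_clusters=8, random_state=None):
        # Explicit keyword arguments, stored under the same names, are what
        # get_params()/set_params() rely on; **kwargs would hide them.
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # Build the model here so that values set by GridSearchCV
        # (through set_params) are actually used.
        self.model_ = KMeans(
            n_clusters=self.n_clusters,
            n_init=10,
            random_state=self.random_state,
        )
        self.model_.fit(X)
        return self  # returning self lets the inherited fit_transform work

    def transform(self, X):
        labels = self.model_.predict(X)
        # Stack onto the X passed in, not onto the training data.
        return np.hstack([X, labels.reshape(-1, 1)])


X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

pipe_dt = Pipeline([
    ('sc', StandardScaler()),
    ('kmt', KMeansTransformer(random_state=0)),
    ('dt_clf', DecisionTreeClassifier(random_state=0)),
])

# Nested names follow the pattern <step name>__<parameter name>.
print([name for name in pipe_dt.get_params() if name.startswith('kmt__')])

params = {'kmt__n_clusters': [2, 3, 5, 7]}
grid_kmt = GridSearchCV(estimator=pipe_dt, param_grid=params, cv=3)
grid_kmt.fit(X, y)
print(grid_kmt.best_params_)
```

With the parameter declared explicitly, 'kmt__n_clusters' shows up in pipe_dt.get_params(), which is the list the error message tells you to check.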