K-Means clustering Hyperparameter Tuning

Question

I am trying to tune the hyperparameters of a Spatio-Temporal K-Means clustering step by using it in a pipeline with a Decision Tree classifier. The idea is to use the K-Means clustering algorithm to generate a cluster-distance space matrix and cluster labels, which are then passed to the Decision Tree classifier. Only the parameters of the K-Means algorithm should be tuned.
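
A minimal sketch of that idea with plain KMeans, assuming synthetic data in place of the real dataset (the random array and the fixed `random_state` values are placeholders for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: columns are time, x, y.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
distances = kmeans.fit_transform(X)  # cluster-distance space, shape (100, 5)
labels = kmeans.labels_              # one cluster label per row, shape (100,)

# The classifier learns to predict the cluster label from the distances.
dtc = DecisionTreeClassifier(random_state=0).fit(distances, labels)
print(distances.shape, labels.shape)
```

Here `fit_transform` returns the distance of every row to each of the 5 cluster centers, and `labels_` holds the assigned cluster per row.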

I am using Python 3.8 and sklearn 0.22.

The data I am interested in has 3 columns/attributes: 'time', 'x' and 'y' (x and y are spatial coordinates).

The code is:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array

class ST_KMeans(BaseEstimator, TransformerMixin):
# class ST_KMeans():
    """
    Note that K-means clustering algorithm is designed for Euclidean distances.
    It may stop converging with other distances, when the mean is no longer a
    best estimation for the cluster 'center'.

    The 'mean' minimizes squared differences (or, squared Euclidean distance).
    If you want a different distance function, you need to replace the mean with
    an appropriate center estimation.


    Parameters:

    k:  number of clusters

    eps1 : float, default=0.5
        The spatial density threshold (maximum spatial distance) between 
        two points to be considered related.

    eps2 : float, default=10
        The temporal threshold (maximum temporal distance) between two 
        points to be considered related.

    metric : string default='euclidean'
        The used distance metric - more options are
        ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’,
        ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’,
        ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘rogerstanimoto’, ‘sqeuclidean’,
        ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘yule’.

    n_jobs : int or None, default=1
        The number of processes to start; -1 means use all processors (BE AWARE)


    Attributes:

    labels : array, shape = [n_samples]
        Cluster labels for the data, one per sample
    """

    def __init__(self, k, eps1 = 0.5, eps2 = 10, metric = 'euclidean', n_jobs = 1):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        # self.min_samples = min_samples
        self.metric = metric
        self.n_jobs = n_jobs


    def fit(self, X):
        """
        Apply the ST K-Means algorithm 

        X : 2D numpy array. The first attribute of the array should be time attribute
            as float. The following positions in the array are treated as spatial
            coordinates.
            The structure should look like this [[time_step1, x, y], [time_step2, x, y]..]

            For example 2D dataset:
            array([[0,0.45,0.43],
            [0,0.54,0.34],...])


        Returns:

        self
        """

        # check if input is correct
        X = check_array(X)

        # type(X)
        # numpy.ndarray

        # Check the threshold arguments-
        if not self.eps1 > 0.0 or not self.eps2 > 0.0:
            raise ValueError('eps1 and eps2 must be positive')

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for 'time' and the spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        '''
        Filter the euclidean distance matrix using the time distance matrix. Where the time
        distance is at most 'eps2', the spatial distance is kept. Everywhere else it is
        replaced by 2 * 'eps1', a value larger than 'eps1', so that those pairs of points
        lie beyond the 'eps1' threshold.
        '''
        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)


        # Initialize K-Means clustering model-
        kmeans_clust_model = KMeans(
            n_clusters = self.k, init = 'k-means++',
            n_init = 10, max_iter = 300,
            precompute_distances = 'auto', algorithm = 'auto')

        # Train model-
        kmeans_clust_model.fit(dist)


        self.labels = kmeans_clust_model.labels_
        self.X_transformed = kmeans_clust_model.fit_transform(X)

        return self

    def transform(self, X):
        pass


# Initialize ST-K-Means object-
st_kmeans_algo = ST_KMeans(
    k = 5, eps1=0.6,
    eps2=9, metric='euclidean',
    n_jobs=1
    )

# Train on a chunk of dataset-
st_kmeans_algo.fit(data.loc[:500, ['time', 'x', 'y']])

# Get clustered data points labels-
kmeans_labels = st_kmeans_algo.labels

kmeans_labels.shape
# (501,)


# Get the cluster-distance space matrix computed by the trained model-
kmeans_transformed = st_kmeans_algo.X_transformed

kmeans_transformed.shape
# (501, 5)


dtc = DecisionTreeClassifier()

dtc.fit(kmeans_transformed, kmeans_labels)

y_pred = dtc.predict(kmeans_transformed)

# Get model performance metrics-
accuracy = accuracy_score(kmeans_labels, y_pred)
precision = precision_score(kmeans_labels, y_pred, average='macro')
recall = recall_score(kmeans_labels, y_pred, average='macro')

print("\nDT model metrics are:")
print("accuracy = {0:.4f}, precision = {1:.4f} & recall = {2:.4f}\n".format(
    accuracy, precision, recall
    ))

# DT model metrics are:
# accuracy = 1.0000, precision = 1.0000 & recall = 1.0000

However, when I try to perform hyperparameter tuning using sklearn's pipeline:

# Hyper-parameter Tuning:
# Define steps of pipeline-
pipeline_steps = [
    ('st_kmeans_algo' ,ST_KMeans(k = 5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier())
    ]

# Instantiate a pipeline-
pipeline = Pipeline(pipeline_steps)

# Train pipeline-
pipeline.fit(kmeans_transformed, kmeans_labels)

It gives me the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
      8 
      9 # Train pipeline-
---> 10 pipeline.fit(kmeans_transformed, kmeans_labels)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    348             This estimator
    349         """
--> 350         Xt, fit_params = self._fit(X, y, **fit_params)
    351         with _print_elapsed_time('Pipeline',
    352                                  self._log_message(len(self.steps) - 1)):

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    309             cloned_transformer = clone(transformer)
    310             # Fit or load from cache the current transformer
--> 311             X, fitted_transformer = fit_transform_one_cached(
    312                 cloned_transformer, X, y, None,
    313                 message_clsname='Pipeline',

~/.local/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

TypeError: fit() takes 2 positional arguments but 3 were given

Parthasarathy Subburaj · Accepted Answer · 2020-05-25T08:15:36

The fit method in your ST_KMeans takes in only X as input but in this line:

pipeline.fit(kmeans_transformed, kmeans_labels)

you pass both X and y to your pipeline. The pipeline then calls the fit method of its first stage, ST_KMeans, with these two arguments, which raises this error. To fix it, add a dummy parameter y to the fit method of your ST_KMeans class, as shown below:

def fit(self, X, y=None):

The additional parameter y is not used anywhere inside the method; it only keeps the signature consistent with what the pipeline expects. Giving it a default of None keeps direct calls such as fit(X) working.
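
As a rough sketch of how the pieces could fit together once the signature is fixed, the example below uses a hypothetical `KMeansDistances` class as a simplified stand-in for ST_KMeans (plain K-Means, without the eps1/eps2 filtering), together with synthetic data and made-up labels. Note that `transform` has to return an array, because the pipeline passes its output to the next step; a `transform` that only contains `pass` returns None and would fail at that point. The tuning part assumes `GridSearchCV`, which the question does not show, with the grid restricted to the clustering step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class KMeansDistances(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ST_KMeans: plain K-Means, no eps1/eps2 filter."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y=None):
        # y is accepted only because Pipeline calls fit(X, y); it is ignored.
        self.model_ = KMeans(n_clusters=self.k, n_init=10, random_state=0)
        self.model_.fit(X)
        return self

    def transform(self, X):
        # Return the cluster-distance space so the next step receives an array.
        return self.model_.transform(X)


# Synthetic (time, x, y) rows and made-up binary labels, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 1] > 0.5).astype(int)

pipeline = Pipeline([
    ("st_kmeans_algo", KMeansDistances(k=5)),
    ("dtc", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)  # no TypeError: fit now accepts y

# Tune only the clustering step; "<step name>__<parameter>" addresses it.
search = GridSearchCV(pipeline, {"st_kmeans_algo__k": [2, 4, 6]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the constructor stores `k` unchanged under the same attribute name, `BaseEstimator` can clone the step and set `st_kmeans_algo__k` during the search.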

Hope this helps!
