
Say I have a DataFrame df with 20 columns and 10K rows. Since the data has a wide range of values, I use the following code to normalize it:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # scales each column to zero mean and unit variance
df_scaled = scaler.fit_transform(df)

df_scaled now contains both negative and positive values. If I now pass this normalized data to SpectralClustering as follows,

from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(n_clusters=k,
                              affinity='nearest_neighbors',
                              random_state=cluster_seed)
clusters = spectral.fit_predict(df_scaled)

I get the cluster labels.

Here is what confuses me: the official doc says that "Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."

Questions: Do the negative values in df_scaled affect the clustering result? Or does it depend on the affinity computation I am using, e.g. precomputed or rbf? If so, how can I pass the normalized input values to SpectralClustering? My understanding is that normalizing can improve the clustering results and speed up computation. I would appreciate any help or tips on how to approach the problem.


1 Answer


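You are passing a data matrix, not a precomputed affinity matrix. The sentence you quote is about the values of the affinity matrix, not about the input features, so negative values in df_scaled are not a problem by themselves.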
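For contrast, here is a minimal sketch of the case where the quoted warning does apply: building the affinity matrix yourself and passing it with affinity='precomputed'. The synthetic data from make_blobs, the gamma value, and the variable names are assumptions for illustration, not part of the question. The RBF kernel is used here because its values exp(-gamma * ||x - y||^2) are non-negative even when the scaled features are negative.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20-column data in the question (assumption).
X, _ = make_blobs(n_samples=200, n_features=20, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # contains negative values

# Pairwise RBF similarities; gamma=0.05 is an arbitrary illustrative choice.
S = rbf_kernel(X_scaled, gamma=0.05)

# With affinity='precomputed', the matrix itself must hold similarity scores.
model_pre = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
labels_pre = model_pre.fit_predict(S)

print("negative inputs present:", bool((X_scaled < 0).any()))
print("affinity range:", float(S.min()), float(S.max()))
```

If you instead passed something like a raw dot-product matrix of the scaled data as a precomputed affinity, it could contain negative entries, and that is the situation the documentation warns about.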

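With affinity='nearest_neighbors', the affinity matrix is a k-nearest-neighbors connectivity graph built from your data. That is a binary kernel, which is non-negative regardless of the sign of the input features.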
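One way to check this yourself is to fit the model and inspect its affinity_matrix_ attribute. This is a minimal sketch on synthetic data; make_blobs, the cluster count, and n_neighbors=10 are assumptions chosen only for illustration.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20-column data in the question (assumption).
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

model = SpectralClustering(
    n_clusters=3,
    affinity="nearest_neighbors",
    n_neighbors=10,  # illustrative choice
    random_state=0,
)
labels = model.fit_predict(X_scaled)

# The fitted affinity matrix is a sparse matrix derived from the neighbor graph.
affinity = model.affinity_matrix_
has_negative_inputs = bool((X_scaled < 0).any())
min_affinity = float(affinity.min())

print("negative inputs present:", has_negative_inputs)
print("smallest affinity value:", min_affinity)
```

scikit-learn may warn that the neighbor graph is not fully connected when the clusters are well separated; that concerns the graph structure, not the sign of the features.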

To better understand the inner workings, please have a look at the source code.