Let us say I have a DataFrame of 20 columns and 10K rows. Since the data has a wide range of values, I use the following code to standardize it:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled now contains both negative and positive values.
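A quick check confirms this (a minimal sketch, assuming df contains only numeric columns):

import numpy as np

# StandardScaler centers each column to mean 0 with unit variance,
# so negative entries are expected in the output.
print(df_scaled.min(), df_scaled.max())
print(np.allclose(df_scaled.mean(axis=0), 0))  # True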
Now, if I pass this scaled data to SpectralClustering as follows:
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(n_clusters=k,
                              n_init=30,
                              affinity='nearest_neighbors',
                              random_state=cluster_seed,
                              assign_labels='kmeans')
clusters = spectral.fit_predict(df_scaled)
I get the cluster labels.
Here is what confuses me: the official documentation says, "Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."
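As a sanity check, here is a minimal sketch of how one can inspect the affinity matrix the fitted estimator actually built (using the spectral object from above); my understanding is that with 'nearest_neighbors' this matrix is a symmetrized k-nearest-neighbors connectivity graph:

# affinity_matrix_ is the graph the algorithm clusters on; with
# 'nearest_neighbors' it is a sparse, symmetrized connectivity matrix,
# so its entries are non-negative even though df_scaled has negative values.
print((spectral.affinity_matrix_.data >= 0).all())  # True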
Questions:

- Do the negative values in df_scaled affect the clustering result?
- Or does it depend on the affinity computation I am using, e.g. precomputed or rbf (see my tentative check just below)? If so, how can I use the scaled input values with SpectralClustering?
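For the rbf case, my tentative understanding is that exp(-gamma * ||x - y||^2) always lies in (0, 1], so negative feature values cannot produce negative similarities. A minimal sketch using sklearn.metrics.pairwise.rbf_kernel (which SpectralClustering applies internally for affinity='rbf'):

from sklearn.metrics.pairwise import rbf_kernel

# The RBF kernel maps any real-valued inputs into (0, 1],
# so the resulting similarity matrix is non-negative by construction.
K = rbf_kernel(df_scaled)  # default gamma = 1.0 / n_features
print(K.min() >= 0)  # True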
My understanding is that scaling can improve the clustering results and speed up the computation.
I would appreciate any help or tips on how I can approach this problem.