I have a dataset with 61 rows (users) and 26 columns, on which I apply clustering with k-means and other algorithms. As a first step, I ran k-means on the data after normalizing it and identified 10 clusters. I also want to visualize these clusters, which is why I use PCA to reduce the number of features.

Here is a sample of the data, followed by the code I have written:

UserID  Communication_dur   Lifestyle_dur   Music & Audio_dur   Others_dur  Personnalisation_dur    Phone_and_SMS_dur   Photography_dur Productivity_dur    Social_Media_dur    System_tools_dur    ... Music & Audio_Freq  Others_Freq Personnalisation_Freq   Phone_and_SMS_Freq  Photography_Freq    Productivity_Freq   Social_Media_Freq   System_tools_Freq   Video players & Editors_Freq    Weather_Freq
1   63  219 9   10  99  42  36  30  76  20  ... 2   1   11  5   3   3   9   1   4   8
2   9   0   0   6   78  0   32  4   15  3   ... 0   2   4   0   2   1   2   1   0   0


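import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# df is the DataFrame shown above
Sc = StandardScaler()
X = Sc.fit_transform(df)  # standardize each column to zero mean and unit variance
pca = PCA(3)  # keep the first 3 principal components
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X))
print(pca_data.head())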

This gives the following result:

   0  1  2
0  8 -4  5
1 -2 -2  1
2  1  1 -0
3  2 -1  1
4  3 -1 -3

I want to plot the clusters of my dataset using PCA and interpret the results. I am really new to this area, and any advice would be greatly appreciated!

Thanks in advance.

Do you want them in 3D or 2D? 2D would be easier, but right now you have 3D. - Frightera
I want 2D! I can change it to pca = PCA(2). - ab20225
Does this answer your question? How to plot clusters in python? - Flavia Giammarino
No, I can't find any solution! - ab20225

1 Answer

Using an example dataset:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

df, y = make_blobs(n_samples=70, centers=10,n_features=26,random_state=999,cluster_std=1)

Perform scaling, PCA and put the PC scores into a dataframe:

Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(2) 
pca_data = pd.DataFrame(pca.fit_transform(X),columns=['PC1','PC2']) 
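
Since the question also asks how to interpret the result, it may help to look at how much variance the two components capture and which features weigh most on each one. The following is only a sketch on the same synthetic blobs used here, not on the real data; the `feature_names` list is a hypothetical stand-in for the actual column names:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in data as in this answer (not the asker's real dataset)
data, _ = make_blobs(n_samples=70, centers=10, n_features=26,
                     random_state=999, cluster_std=1)
X = StandardScaler().fit_transform(data)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Fraction of the total variance captured by each component
explained = pca.explained_variance_ratio_
print("Explained variance ratio:", explained.round(3))

# Component weights (often loosely called loadings): one row per feature
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names
weights = pd.DataFrame(pca.components_.T, index=feature_names,
                       columns=["PC1", "PC2"])

# Features with the largest absolute weight on PC1
top_pc1 = weights["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

If the two ratios sum to a small fraction, the 2D plot hides most of the structure, so clusters that overlap on screen may still be well separated in the full 26-dimensional space.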

Run k-means, store the cluster labels in the data frame, and plot the result with seaborn:

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)  # fixed seed so the labels are reproducible
pca_data['cluster'] = pd.Categorical(kmeans.labels_)
sns.scatterplot(x="PC1",y="PC2",hue="cluster",data=pca_data)
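
It can also be useful to mark the cluster centroids on the same plot. Because k-means was fitted on the scaled 26-dimensional data, the centroids have to be projected with the same fitted PCA before plotting. A minimal sketch with matplotlib only, again on the synthetic blobs; the `n_init`, `random_state`, marker, and colormap choices are assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in data as in this answer
data, _ = make_blobs(n_samples=70, centers=10, n_features=26,
                     random_state=999, cluster_std=1)
X = StandardScaler().fit_transform(data)

pca = PCA(n_components=2)
pca_data = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
pca_data["cluster"] = kmeans.labels_

# Centroids live in the scaled feature space; project them with the fitted PCA
centroids_2d = pca.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(pca_data["PC1"], pca_data["PC2"], c=pca_data["cluster"],
           cmap="tab10", alpha=0.7)
ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1], marker="X", s=120,
           c="black", label="centroid")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` afterwards, depending on where the code runs.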

Or with matplotlib:

fig,ax = plt.subplots()
scatter = ax.scatter(pca_data['PC1'], pca_data['PC2'],c=pca_data['cluster'],cmap='Set3',alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper left", title="")
ax.add_artist(legend1)
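
The plot shows whether the clusters separate, but not what they mean. One possible way to interpret them, sketched here under the assumption that the original columns are meaningful on their own scale, is to compare the per-cluster mean of each unscaled feature. The column names below are hypothetical placeholders for the real ones:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same synthetic stand-in data as in this answer
data, _ = make_blobs(n_samples=70, centers=10, n_features=26,
                     random_state=999, cluster_std=1)
feature_names = [f"feature_{i}" for i in range(data.shape[1])]  # hypothetical names
df = pd.DataFrame(data, columns=feature_names)

X = StandardScaler().fit_transform(df)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Number of rows assigned to each cluster
sizes = pd.Series(labels).value_counts().sort_index()
print(sizes)

# Mean of each original (unscaled) feature within each cluster
profile = df.groupby(labels).mean()
print(profile.round(2).head())
```

Features whose means differ strongly between clusters are the ones that characterise them. With only 61 users and 10 clusters, some clusters will be very small, so these means should be read with caution.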
