I'm struggling to plot bar charts for a KMeans-based clustering algorithm. I want to show the clusters so that the outlier cluster sits far out at the end of the x-axis while the remaining clusters stay relatively close to each other. I think the problem is the xticks, which are equally spaced along the x-axis:
---|---|---|-----------------> x-axis
0  1   2   3
In this context, I want to show that, e.g., the cluster labelled 3 lies farther away based on its Score, which probably requires adjusting the tick spacing or bin width, maybe like this:
---|---|--------------|------> x-axis
0  1   2              3
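To make the idea concrete, here is a minimal sketch of the spacing I have in mind, not a working solution for my real pipeline. It assumes each cluster can be summarised by one representative Score, and it uses the four distinct Score values and their counts from my example data as stand-ins for clusters. The helper score_based_positions is a hypothetical name I made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def score_based_positions(cluster_scores):
    """Hypothetical helper: turn one representative Score per cluster into x positions.

    The cluster with the highest Score is placed at x = 0; every other
    cluster is placed at its Score distance from that one.
    """
    scores = np.asarray(cluster_scores, dtype=float)
    return scores.max() - scores


# Assumption: distinct Score values from the example data stand in for clusters.
cluster_scores = [-0.505830, -0.775851, -0.987357, -1.994758]
counts = [13, 7, 4, 1]  # number of rows carrying each Score in the example data

x = score_based_positions(cluster_scores)

fig, ax = plt.subplots(figsize=(10, 5))
# Bars are drawn at the computed positions instead of at 0, 1, 2, 3.
ax.bar(x, counts, width=0.15, color=["C0", "C1", "C2", "C3"])
ax.set_xticks(x)
ax.set_xticklabels(range(len(x)))
ax.set_xlabel("Cluster labels")
ax.set_ylabel("count")
fig.savefig("score_spaced_bars.png")
```

With these positions the first three bars stay close together and the last one lands far to the right, which is the layout drawn above.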
So far, this is the code I use to show the results of the KM-based algorithm for outlier detection:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'attr1':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15],
                        'Score':[-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,
                                 -0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.987357,-0.987357,-0.987357,-0.987357,-1.994758]
                        })
#df
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def kmeans_barplot(df, n_clusters):
    km = KMeans(init='k-means++', n_clusters=n_clusters)
    km_clustering = km.fit(df)
    #plot with seaborn
    plt.figure(figsize=(10,5))
    sns.countplot(x=km_clustering.labels_ , data=df.assign(hue=km_clustering.labels_), hue='hue')
    #plot with Matplotlib
    plt.figure(figsize=(20,10))
    df.groupby(km_clustering.labels_)['Score'].value_counts().unstack().plot.bar()
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.gcf().set_size_inches(10, 5)
    plt.xlabel('Cluster labels', fontsize=18)
    plt.ylabel('count', fontsize=16)

kmeans_barplot(df, 3)
You can find my entire code, including this KM-based algorithm, in a Colab notebook for quick debugging. Please feel free to implement your solution in the notebook or to comment on its cells. If changes are needed within the ODKM algorithm itself (where the KM clustering is executed), it is available there as the class ODKM. Maybe it would be better to extract the predicted cluster labels and add them as a new column, Cluster_label, next to the ODKM algorithm's Score column, for easier access when drawing the bar plots.
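A minimal sketch of that last idea, using plain scikit-learn KMeans on the example data rather than my ODKM class. The fixed random_state and n_init values are assumptions added only to make the run repeatable:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same example data as above, written compactly.
df = pd.DataFrame({
    "attr1": [1] * 4 + [2] * 8 + [3] + [5] * 2 + [6] * 2 + [7] * 7 + [15],
    "attr2": [1] * 4 + [2] * 8 + [3] + [5] * 2 + [6] * 2 + [7] * 3 + [13] * 3 + [14, 15],
    "Score": [-0.505830] * 13 + [-0.775851] * 7 + [-0.987357] * 4 + [-1.994758],
})

# Fit on the original columns only, before the label column exists.
features = df[["attr1", "attr2", "Score"]]
km = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(features)

# Store the predicted labels next to Score for easier grouping and plotting.
df["Cluster_label"] = km.labels_

# Count how often each Score occurs inside each cluster (rows: clusters, columns: Scores).
counts = (
    df.groupby("Cluster_label")["Score"]
      .value_counts()
      .unstack(fill_value=0)
)
print(counts)
```

The resulting counts table can be passed to counts.plot.bar() in the same way as in my function above.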
The expected output should look like this (ideally, bins within the same cluster share the same color, e.g. the 1st cluster C1):