I'm struggling to plot bar charts for a KMeans-based clustering algorithm. I want to show the clusters so that the outlier cluster sits at the far end of the x-axis while the remaining clusters stay close to each other. I think the problem is the xticks, which are evenly spaced on the x-axis:

---|---|---|-----------------> x-axis
0  1   2   3 

In this context, I want to show that, e.g., the cluster labelled 3 lies farther away from the others based on its Score, which probably means adjusting the bar spacing, something like this:

---|---|--------------|------> x-axis
0  1   2              3 
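A minimal sketch of the spacing I have in mind, assuming the bars are drawn with Matplotlib's `ax.bar` at explicit x positions rather than at evenly spaced categorical ticks. The `counts` and `positions` values are hand-picked placeholders for illustration, not output of the algorithm:

```python
import matplotlib.pyplot as plt

# Illustrative values only: cluster labels, made-up sizes, and
# hand-picked x positions that push cluster 3 away from the rest.
labels = [0, 1, 2, 3]
counts = [13, 7, 4, 1]
positions = [0, 1, 2, 5]

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(positions, counts, width=0.8)

# Put one tick under each bar and label it with the cluster id,
# so the tick spacing follows the bar positions.
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_xlabel("Cluster labels", fontsize=18)
ax.set_ylabel("count", fontsize=16)
```

The positions could presumably be derived from the data instead, for example from the absolute mean Score of each cluster, but that is an assumption about how the distance should be measured.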

So far, this is what I get when plotting the results of the KMeans-based algorithm for outlier detection: img

import numpy as np
import pandas as pd
df = pd.DataFrame(data={'attr1':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15],
                        'Score':[-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,-0.505830,
                                 -0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.775851,-0.987357,-0.987357,-0.987357,-0.987357,-1.994758]
                        })
#df

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans_barplot(df, n_clusters):
    km = KMeans(init='k-means++', n_clusters=n_clusters)
    km_clustering = km.fit(df)
    
    #plot with seaborn
    plt.figure(figsize=(10,5))
    sns.countplot(x=km_clustering.labels_ , data=df.assign(hue=km_clustering.labels_), hue='hue')
    
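    # Plot with pandas/Matplotlib: one bar per distinct Score within each cluster.
    # plot.bar() opens its own figure, so the size is passed to it directly
    # instead of creating an extra (empty) figure first.
    df.groupby(km_clustering.labels_)['Score'].value_counts().unstack().plot.bar(figsize=(10, 5))
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')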
    plt.xlabel('Cluster labels', fontsize=18)
    plt.ylabel('count', fontsize=16)

kmeans_barplot(df, 3)

You can find my entire code, including this KMeans-based algorithm, in a Colab notebook for quick debugging. Feel free to implement your solution in the notebook or to comment on its cells. If changes are needed inside the ODKM algorithm itself (where the KMeans clustering is executed), it is scripted as the class ODKM. It may be better to extract the predicted cluster labels and add them as a new column, Cluster_label, next to the ODKM Score, so that the bar plots are easier to build.
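A rough sketch of the Cluster_label idea, assuming plain scikit-learn `KMeans` stands in for the ODKM class (which is not shown here). The small frame is a cut-down stand-in for the data above, and `n_init` and `random_state` are my own additions for reproducibility:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Cut-down stand-in for the DataFrame above (same columns, fewer rows).
df = pd.DataFrame({
    "attr1": [1, 1, 2, 2, 7, 7, 15],
    "attr2": [1, 1, 2, 2, 13, 14, 15],
    "Score": [-0.505830, -0.505830, -0.505830, -0.505830,
              -0.987357, -0.987357, -1.994758],
})

# Fit KMeans on all columns, as in kmeans_barplot above.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(df)

# Keep the predicted labels next to the Score as a regular column.
labeled = df.assign(Cluster_label=km.labels_)

# Rows: clusters; columns: distinct Score values; cells: row counts.
counts = (labeled.groupby("Cluster_label")["Score"]
                 .value_counts()
                 .unstack(fill_value=0))
```

With the labels stored as a column, `counts.plot.bar()` or a seaborn call with `x="Cluster_label"` can refer to the clusters by name instead of passing `km.labels_` around.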

The expected output should look like this (ideally, bars within the same cluster share one color, e.g. the 1st cluster, C1):

img