0
votes

I ran simple k-means clustering in Weka.

After the clustering, this result is shown:

Number of iterations: 9

Within cluster sum of squared errors: 570.1974952009115

my questions:

  1. The sum of squared errors is huge. Does this mean my number of clusters is wrong? And how do I determine the optimal number of clusters?

  2. How do I split the data into training and test sets to evaluate the performance? And how do I know the right percentage?

  3. How do I measure the SSB?


2 Answers

1
votes

1.1 In k-means it's you who decides how many clusters to pick. You probably know this already.

1.2 In k-means there is no optimal number of clusters as in "global maximum of a function graph". You decide with respect to your business problem. See also "elbow method" for a semi-empirical procedure that seldom works in practice.

1.3 You might have outliers in your data, which make the sum of squares large for any clustering. The outliers are always far away from your cluster centers, no matter how many clusters you pick.

2.1 There is no "optimal" percentage split.

2.2 You could use visualization to check if there is any overlap in the clusters. It's also more understandable for your audience to see the "decision boundaries".

3.1 What is SSB?
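If SSB means the between-cluster sum of squares (an assumption, since the question does not define it), it can be computed from the cluster labels alone. The sketch below uses only NumPy and made-up data; the function name `sum_of_squares_decomposition` is hypothetical. SSW here is the same kind of quantity as the "within cluster sum of squared errors" in the question, and the identity SST = SSW + SSB holds when each centroid is the mean of its cluster.

```python
import numpy as np

def sum_of_squares_decomposition(X, labels):
    """Return (SSW, SSB, SST) for points X and integer cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        ssw += ((members - centroid) ** 2).sum()
        ssb += len(members) * ((centroid - grand_mean) ** 2).sum()
    sst = ((X - grand_mean) ** 2).sum()  # total sum of squares
    return ssw, ssb, sst

# Made-up data: two tight blobs of 50 points each, with known labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels = np.repeat([0, 1], 50)

ssw, ssb, sst = sum_of_squares_decomposition(X, labels)
print(f"SSW={ssw:.2f}  SSB={ssb:.2f}  SST={sst:.2f}  SSB/SST={ssb / sst:.3f}")
```

The ratio SSB/SST lies between 0 and 1 and does not depend on the number of rows or the units of the features, so it is easier to judge than a raw SSE such as 570.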

0
votes

I'm attaching code for the Elbow method in case anyone wants to do a quick test.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv("data.csv")
X = data.select_dtypes(np.number)  # Keep numeric columns only; not needed if all your data is numerical

wcss = []  # within-cluster sum of squares, one value per k

# Needs at least 49 rows, since k cannot exceed the number of samples
for i in range(1, 50):
    model = KMeans(n_clusters=i, init='k-means++',
                   max_iter=300, n_init=10, random_state=0)
    model.fit(X)
    wcss.append(model.inertia_)

plt.figure(figsize=(10, 7))
plt.plot(range(1, 50), wcss)
plt.title("Elbow Method")
plt.xlabel("No. of clusters")
plt.ylabel("WCSS")
plt.show()

If you have time and patience, you can add an outer loop over random_state and record the WCSS for each run.
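A minimal sketch of that outer loop might look like this. It runs on synthetic data from `make_blobs` as a stand-in for `data.csv`, and the smaller range of k and the five seeds are arbitrary choices to keep it quick; swap in your own numeric matrix for `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption); replace with your own numeric features
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)   # candidate numbers of clusters
seeds = range(5)    # random states to loop over
wcss = np.zeros((len(seeds), len(ks)))

for s_idx, seed in enumerate(seeds):
    for k_idx, k in enumerate(ks):
        model = KMeans(n_clusters=k, init="k-means++",
                       max_iter=300, n_init=10, random_state=seed)
        model.fit(X)
        wcss[s_idx, k_idx] = model.inertia_

# Summarise the spread across seeds for each k
mean_wcss = wcss.mean(axis=0)
std_wcss = wcss.std(axis=0)
for k, m, s in zip(ks, mean_wcss, std_wcss):
    print(f"k={k:2d}  mean WCSS={m:10.2f}  std={s:8.2f}")
```

Plotting `mean_wcss` against `ks` gives the same elbow curve as before, and a large `std_wcss` at some k suggests the result there depends heavily on initialisation.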