I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" among the professors at my university, based on a survey they had to answer. I've been studying some of the available options and came across the k-means clustering algorithm. Since most of my data is categorical, I had to perform one-hot encoding (turning each categorical variable into 0/1 indicator columns), and right after that I did a correlation analysis in Excel to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to check for the optimal number of clusters (elbow method) and then, finally, run k-means.
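For reference, the encoding and correlation-pruning steps described above could also be done in pandas instead of Excel. This is only a minimal sketch: the survey columns, their values, and the 0.95 threshold are made up for illustration and are not taken from the real survey.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; column names and values are invented.
survey = pd.DataFrame({
    "department": ["math", "physics", "math", "biology", "physics", "math"],
    "uses_lms": ["yes", "no", "yes", "yes", "no", "no"],
    "teaches_online": ["yes", "no", "yes", "yes", "no", "no"],  # same answers as uses_lms
})

# One-hot encode every categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(survey, dtype=int)

# Absolute pairwise correlations; keep only the upper triangle so that
# each pair of columns is looked at once.
corr = encoded.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the
# threshold (0.95 is an assumed cut-off, not a rule).
threshold = 0.95
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = encoded.drop(columns=redundant)

print("dropped:", redundant)
print(reduced.head())
```

Note that the two indicator columns of a yes/no question are always perfectly (negatively) correlated, so one of them is dropped by this kind of check.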

This is the code I used to import the .csv with the data from the professors survey and to run the elbow method:

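# imports used by the code in this post
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# loads the .csv into a DataFrame (df)
df = pd.read_csv('./dados_selecionados.csv', sep=",")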

# prints the df
print(df)

#list for the sum of squared distances
SQD = []

# maximum number of clusters to test in the elbow method
num_clusters = 10

# runs k-means for each cluster number
for k in range(1,num_clusters+1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)

# sets up the plot and show it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters+1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()

This is the plot for the elbow method

According to the figure, I decided to go with 3 clusters. After that I ran k-means with 3 clusters and wrote the cluster assignments to an .xlsx file with the following code:

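# runs k-means once and gets the cluster label of each row
# (fit_predict already fits the model, so a separate fit call is not needed)
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
clusters = kmeans.fit_predict(df)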

# prints the cluster label of each row
print(clusters)

# adds the cluster information as a column in the df
df['cluster'] = clusters

# saves the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")

# shows the resulting df
print(df)

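# shows the first rows of each separate cluster
# (loops over the distinct labels, not over every row's label)
for c in sorted(df['cluster'].unique()):
    print(df[df['cluster'] == c].head(10))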

My main question right now is how to perform a reasonable analysis of each cluster's data to understand how the clusters were formed. I started by computing the mean of each variable per cluster and using conditional formatting in Excel to see if some patterns would show up, and they kind of did, but I don't think this is the best option.
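The per-cluster means mentioned above can also be computed directly in pandas and compared against the overall means, which makes the variables that distinguish each cluster easier to spot. The sketch below is only an illustration: the data and column names are hypothetical stand-ins for the real one-hot survey columns.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical one-hot survey data standing in for the real df.
df = pd.DataFrame({
    "uses_lms":         [1, 1, 1, 0, 0, 0, 1, 0],
    "publishes_often":  [1, 1, 0, 0, 0, 1, 1, 0],
    "teaches_graduate": [0, 0, 0, 1, 1, 1, 0, 1],
})
features = df.columns.tolist()

# Cluster on the feature columns only, then attach the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[features])

# For 0/1 columns the mean is the share of rows in the cluster with a 1.
profile = df.groupby("cluster")[features].mean()

# How far each cluster's share is from the share in the whole sample;
# large positive or negative values point to the defining variables.
overall = df[features].mean()
deviation = profile - overall

# Cluster sizes, to check that no cluster is too small to interpret.
sizes = df["cluster"].value_counts().sort_index()

print(sizes)
print(deviation.round(2))
```

Sorting each row of `deviation` by absolute value gives a short list of the variables that best describe that cluster.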

And I'm also going to use this post to ask for any recommendations on the whole method. Maybe some of the steps I took were not the best.

2 Answers

2
votes

If you're using scikit-learn's KMeans class, there is a parameter called n_init, which is the number of times the k-means algorithm is run with different centroid seeds. In older releases it defaults to 10 (newer scikit-learn releases default to 'auto'), so with n_init=10 it does 10 different runs and returns the single result with the lowest sum of squared distances (inertia). Another parameter you could experiment with is random_state, which seeds the random initialization of the centroids. Setting it gives you reproducibility: because you choose the seed, if you see a good result you know which seed reproduces it.
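A small sketch of how these two parameters behave, using randomly generated 0/1 data as a hypothetical stand-in for the survey (the seed values 0 and 42 are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 0/1 data: 60 respondents, 6 indicator columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 6)).astype(float)

# Two fits with the same random_state give the same clustering.
a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
same_labels = np.array_equal(a.labels_, b.labels_)

# A single initialization, for comparison with the best of 10 runs.
single = KMeans(n_clusters=3, n_init=1, random_state=42).fit(X)

print("reproducible:", same_labels)
print("inertia, best of 10 runs:", a.inertia_)
print("inertia, single run:", single.inertia_)
```

Setting n_init explicitly also keeps the behavior the same across scikit-learn versions, whatever the default is.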

0
votes

You may want to consider testing several different clustering algos. Here is a list of some of the popular ones.

https://scikit-learn.org/stable/modules/clustering.html

I think there are over 100 different clustering algos out there now.

Also, some clustering algos will automatically select the optimal number of clusters for you, so you don't have to 'guess'. I say guess, because the silhouette and elbow techniques will help quantify the K number for you, but you, yourself, still need to do some kind of guess-work.
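As a rough illustration of the silhouette technique mentioned above, the following sketch scores several values of K on synthetic data. The three blobs are made up so that the right answer is known; on real survey data the scores are usually much less clear-cut, which is where the guess-work comes in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Mean silhouette score for each candidate K (it needs at least 2 clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest mean silhouette is the usual candidate.
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("best K by silhouette:", best_k)
```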