I'm wondering whether there are any good methods for selecting variables for the k-means algorithm. I am trying to do market segmentation with this algorithm and have a dataset with dozens of potential variables. I would like results that are fairly easy to interpret, so I should limit the number of variables to 5-6 at most. I am particularly interested in solutions that can be implemented in SPSS Statistics or Weka. Also, is there a method or algorithm for finding the optimal number of variables for clustering (i.e. how many of the 'good' variables should I use)?
0
votes
In k-means clustering, k is a user-defined parameter, so k-means fits best when you already have some idea of the number of clusters. If you have no idea how many clusters to expect, use DBSCAN rather than k-means. Density-based clustering (DBSCAN) locates regions of high density that are separated from one another by regions of low density. You don't have to specify the number of clusters; the algorithm finds them by itself.
– Irshad Bhat
Thank you for trying to help me, but unfortunately this does not answer my question. I want to determine which variables (e.g. income, expenditures on various categories of goods) to use in k-means clustering, that is, which variables are the best fit for the analysis. I also want to know whether, once I have an initial set of good variables, I can drop some of them to get a more meaningful clustering (e.g. I have 5 variables, but 2 of them 'force the algorithm' to create clusters with very small counts; is there a way to find the variables that 'mess' with the algorithm?).
– Amade
2 Answers
0
votes
Try factor analysis; it should help. The number of factors to retain is the number of factors with an eigenvalue >= 1. After deciding on the number of factors, use the fa() function (available, for example, in R's psych package) to obtain the loadings, then decide which variables to keep and which to discard. This also helps remove highly collinear variables.
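The eigenvalue rule above can be tried out in a few lines. The following is only a minimal sketch in Python with NumPy, not the fa() routine itself: it uses the eigen-decomposition of the correlation matrix (a principal-component approximation to factor loadings), and the data and variable names are synthetic and hypothetical.

```python
import numpy as np

def kaiser_loadings(X):
    """Eigenvalues (descending), loadings of retained components, and their count."""
    corr = np.corrcoef(X, rowvar=False)          # variables are columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = int(np.sum(eigvals >= 1.0))         # eigenvalue >= 1 rule
    # Loading = eigenvector scaled by sqrt(eigenvalue): the correlation
    # between each variable and each retained component.
    loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
    return eigvals, loadings, n_keep

# Synthetic data (assumption): two hidden drivers, three noisy variables each.
rng = np.random.default_rng(0)
n = 500
wealth = rng.normal(size=n)
lifestyle = rng.normal(size=n)
names = ["income", "savings", "home_value",
         "travel_spend", "dining_spend", "leisure_spend"]  # hypothetical names
X = np.column_stack(
    [wealth + 0.4 * rng.normal(size=n) for _ in range(3)]
    + [lifestyle + 0.8 * rng.normal(size=n) for _ in range(3)]
)

eigvals, loadings, n_keep = kaiser_loadings(X)
# One candidate variable per component: the one with the largest absolute loading.
representatives = [names[int(np.argmax(np.abs(loadings[:, j])))]
                   for j in range(n_keep)]
print("eigenvalues:", np.round(eigvals, 2))
print("components kept:", n_keep)
print("candidate variables:", representatives)
```

Keeping one high-loading variable per retained component is one way to cut a long variable list down to a handful of interpretable, weakly correlated inputs for k-means; whether that choice suits a given segmentation is a judgment call.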
0
votes
To get better results with k-means, consider checking the standard deviation of the numeric features in the raw data: a wider spread can make objects easier to separate. Along with factor analysis (FA), one can use principal component analysis (PCA) to find which features carry most of the variance in the data, and keep the features that are strongly expressed in the resulting components.
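As a rough illustration of both checks, here is a sketch in Python with NumPy on synthetic data; the feature names and numbers are made up. One assumption worth stating: raw standard deviations depend on each feature's units, so the sketch only uses them to drop zero-spread features and then standardizes the rest before PCA.

```python
import numpy as np

# Synthetic data (assumption): hypothetical feature names and scales.
rng = np.random.default_rng(42)
n = 300
names = np.array(["income", "food_spend", "travel_spend", "flat_fee"])
income = rng.normal(50_000, 15_000, n)
food_spend = 0.1 * income + rng.normal(0, 500, n)   # tracks income closely
travel_spend = rng.normal(2_000, 800, n)            # independent of income
flat_fee = np.full(n, 10.0)                         # no spread at all
X = np.column_stack([income, food_spend, travel_spend, flat_fee])

# Step 1: spread check. A feature with zero standard deviation cannot
# separate any objects, so it is dropped before clustering.
raw_std = X.std(axis=0)
keep = raw_std > 0
kept_names = [str(s) for s in names[keep]]

# Step 2: standardize the remaining features (z-scores) so that PCA is not
# dominated by whichever feature has the largest units.
Z = (X[:, keep] - X[:, keep].mean(axis=0)) / raw_std[keep]

# Step 3: PCA via SVD. Rows of Vt are the components; squared singular
# values are proportional to the variance each component explains.
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Features most strongly expressed in the first component.
top_idx = np.argsort(np.abs(Vt[0]))[::-1][:2]
top_pc1 = [kept_names[i] for i in top_idx]

print("raw std:", dict(zip(names.tolist(), np.round(raw_std, 1).tolist())))
print("kept:", kept_names)
print("explained variance ratio:", np.round(explained, 3))
print("strongest features in PC1:", top_pc1)
```

In this toy setup income and food_spend load on the same component, which suggests they carry largely the same information and that one of them may be enough as a k-means input.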