Weka Simple K means handling nominal attributes

Question

I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes.

I read that it calculates modes for such attributes. I want to know how the similarity is calculated.

Let's take an example: consider a dataset with 3 numeric attributes and one nominal attribute. The nominal attribute has 3 values: A, B and C.

Instance1 has value A, Instance2 has value B and Instance3 has value A. In this case, Instance1 may be more similar to Instance3 (depending on the numeric attributes, of course). How will Simple K-means work in this case?

Follow-up: What if the nominal attribute has more possible values (say, 10)?

k-means IMHO only makes sense for continuous attributes. Anything else is a hack, and more often than not the results are only as good as random convex partitions. — Has QUIT--Anony-Mousse

Tjorriemorrie · Accepted Answer · 2015-02-18T08:58:02

You can try converting each nominal attribute to binary features, one per value, e.g. has_A, has_B, has_C. Then, if you scale the features, i1 and i3 will be closer to each other, because the mean of has_A is above 0.5 (in your example), and i2 will stand out more.

If the attribute has more values, you just add one binary feature for every possible value. Basically, you pivot each nominal attribute.
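To make the pivot concrete, here is a minimal sketch in plain Python (not Weka code). It encodes only the nominal attribute from the question's example and ignores the three numeric attributes. The helper names `one_hot` and `euclidean` and the labels i1, i2, i3 are made up for illustration.

```python
import math

def one_hot(value, categories):
    """Return one 0/1 indicator per category (has_A, has_B, has_C, ...)."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Nominal attribute from the question: values A, B, C.
categories = ["A", "B", "C"]
nominal = {"i1": "A", "i2": "B", "i3": "A"}

# Pivot the nominal attribute into binary features.
encoded = {name: one_hot(v, categories) for name, v in nominal.items()}

# Distances on the encoded attribute only (numeric attributes omitted).
d13 = euclidean(encoded["i1"], encoded["i3"])  # same value
d12 = euclidean(encoded["i1"], encoded["i2"])  # different values

# Column means, i.e. the centroid if all three instances shared a cluster.
centroid = [sum(col) / len(encoded) for col in zip(*encoded.values())]

print("d(i1, i3) =", d13)
print("d(i1, i2) =", d12)
print("centroid [has_A, has_B, has_C] =", centroid)
```

With this encoding, two instances that share a value contribute zero distance on the nominal attribute, and any two instances with different values contribute the same fixed amount. That holds whether the attribute has 3 or 10 values: more values only means more indicator columns.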
