Hierarchical agglomerative clustering

Question

Can we use hierarchical agglomerative clustering to cluster data in this format?

"beirut,proff,email1"
"beirut,proff,email2"
"swiss,aproff,email1"
"france,instrc,email2"
"swiss,instrc,email2"
"beirut,proff,email1"
"swiss,instrc,email2"
"france,aproff,email2"

If not, which clustering algorithm is suitable for data with string values?

Thank you for your help!

Sneftel · Accepted Answer · 2014-05-11T15:20:13

Any type of clustering requires a distance metric. If all you're willing to do with your strings is treat them as equal or not equal to each other, the best you can really do is the field-wise Hamming distance, i.e. the number of fields in which two records differ. For example, the distance between "abc,def,ghi" and "uvw,xyz,ghi" is 2, and the distance between "abc,def,ghi" and "abw,dez,ghi" is also 2, because a partial match within a field counts for nothing. If you want to cluster similar strings within a particular field (say, grouping "Slovakia" and "Slovenia" because their names are similar, or "Poland" and "Ukraine" because they border each other), you'll need a more complex metric. Given a distance metric, hierarchical agglomerative clustering should work fine.
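
To make that concrete, here is a minimal pure-Python sketch of the field-wise Hamming distance plugged into a naive agglomerative loop. The function names (`fieldwise_hamming`, `average_linkage`, `agglomerate`) and the choice of average linkage are my own assumptions for illustration, not something the question specifies. For real data you would probably hand a precomputed distance matrix to a library instead.

```python
from itertools import combinations


def fieldwise_hamming(a, b):
    """Number of comma-separated fields in which two records differ."""
    fields_a, fields_b = a.split(","), b.split(",")
    if len(fields_a) != len(fields_b):
        raise ValueError("records must have the same number of fields")
    return sum(x != y for x, y in zip(fields_a, fields_b))


def average_linkage(c1, c2):
    """Mean pairwise distance between two clusters (an assumed linkage choice)."""
    total = sum(fieldwise_hamming(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(records, n_clusters):
    """Naive agglomerative clustering: start with one cluster per record and
    repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[r] for r in records]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


records = [
    "beirut,proff,email1",
    "beirut,proff,email2",
    "swiss,aproff,email1",
    "france,instrc,email2",
    "swiss,instrc,email2",
    "beirut,proff,email1",
    "swiss,instrc,email2",
    "france,aproff,email2",
]

for cluster in agglomerate(records, 3):
    print(cluster)
```

With only three fields the distance can take just four values (0 to 3), so many merges are ties and the result depends on how ties are broken. That is part of why this data is an awkward fit for clustering.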

All this assumes, however, that clustering is what you actually want to do. Your dataset seems like sort of an odd use-case for clustering.
