I am trying to run K-means clustering on the following data set:

Name,Gender,Age,Drinks,Country
John,M,30,Pepsi,US
Jack,M,25,Coke,US
David,M,34,Pepsi,UK
Ted,M,37,Limca,CAN
Robert,M,23,Limca,US
Adrian,M,31,Pepsi,US
Craig,M,37,Coke,UK
Katie,F,23,Limca,UK
Nancy,F,32,Pepsi,UK

I want to cluster the data based on Drinks (Pepsi, Coke, Limca), and I am able to do that. But I also want to retrieve the names alongside the clustered data.

The output I am getting is:

0
1
2 
Limca belongs to cluster:0
Coke belongs to cluster:0
etc.

Here I want to get the names as well.

When converting to a sequence file, I take the drink as the key and the rest of the text as the value, convert that to a sparse vector, and then run K-means clustering. The names are not printed. Can anybody point out how I can extract the names, which are in the values, from the clusters?

2 Answers

0
votes

You may need to convert {Pepsi, Coke, Pepsi, Limca} to numeric codes such as {1001, 1002, 1001, 1003}, and then convert the codes back to the original values afterwards.
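A minimal Python sketch of that round trip, outside of Hadoop. The 1001-based numbering simply mirrors the example above, and the names `code_of` and `drink_of` are made up for illustration:

```python
drinks = ["Pepsi", "Coke", "Pepsi", "Limca"]

# Assign each distinct drink an integer code in order of first appearance.
# The starting value 1001 is arbitrary (an assumption taken from the example).
code_of = {}
for drink in drinks:
    if drink not in code_of:
        code_of[drink] = 1001 + len(code_of)

# Forward mapping: labels to codes.
encoded = [code_of[d] for d in drinks]

# Reverse mapping: codes back to the original labels.
drink_of = {code: drink for drink, code in code_of.items()}
decoded = [drink_of[c] for c in encoded]

print(encoded)
print(decoded)
```

Keeping the reverse dictionary around is what makes it possible to turn codes back into drink names after processing.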

But as mentioned in the other answer, simply grouping by drink may not be a clustering job; it is just an SQL query. If your problem is more complex than grouping, you can try the approach I described in the paragraph above.

0
votes

K-Means operates on vector spaces.

It absolutely needs to be able to compute means.

But what is the mean value of {Pepsi, Coke, Pepsi, Limca}?
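To make the problem concrete, here is a small Python sketch. It assumes the drinks have been given arbitrary integer codes (the `codes` mapping is purely illustrative) and computes the mean a K-means centroid update would produce:

```python
from statistics import mean

# Hypothetical integer codes for the drinks; the numbers carry no meaning.
codes = {"Pepsi": 1001, "Coke": 1002, "Limca": 1003}
sample = ["Pepsi", "Coke", "Pepsi", "Limca"]

# K-means would average the encoded values to update a centroid.
centroid = mean(codes[d] for d in sample)

print(centroid)
# The averaged value matches no drink's code, so it cannot be decoded.
print(centroid in codes.values())
```

The resulting number sits between the codes for Pepsi and Coke, which says nothing about the drinks themselves; a different choice of codes would give a different "mean".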

Sorry, you are trying to use a hammer, but you don't have a nail!

If you want to group data by their drink, this is not a clustering task.

Maybe try a Hadoop-based SQL system, because apparently you want to perform a classic SQL operation: GROUP BY Drinks.
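As a rough illustration of what such a GROUP BY amounts to, here is a plain Python sketch (not Hadoop) over the data from the question. It collects the names for each drink; the variable names are made up for this example:

```python
import csv
import io
from collections import defaultdict

# The data set from the question, inlined so the sketch is self-contained.
DATA = """Name,Gender,Age,Drinks,Country
John,M,30,Pepsi,US
Jack,M,25,Coke,US
David,M,34,Pepsi,UK
Ted,M,37,Limca,CAN
Robert,M,23,Limca,US
Adrian,M,31,Pepsi,US
Craig,M,37,Coke,UK
Katie,F,23,Limca,UK
Nancy,F,32,Pepsi,UK
"""

# Group names by the Drinks column, the equivalent of GROUP BY Drinks.
names_by_drink = defaultdict(list)
for row in csv.DictReader(io.StringIO(DATA)):
    names_by_drink[row["Drinks"]].append(row["Name"])

for drink, names in names_by_drink.items():
    print(f"{drink}: {', '.join(names)}")
```

Each group keeps the names alongside the drink, which is the output the question asks for, with no vectors or centroids involved.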

Oh, and your question is off-topic for Stack Overflow. You are using Hadoop, but you are not posing a programming question!