
I'm an Apache Mahout newbie. I'm trying to find out which of my named vectors belong to which cluster. Most resources on the internet deal with text documents and use the clusterdump command. However, my dataset is very large, and running that command always fails with a Java out-of-memory error. Besides, I don't think that clusterdump would answer my question.

Is it possible to find out just which named vectors belong to which clusters, using only the directories clusteredPoints, clusters-[0-9]+ and clusters-*-final?

If it helps: so far, I have formed clusters of users based on their song listening habits. To do this, I first created a sequence file of NamedVectors, where the name of each NamedVector is the userId and the Vector itself is a double array containing the weights of the tags of the songs the user listened to (an example is below).

    AR2TSU61187FB5C4F0 0.5 0.2 0.7 0.0 0.0 0.1 0.0 0.0 ...
    ...
    ...
    ...
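
To make the input format concrete, here is a minimal sketch of how one such line could be split into a user id and a weight array. It uses plain Java only; the class `UserTagLine` and its `parse` method are hypothetical names for illustration and are not part of Mahout.

```java
import java.util.Arrays;

// Hypothetical helper (not a Mahout class): holds one parsed input line.
public class UserTagLine {
    public final String userId;
    public final double[] weights;

    public UserTagLine(String userId, double[] weights) {
        this.userId = userId;
        this.weights = weights;
    }

    // Splits "userId w1 w2 ... wn" on whitespace: first token is the name,
    // the remaining tokens are the tag weights.
    public static UserTagLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected a user id and at least one weight: " + line);
        }
        double[] weights = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            weights[i - 1] = Double.parseDouble(tokens[i]);
        }
        return new UserTagLine(tokens[0], weights);
    }

    public static void main(String[] args) {
        UserTagLine parsed = parse("AR2TSU61187FB5C4F0 0.5 0.2 0.7 0.0 0.0 0.1 0.0 0.0");
        // In Mahout, these two values would then be wrapped as
        // new NamedVector(new DenseVector(parsed.weights), parsed.userId).
        System.out.println(parsed.userId + " -> " + Arrays.toString(parsed.weights));
    }
}
```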

I then ran k-means successfully. I have the output in the directory clusteredPoints (some 88 files with names such as part-m-00088) and in the directory clusters, which I believe contains the centroids.

Thanks for any help!


1 Answer


I think you need to look into clusterdump more closely. Run mahout clusterdump --help to see its options, then try this:

    mahout clusterdump -i clusters-*-final/part-r-00000 -o output -p clusteredPoints/part-m-00000

and try this link for further explanation.

You can also add the option -of CSV, which gives output like this:

  • id_cluster1,vec1,vec2..vecl
  • id_cluster2,vec1,vec2..vecl
  • ...
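
If the CSV output really has the layout shown above (a cluster id followed by the names of its vectors), turning it into a cluster-to-members lookup takes only a few lines. The following is a minimal sketch in plain Java under that assumption; the class `ClusterCsvIndex`, its `parse` method, and the sample ids other than AR2TSU61187FB5C4F0 are made up for illustration and are not part of Mahout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Mahout class): indexes CSV lines by cluster id.
public class ClusterCsvIndex {

    // Assumes each non-empty line looks like "clusterId,name1,name2,...".
    // Lines that repeat a cluster id are merged into the same member list.
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> membersByCluster = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split(",");
            List<String> members =
                    membersByCluster.computeIfAbsent(fields[0].trim(), k -> new ArrayList<>());
            for (int i = 1; i < fields.length; i++) {
                String name = fields[i].trim();
                if (!name.isEmpty()) {
                    members.add(name);
                }
            }
        }
        return membersByCluster;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file; otherwise use made-up sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("12,AR2TSU61187FB5C4F0,userB", "47,userC");
        for (Map.Entry<String, List<String>> entry : parse(lines).entrySet()) {
            System.out.println("cluster " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Note that this loads the whole file into memory, so for a very large dump it would be better to read the file line by line.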