I want to cluster text. I roughly understand how clustering text-only content works, from Mahout in Action:
- build a dictionary that maps (int -> term) for every term in the input
- convert each input document into a normalized sparse vector
- run the clustering on those vectors
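To show what I mean by the first two steps, here is a minimal Python sketch of my understanding. It is not Mahout code: the function names `build_dictionary` and `to_sparse_vector`, the whitespace tokenizer, the plain term counts, and the sample documents are all my own assumptions for illustration. I store the dictionary in the term -> int direction because that is the lookup needed while vectorizing.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Map each distinct term to an integer index, in first-seen order."""
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary

def to_sparse_vector(doc, dictionary):
    """Return {index: weight}: term counts scaled to unit L2 length."""
    counts = Counter(dictionary[t] for t in doc.lower().split() if t in dictionary)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    # An empty document has no terms, so it maps to an empty vector.
    return {i: c / norm for i, c in counts.items()} if norm else {}

# Made-up sample documents.
docs = ["beach trip with anna", "beach photos", "office meeting notes"]
dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]
```

Each entry of `vectors` only stores the dimensions that are non-zero, which is what I understand "sparse" to mean here.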
I want to cluster on the text together with other information such as date-time, location, and the people I was with. For example, I want documents created during a 10-day visit to a distant place to end up in their own distinct cluster.
I know I must write my own tool for building vectors from date-time, location, tags, and (natural) text. How should I approach this? Should I use the built-in tools to vectorize the text and then merge that output into my own vectors? And how should I weight the different dimensions?
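To make the question concrete, this is roughly the kind of combination I have in mind, as a Python sketch rather than Mahout code. Everything in it is an assumption on my part: the helper names `combine`, `time_block`, and `location_block`, the scaling of each feature, the per-block weights, and the sample values. I do not know whether simply concatenating scaled blocks like this is a sound approach, which is really what I am asking.

```python
from datetime import datetime

def combine(blocks, weights):
    """Concatenate sparse feature blocks into one sparse vector.

    blocks:  list of (size, {index: value}) pairs, one per feature type
    weights: one multiplier per block; a larger weight gives that block
             more influence on a Euclidean distance between documents
    """
    combined, offset = {}, 0
    for (size, block), weight in zip(blocks, weights):
        for i, v in block.items():
            combined[offset + i] = weight * v
        offset += size  # the next block starts after this block's dimensions
    return combined

def time_block(when, origin, span_days):
    """One dimension: days since origin, divided by the expected time span."""
    days = (when - origin).total_seconds() / 86400.0
    return (1, {0: days / span_days})

def location_block(lat, lon):
    """Two dimensions: latitude and longitude scaled to [-1, 1] (crude)."""
    return (2, {0: lat / 90.0, 1: lon / 180.0})

# Made-up example: a text vector over an 8-term dictionary plus time and place.
text = (8, {0: 0.7071, 4: 0.7071})
origin = datetime(2012, 1, 1)
doc = combine(
    [time_block(datetime(2012, 1, 11), origin, 365.0),
     location_block(45.0, 90.0),
     text],
    [5.0, 5.0, 1.0],  # hypothetical weights favoring time and location
)
```

Here the time and location blocks are weighted five times as heavily as the text, in the hope that a 10-day trip far from home would separate from everything else, but I picked those numbers arbitrarily.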