How to include datetimes and other priority information for clustering?

Question

I want to cluster text. I roughly understand how clustering of text-only content works, from Mahout in Action:

make a mapping (int -> term) of all terms in the input and store it in a dictionary
convert all input documents into a normalized sparse vector
do clustering
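
The first two steps above can be sketched in plain Python. This is only an illustration, not Mahout's implementation: the function names build_dictionary and to_sparse_vector are made up for this example, tokenization is a naive whitespace split, and term counts are used instead of TF-IDF. The dictionary here maps term -> int; the inverse (int -> term) mapping can be derived from it.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Assign every distinct term an integer index (hypothetical helper)."""
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            # setdefault keeps the first index a term was given
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def to_sparse_vector(doc, dictionary):
    """Return an L2-normalized sparse vector as {index: weight}."""
    counts = Counter(dictionary[t] for t in doc.lower().split() if t in dictionary)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {i: c / norm for i, c in counts.items()} if norm else {}

docs = ["the beach was sunny", "the office was cold", "sunny beach walk"]
dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]
```

The resulting vectors are what a clustering algorithm such as k-means would consume in the third step.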

I want to cluster text together with other information such as date-time, location, and the people I was with. For example, I want documents created during a 10-day visit to a distant place to end up in their own distinct cluster.

I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use the built-in tools to vectorize the text and then integrate that output into my own vectors? What about weighting the dimensions?

andrew.butkus · Accepted Answer · 2013-10-17T09:35:08

I can't give you full implementation details, as I'm not sure of them myself, but I can help with one piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as locations, times/dates, and person names).

For this take a look at OpenNLP.

http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html

In particular, look at the POS tagger and the name finder.

Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but I am not sure about this.
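
As for combining the extracted date-time and location with the text vector and weighting the dimensions, one possible approach is to scale each extra feature to a comparable range and append it to the sparse text vector with a per-block weight. The following is a minimal sketch under that assumption, not a Mahout API: WEIGHTS, combine, distance, the weight values, and the sample coordinates are all hypothetical and would need tuning on real data.

```python
import math
from datetime import datetime

# Hypothetical per-block weights; a larger value makes that block
# count for more in the Euclidean distance.
WEIGHTS = {"text": 1.0, "time": 3.0, "place": 3.0}

def combine(text_vec, when, lat, lon, origin, span_days, n_terms):
    """Append weighted time and location dimensions to a sparse text vector."""
    vec = {i: WEIGHTS["text"] * w for i, w in text_vec.items()}
    # Time: days since a reference date, scaled by the covered time span.
    days = (when - origin).total_seconds() / 86400.0
    vec[n_terms] = WEIGHTS["time"] * days / span_days
    # Place: latitude and longitude scaled to [-1, 1]
    # (this ignores wraparound at the date line).
    vec[n_terms + 1] = WEIGHTS["place"] * lat / 90.0
    vec[n_terms + 2] = WEIGHTS["place"] * lon / 180.0
    return vec

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

origin = datetime(2013, 1, 1)
# Two notes from the same trip with different wording, and one from home
# that shares its wording with the first trip note.
trip_a = combine({0: 1.0}, datetime(2013, 6, 1), 35.0, 139.0, origin, 365.0, 10)
trip_b = combine({1: 1.0}, datetime(2013, 6, 5), 35.1, 139.2, origin, 365.0, 10)
home = combine({0: 1.0}, datetime(2013, 2, 1), 52.0, 13.0, origin, 365.0, 10)
```

With these weights, the two trip notes end up closer to each other than the first trip note is to the home note, even though the home note shares its text. Raising or lowering the time and place weights controls how strongly a trip pulls its documents into a separate cluster.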

Good luck.
