I have a huge data set in CSV format consisting of key-value pairs. The values are a mixture of integers and short strings (keywords rather than lengthy text), and I want to process it using Mahout's clustering algorithms.
The issue is converting this CSV into vectors that Mahout can consume. I have been reading "Mahout in Action", and there seem to be two options for vectorizing: using numeric values with Mahout's DenseVector, RandomAccessSparseVector, and SequentialAccessSparseVector implementations, or using the Vector Space Model to vectorize text documents.
The data I want to vectorize is not really a text document, but because it is a huge data set with many different keys and values, it is difficult to map to numeric values. What is the best way to vectorize this kind of data for use in Mahout?
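To make the problem concrete, here is a rough sketch of the kind of mapping I have in mind: hashing each key (for integer values) or each key=value pair (for keyword values) into a slot of a fixed-size sparse vector. This is plain Java and does not use Mahout's API; the class `HashedRecordEncoder`, its methods, the sample record, and the cardinality of 1000 are all hypothetical names and values I made up for illustration. I don't know whether this is a sensible approach for Mahout, and hash collisions are simply summed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Mahout): hashes the key/value pairs of one
// record into a sparse vector with a fixed number of slots.
class HashedRecordEncoder {
    private final int cardinality;

    HashedRecordEncoder(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive");
        }
        this.cardinality = cardinality;
    }

    // Maps a feature name to a slot index in [0, cardinality).
    int slot(String feature) {
        return Math.floorMod(feature.hashCode(), cardinality);
    }

    // Returns a sparse representation (slot index -> weight) of one record.
    Map<Integer, Double> encode(Map<String, String> record) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (Map.Entry<String, String> entry : record.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue().trim();
            try {
                // Integer value: one slot per key, weighted by the number itself.
                double number = Long.parseLong(value);
                sparse.merge(slot(key), number, Double::sum);
            } catch (NumberFormatException notANumber) {
                // Keyword value: one slot per key=value pair, with weight 1.
                sparse.merge(slot(key + "=" + value), 1.0, Double::sum);
            }
        }
        return sparse;
    }

    public static void main(String[] args) {
        // Made-up sample record standing in for one CSV row.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("age", "42");
        record.put("color", "red");

        HashedRecordEncoder encoder = new HashedRecordEncoder(1000);
        System.out.println(encoder.encode(record));
    }
}
```

The resulting index/weight pairs could presumably then be copied into one of the sparse vector implementations mentioned above, but I have not verified that this gives meaningful distances for clustering.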
Any pointers would be appreciated.
Thanks