2
votes

I have a huge data set in CSV format consisting of key-value pairs. The values are a mixture of integers and short strings (keywords rather than lengthy text), and I want to process it using Mahout's clustering algorithms.

The issue is converting this CSV into vectors that Mahout can consume. I have been reading "Mahout in Action", and there seem to be two options for vectorizing: use numeric values with Mahout's DenseVector, RandomAccessSparseVector, or SequentialAccessSparseVector implementations, or use the Vector Space Model to vectorize text documents.

The data I want to vectorize is not really a text document, but because it is a huge data set with many different keys and values, it is difficult to map to numeric values. What is the best way to vectorize this kind of data for use in Mahout?

Any pointers would be appreciated.

Thanks


1 Answer

0
votes

You are most likely to need a RandomAccessSparseVector.

  • Not a DenseVector, since most possible keys will not be present in any given record. With many different keys and a mix of integers and strings, the keyspace is large, so a dense vector would consist mostly of zeros.
  • Not a SequentialAccessSparseVector, since there does not seem to be a natural ordering in your keyspace that would make a specific order of access more efficient when running your algorithms in Mahout.
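One way to get from key-value pairs to sparse indices is to hash each feature into a fixed number of buckets. The following is a minimal sketch of that idea in plain Java, not Mahout's own API: the class `KeyValueVectorizer`, the methods `bucket` and `vectorize`, and the cardinality of 2^20 are all hypothetical names and values chosen for illustration. It assumes integer values are kept as numeric weights under their key, while string values become indicator features of the form `key=value`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class KeyValueVectorizer {
    // Hypothetical vector size; larger values reduce hash collisions.
    static final int CARDINALITY = 1 << 20;

    // Map a feature name to an index in [0, cardinality).
    static int bucket(String feature, int cardinality) {
        return Math.floorMod(feature.hashCode(), cardinality);
    }

    // Turn one record of key-value pairs into sparse (index -> weight) entries.
    static Map<Integer, Double> vectorize(Map<String, String> record, int cardinality) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            String key = e.getKey();
            String value = e.getValue().trim();
            int index;
            double weight;
            try {
                // Integer value: one feature per key, weighted by the number.
                weight = Long.parseLong(value);
                index = bucket(key, cardinality);
            } catch (NumberFormatException notAnInteger) {
                // String value: one indicator feature per (key, value) pair.
                weight = 1.0;
                index = bucket(key + "=" + value, cardinality);
            }
            // Colliding features are summed into the same slot.
            sparse.merge(index, weight, Double::sum);
        }
        return sparse;
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("age", "42");
        record.put("color", "red");
        System.out.println(vectorize(record, CARDINALITY));
    }
}
```

Each resulting (index, weight) entry could then be copied into a RandomAccessSparseVector created with the same cardinality, using the vector's `set(index, value)` method. Numeric weights on very different scales may need normalizing before clustering; that step is left out of the sketch.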

You can easily try different vector representations to see which gives you the best performance.