
Mahout: The output of rowsimilarity differs on every run of the steps below, even though the input is identical each time

Step 1: seq2sparse (create TF-IDF vectors from text)
Step 2: rowid (convert the vectors into a matrix keyed by integer row IDs)
Step 3: rowsimilarity (calculate similarity between the row vectors)
Step 4: seqdumper (dump the binary output as text)

UPDATE:

Thanks, Pferrel, for the reply.
Could you suggest how to specify the seed value?

The commands I am using are:

${MAHOUT_HOME}/bin/mahout seq2sparse -i ${DATA}/seq-data -o ${DATA}/vectors -n 2 -wt tfidf -ng 3 -nv -ow -md 100 -s 10

${MAHOUT_HOME}/bin/mahout rowid -i ${DATA}/vectors/tfidf-vectors/part-r-00000 -o ${DATA}/matrix

${MAHOUT_HOME}/bin/mahout rowsimilarity -i ${DATA}/matrix/matrix -o ${DATA}/similarity --similarityClassname SIMILARITY_COSINE -m 100 -ess -ow


1 Answer


The data is randomly downsampled, so set the random seed to a fixed value if you want repeatable results. Alternatively, you can effectively disable downsampling by raising the threshold at which it kicks in to a very large number of items, but be aware that the job will then run slower: the running time will approach O(n^2).
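
As for how to pass the seed: as far as I know, the rowsimilarity job in Mahout 0.8/0.9 accepts a --randomSeed option, but treat that as an assumption and confirm it against the output of `mahout rowsimilarity --help` for your version. A minimal sketch, reusing the options from the question and a hypothetical seed of 42:

```shell
# Assumption: your Mahout build lists --randomSeed under
# `mahout rowsimilarity --help`. Check before relying on it.
SEED=42                                # hypothetical value; any fixed integer
MAHOUT="${MAHOUT_HOME}/bin/mahout"

# Same options as in the question, plus the fixed seed.
set -- rowsimilarity \
  -i "${DATA}/matrix/matrix" \
  -o "${DATA}/similarity" \
  --similarityClassname SIMILARITY_COSINE \
  -m 100 -ess -ow \
  --randomSeed "${SEED}"

if [ -x "${MAHOUT}" ]; then
  "${MAHOUT}" "$@"                     # run the job
else
  # Dry run: only print the command when mahout is not installed here.
  echo "mahout not found; would run: ${MAHOUT} $*"
fi
```

With the same seed and the same input, two runs should sample the same entries and so produce the same similarities. For the second approach (pushing the downsampling threshold out of reach), the options I believe control it are --maxObservationsPerRow and --maxObservationsPerColumn; again, check the help output rather than taking my word for it.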