I'm using Mahout's Naive Bayes algorithm to classify Amazon reviews as positive or negative.
The data set isn't equally distributed: there are far more positive than negative reviews. A randomly picked test and training set, produced with Mahout's split command on randomly picked tuples, gives good results on positive reviews, but the false positive rate is very high; negative reviews are rarely classified as negative.
I guess an equally distributed training set with equal numbers of positive and negative tuples might solve the problem.
I've tried using mahout split with the options below and then just swapping the training and test sets, but this seems to produce tuples for only one class.
--testSplitSize (-ss) testSplitSize    The number of documents held back
                                       as test data for each category
--testSplitPct (-sp) testSplitPct      The % of documents held back as
                                       test data for each category
--splitLocation (-sl) splitLocation    Location for start of test data
                                       expressed as a percentage of the
                                       input file size (0=start,
                                       50=middle, 100=end)
Is there a way, with mahout split or another tool, to get a properly balanced training set?
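To make clear what I mean by a balanced training set, here is a rough sketch of the kind of downsampling I'm after (plain Python, outside Mahout; the labels and review strings are just made-up placeholders):

```python
import random

def balanced_sample(positives, negatives, seed=42):
    """Downsample the majority class so both classes contribute
    the same number of tuples to the training set."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

# Hypothetical skewed data: 900 positive reviews, 100 negative ones.
pos = [("pos", "review text %d" % i) for i in range(900)]
neg = [("neg", "review text %d" % i) for i in range(100)]

train = balanced_sample(pos, neg)
# train now holds 100 positive and 100 negative tuples.
```

The question is whether mahout split (or some other Mahout utility) can do this per-category downsampling for me, instead of me preprocessing the input files by hand.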