
I'm using Mahout's Naive Bayes algorithm to classify Amazon reviews as positive or negative.

The data set isn't equally distributed: there are far more positive than negative reviews. A test and training set produced with Mahout's split on randomly picked tuples gives good classification results for the positive class, but the false positive rate is also very high. Negative reviews are rarely classified as negative.

I guess an equally distributed training set, with equal numbers of positive and negative tuples, might solve the problem.
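One way to build such an equally distributed training set is to downsample the majority class before handing the data to Mahout. A minimal sketch in Python, assuming reviews are held as `(label, text)` tuples (the `downsample` helper and the sample data are hypothetical, not part of Mahout):

```python
import random

def downsample(positive, negative, seed=42):
    """Randomly downsample the larger class so both classes
    contribute the same number of examples."""
    rng = random.Random(seed)
    n = min(len(positive), len(negative))
    return rng.sample(positive, n) + rng.sample(negative, n)

# Hypothetical labelled reviews as (label, text) tuples
pos = [("pos", "review %d" % i) for i in range(1000)]
neg = [("neg", "review %d" % i) for i in range(100)]

balanced = downsample(pos, neg)
# balanced holds 100 positive and 100 negative tuples
```

The balanced set could then be written back out and fed to Mahout's trainer, while the held-out remainder stays available for testing.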

I've tried using Mahout's split with the options below and then simply swapping the training and test sets, but this seems to produce tuples for only one class.

 --testSplitSize (-ss) testSplitSize     The number of documents held back
                                         as test data for each category
 --testSplitPct (-sp) testSplitPct       The % of documents held back as
                                         test data for each category
 --splitLocation (-sl) splitLocation     Location for start of test data
                                         expressed as a percentage of the
                                         input file size (0=start,
                                         50=middle, 100=end)

Is there a way, with Mahout's split or another tool, to get a proper training set?


1 Answer


I would say the training and test sets should reflect the underlying population. I would not create a test set with equal numbers of positive and negative reviews.

A better solution might be to create multiple training sets via bootstrapping and train one classifier per set. Letting the resulting committee vote on each review should improve your results.
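The bootstrapping idea above could be sketched like this; `bootstrap_samples` and `committee_vote` are hypothetical helpers, not Mahout APIs, and the classifiers themselves are left out:

```python
import random
from collections import Counter

def bootstrap_samples(data, n_models, seed=0):
    """Draw n_models bootstrap samples (with replacement), each the
    size of the original data; train one classifier per sample."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]

def committee_vote(predictions):
    """Majority vote over the labels predicted by the committee members."""
    return Counter(predictions).most_common(1)[0][0]

samples = bootstrap_samples(list(range(10)), 5)
vote = committee_vote(["pos", "neg", "pos"])  # -> "pos"
```

Because each bootstrap sample over- or under-represents the classes differently, the committee as a whole tends to be less biased toward the majority class than any single model.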