7
votes

I am using the Spark 1.5.0 MLlib Random Forest algorithm (Scala code) for two-class classification. Because the dataset I am using is highly imbalanced, the majority class is down-sampled at a 10% sampling rate.
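
For context, the down-sampling is roughly along these lines (a minimal sketch; the label encoding and seed are just illustrative, not my exact code):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // data: RDD[LabeledPoint] where label 0.0 is the majority class (assumed layout)
    def downSampleMajority(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
      val minority = data.filter(_.label == 1.0)
      // keep roughly 10% of the majority class, sampled without replacement
      val majority = data.filter(_.label == 0.0)
        .sample(withReplacement = false, fraction = 0.1, seed = 42L)
      minority.union(majority)
    }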

Is it possible to use the sampling weight (10 in this case) in Spark Random Forest training? I don't see a weight parameter among the inputs to trainClassifier() in RandomForest.
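
For reference, this is the kind of trainClassifier() call I mean (a sketch with illustrative hyperparameters):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.rdd.RDD

    def train(trainingData: RDD[LabeledPoint]): RandomForestModel = {
      val numClasses = 2
      val categoricalFeaturesInfo = Map[Int, Int]()  // all features treated as continuous
      val numTrees = 100                             // illustrative values
      val featureSubsetStrategy = "auto"
      val impurity = "gini"
      val maxDepth = 5
      val maxBins = 32
      // none of these parameters takes per-example or per-class weights
      RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
        numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    }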


1 Answer

2
votes

Not at all in Spark 1.5, and only partially (LogisticRegression/LinearRegression) in Spark 1.6:

https://issues.apache.org/jira/browse/SPARK-7685
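
For example, in Spark 1.6 the spark.ml LogisticRegression exposes instance weights through setWeightCol. A minimal sketch, assuming a DataFrame that already carries a weight column (the column names are assumptions):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.DataFrame

    // training: DataFrame with "label", "features" and a "weight" column (assumed schema);
    // e.g. weight = 10.0 for the kept majority-class rows, 1.0 for the minority class
    def trainWeighted(training: DataFrame) = {
      val lr = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setWeightCol("weight")  // instance weights, added in Spark 1.6 (SPARK-7685)
      lr.fit(training)
    }

There is no equivalent for tree-based models in these versions, so re-weighting via sampling (as you are already doing) remains the practical workaround.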

Here's the umbrella JIRA tracking all the subtasks:

https://issues.apache.org/jira/browse/SPARK-9610