2
votes

I am using NB for document classification and am trying to understand the thresholds parameter to see how it can help optimize the algorithm.

The Spark ML 2.0 doc for thresholds says:

Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.

0) Can someone explain this better? What goal does it achieve? My general idea is that if you have a threshold of 0.7, then at least one class's predicted probability should be more than 0.7; if not, the prediction should come back empty, i.e. classify the document as 'uncertain' or just leave the prediction column empty. But how is the p/t function going to achieve that when you still pick the category with the maximum p/t? For example, with p = [0.6, 0.4] and a uniform t = [0.7, 0.7], p/t = [0.857, 0.571], and class 0 still wins even though 0.6 < 0.7.

1) Which probability does it adjust? The default column 'probability' is actually the conditional probability, and 'rawPrediction' is the confidence, according to the documentation. I believe thresholds will adjust 'rawPrediction', not the 'probability' column. Am I right?

2) Here is what some of my probability and rawPrediction vectors look like. How do I set threshold values based on these so that I can drop uncertain classifications? probability is between 0 and 1, but rawPrediction seems to be on a log scale here.

Probability: [2.233368649314982E-15,1.6429456680945863E-9,1.4377313514127723E-15,7.858651849363202E-15]

rawPrediction: [-496.9606736723107,-483.452183395287,-497.40111830218746]
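
Incidentally, my guess is that probability is just the normalized exponential of rawPrediction, which would explain the log scale. A quick sketch of that check (my own reasoning, not from the docs; values truncated from the vectors above):

import numpy as np

# Guess: probability ~ softmax(rawPrediction); subtracting the max
# before exp() avoids numerical underflow.
raw = np.array([-496.96, -483.45, -497.40])
p = np.exp(raw - raw.max())
p /= p.sum()
print(p)  # the second entry dominates, as in the probability vector above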

Basically, I want the classifier to leave the prediction column empty if no class has a probability greater than 0.7.

Also, how do I classify something as uncertain when more than one category has very close scores, e.g. 0.812, 0.800, 0.799? Picking the max is something I may not want here; instead I'd classify the document as 'uncertain' or leave the prediction empty, so that I can do further analysis and treatment of those documents, or train another model for them.


1 Answer

2
votes

I haven't played with it, but the intent is to supply a different threshold value for each class. I've extracted this example from the docstring:

>>> model = nb.fit(df)
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0

If I understand correctly, the effect was to transform [0.42, 0.58] into [.42/.01, .58/10] = [42, 0.058], switching the prediction ("largest p/t") from class 1 (the first prediction above) to class 0 (the last one). However, I couldn't find the logic in the source. Anyone?
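
As a sanity check, the documented rule is easy to simulate. Here is my own sketch of it, not Spark's actual implementation:

import numpy as np

def predict_with_thresholds(probs, thresholds):
    # "The class with largest value p/t is predicted"
    return int(np.argmax(np.asarray(probs) / np.asarray(thresholds)))

>>> predict_with_thresholds([0.42, 0.58], [0.5, 0.5])
1
>>> predict_with_thresholds([0.42, 0.58], [0.01, 10.0])
0

With equal thresholds the argmax is unchanged, which is why a single uniform threshold cannot produce the "no prediction" behavior you asked about in question 0.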

Stepping back: I do not see a built-in way to do what you want, i.e. stay agnostic when no class dominates. You will have to add that yourself with something like:

import numpy as np

def weak(probs, threshold=.7, epsilon=.01):
    # Weak if the best class misses the threshold, or if the top two
    # classes are within epsilon of each other.
    probs = np.sort(np.asarray(probs))[::-1]  # sort descending
    return probs[0] < threshold or probs[0] - probs[1] < epsilon

>>> cases = [[.5, .5], [.5, .7], [.7, .705], [.6, .1]]
>>> for case in cases:
...    print('{!s:15s} - {}'.format(case, weak(case)))

[0.5, 0.5]      - True
[0.5, 0.7]      - False
[0.7, 0.705]    - True
[0.6, 0.1]      - True

(Notice I haven't checked whether probs is a legal probability distribution.)
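
To actually leave the prediction column empty, you could wrap weak in a UDF over the probability column. This is hypothetical wiring I have not run; the column names and the null-means-empty convention are my assumptions:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# null (empty prediction) for weak rows, otherwise the argmax class.
weak_udf = udf(
    lambda v: None if weak(v.toArray()) else float(np.argmax(v.toArray())),
    DoubleType())

predictions = model.transform(df).withColumn('prediction', weak_udf('probability'))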

Alternatively, if you are not actually making a hard decision, use the predicted probabilities with a metric like the Brier score, log loss, or information gain, which account for calibration as well as accuracy.
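
For instance, with scikit-learn (a toy illustration; the labels and probabilities below are made up, and you would substitute the collected contents of your probability column):

import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

y_true = np.array([0, 1, 1, 0])                  # toy labels
y_prob = np.array([[0.9, 0.1], [0.4, 0.6],
                   [0.2, 0.8], [0.6, 0.4]])      # toy class probabilities

print(log_loss(y_true, y_prob))                # penalizes over-confident mistakes
print(brier_score_loss(y_true, y_prob[:, 1]))  # Brier score on the positive class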