
I've been teaching myself Weka and have learned how to build models and get predictions out of them (using the CLI).

When I run predictions on a data set using a previously built model, I get a column called "prediction", also known as the prediction confidence, for each classified instance.

I know what percent confidence means, but shouldn't the confidence of every prediction simply equal the accuracy of my Weka model?

In other words, if I have a J48 decision tree classifier with 90% accuracy, shouldn't every instance classified by this model have a 90% prediction confidence?

Does anyone know how this confidence percentage is calculated, or how I should interpret prediction confidence versus model accuracy when telling others about my model? Thanks


1 Answer


Basically, when a decision tree is trained on a dataset, you usually want to stop it (or, because of missing features, have to stop it) before it overfits to every single training instance. When this happens, several training samples will end up at each leaf node of the tree, and very often their labels will still be mixed at that point (not all positive class and not all negative class).

The confidence is a measure of how consistent the training labels were at the leaf the tree reached for that instance; roughly, it is the fraction of training samples at that leaf that share the predicted class.
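As a rough illustration (this is a toy sketch of the idea, not Weka's exact J48 internals, which also involve pruning, optional Laplace smoothing, and fractional instances for missing attributes), the confidence at a leaf can be computed as the fraction of training labels at that leaf that agree with the majority class:

    import java.util.HashMap;
    import java.util.Map;

    public class LeafConfidence {
        // Toy sketch: confidence = majority-class fraction among the
        // training labels that ended up at one leaf of the tree.
        static double confidence(String[] leafLabels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : leafLabels) {
                counts.merge(label, 1, Integer::sum);
            }
            int majority = 0;
            for (int c : counts.values()) {
                majority = Math.max(majority, c);
            }
            return (double) majority / leafLabels.length;
        }

        public static void main(String[] args) {
            // A pure leaf: every training sample agrees -> confidence 1.0
            System.out.println(confidence(new String[] {"yes", "yes", "yes"}));
            // A mixed leaf: 3 of 4 samples are "yes" -> confidence 0.75
            System.out.println(confidence(new String[] {"yes", "yes", "yes", "no"}));
        }
    }

A pure leaf gives a confidence of 1.0, while a mixed leaf gives something lower, which is why different predictions from the same model carry different confidences.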

Edit: note this is also used to handle missing features (attributes) in a clean and unbiased way.

See here for a brief definition of this.

Also look at some of Quinlan's work on decision trees, particularly C4.5.

Also: "I know what percent confidence means, but shouldn't the confidence of every prediction simply equal the accuracy of my Weka model?"

No, this isn't true: some instances are easier to classify than others, and these scores reflect that. For example, an instance that lands in a leaf where 9 of 10 training samples share its predicted class will get a higher confidence than one that lands in a leaf with a 6-to-4 split.
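If you want to see this directly, you can load your serialized model with Weka's Java API and print the class distribution for each test instance; the probability of the predicted class is the confidence the CLI reports. (The file names here are hypothetical; substitute your own model and test set.)

    import weka.classifiers.Classifier;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PrintConfidences {
        public static void main(String[] args) throws Exception {
            // Hypothetical file names: substitute your own model and test set.
            Classifier model = (Classifier) SerializationHelper.read("j48.model");
            Instances test = DataSource.read("test.arff");
            test.setClassIndex(test.numAttributes() - 1);

            for (int i = 0; i < test.numInstances(); i++) {
                double predicted = model.classifyInstance(test.instance(i));
                double[] dist = model.distributionForInstance(test.instance(i));
                // The confidence is the probability of the predicted class;
                // it varies per instance depending on which leaf it falls into.
                System.out.printf("instance %d: predicted=%s confidence=%.3f%n",
                        i, test.classAttribute().value((int) predicted),
                        dist[(int) predicted]);
            }
        }
    }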