I am running a NaiveBayes model for text analysis with more than 2,000 variables and more than 30,000 observations. The data is very sparse, but no column is all zeros or otherwise constant.

library(klaR)  # provides NaiveBayes()
model <- NaiveBayes(nation ~ ., data = data_train)

I am getting:

"Zero variances for at least one class in variables: "

followed by a list of 50 variables. The error is similar to the one below, but my class variable is a factor: https://stats.stackexchange.com/questions/35694/naive-bayes-fails-with-a-perfect-predictor

I also ran e1071's naiveBayes on the same data. It runs, but the accuracy is ridiculously low (7%), whereas I get 85% with SVM. Any suggestions? Thanks.

1 Answer

To my understanding, some of your variables must be all zero within a certain class. The whole column of such a variable is not zero; rather, its values in the rows data_train[data_train$Class == "ClassA", ] (assuming one of your classes is called "ClassA") are all zero, so its variance within that class is zero.
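One way to check this is to compute the variance of every predictor within each class and list the predictors where it is zero for at least one class. The sketch below uses a small made-up data frame, toy, as a stand-in for data_train. The predictor names x1, x2, x3 and the class labels are hypothetical; only the outcome name nation comes from the question.

```r
# Hypothetical stand-in for data_train: a factor outcome plus numeric predictors.
toy <- data.frame(
  nation = factor(rep(c("A", "B", "C"), each = 4)),
  x1 = c(0, 0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 1),  # all zero within class A
  x2 = c(1, 0, 2, 0, 0, 1, 0, 3, 2, 0, 1, 0),  # varies in every class
  x3 = c(0, 1, 0, 2, 5, 5, 5, 5, 1, 0, 0, 2)   # constant (all 5) within class B
)

predictors <- setdiff(names(toy), "nation")

# Variance of each predictor within each class:
# rows are classes, columns are predictors.
class_var <- sapply(toy[predictors], function(col) tapply(col, toy$nation, var))

# Predictors whose variance is zero in at least one class.
zero_var <- predictors[apply(class_var == 0, 2, any, na.rm = TRUE)]
print(zero_var)
```

On the real data, replacing toy with data_train should give a list comparable to the variables named in the klaR message, provided all predictors are numeric.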

In this case, klaR raises an error to warn you about the situation. e1071 does not; it will produce a conditional probability of 0 for that variable given Class A. Because Naive Bayes multiplies the conditional probabilities of all variables together, this leads to a "wrong" final probability when you score an unseen sample.
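A small numeric illustration of why one such variable can dominate the result. This is not the internal code of either package, just the Naive Bayes product written out by hand. All numbers are made up, and the tiny standard deviation of 0.001 is an assumed stand-in for "zero variance" so that dnorm() stays defined.

```r
# One test sample has the value 1 for a variable that was all zero in class A.
x <- 1

# Gaussian class-conditional densities for that variable (hypothetical parameters).
dens_A <- dnorm(x, mean = 0,    sd = 0.001)  # class A: near-zero spread around 0
dens_B <- dnorm(x, mean = 0.75, sd = 1)      # class B: ordinary spread

# Combined likelihood of all the OTHER variables, strongly favouring class A.
other_A <- 0.9
other_B <- 0.1

# Equal priors; Naive Bayes multiplies the per-variable terms.
joint <- c(A = 0.5 * other_A * dens_A,
           B = 0.5 * other_B * dens_B)
post <- joint / sum(joint)
print(post)
```

Here dens_A underflows to 0, so the posterior for class A is 0 even though every other variable points to class A.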

SVM, however, does not calculate probabilities for test samples in this way, so the zero variance has almost no effect on its accuracy.