
I have a Naive Bayes classifier that I wrote in Python using a pandas DataFrame, and now I need it in PySpark. My problem is that I need the feature importance of each column. Looking through the PySpark ML documentation, I couldn't find any info on it.

Does anyone know if I can get the feature importance with the Naive Bayes Spark MLlib?

The Python code is the following; the feature importance is retrieved with .coef_:

from sklearn.naive_bayes import BernoulliNB

df = df.fillna(0).toPandas()

X_df = df.drop(['NOT_OPEN', 'unique_id'], axis=1)
X = X_df.values
Y = df['NOT_OPEN'].values  # keep 1-D; sklearn warns on column vectors

mnb = BernoulliNB(fit_prior=True)
estimator = mnb.fit(X, Y)  # fit once and reuse the fitted estimator
y_pred = estimator.predict(X)

# coef_: for a binary classification problem this is the log of the estimated
# probability of a feature given the positive class, so higher values mean
# features that are more important for the positive class.

feature_names = X_df.columns
coefs_with_fns = sorted(zip(estimator.coef_[0], feature_names))
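As a side note, in recent scikit-learn releases coef_ was removed from the naive Bayes classifiers; the same values are available as feature_log_prob_ (row 1 is the positive class in the binary case). A minimal sketch on made-up toy data (the feature names are hypothetical):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    # Toy binary data: 4 samples, 3 binary features.
    X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
    y = np.array([1, 1, 0, 0])

    mnb = BernoulliNB(fit_prior=True).fit(X, y)

    # feature_log_prob_ has shape (n_classes, n_features);
    # row index 1 corresponds to the positive class here.
    importances = sorted(zip(mnb.feature_log_prob_[1], ['f0', 'f1', 'f2']))
    print(importances)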

2 Answers


If you're interested in an equivalent of coef_, the property you're looking for is NaiveBayesModel.theta:

log of class conditional probabilities.

New in version 2.0.0.

i.e.

model = ...  # type: NaiveBayesModel

model.theta.toArray()  # type: numpy.ndarray

The resulting array is of size (number-of-classes, number-of-features), and rows correspond to consecutive labels.
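A minimal end-to-end sketch, assuming a local SparkSession and a tiny hand-made dataset (the data and column names are only illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Hypothetical toy data: (features, label) with binary features.
    train = spark.createDataFrame(
        [(Vectors.dense([1.0, 0.0, 1.0]), 1.0),
         (Vectors.dense([0.0, 1.0, 1.0]), 0.0)],
        ["features", "label"],
    )

    nb = NaiveBayes(modelType="bernoulli")
    model = nb.fit(train)

    theta = model.theta.toArray()  # shape: (n_classes, n_features)
    print(theta.shape)

Pairing theta rows with the column names that went into the feature vector then gives you the same sorted list as coefs_with_fns above.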


It is probably better to evaluate the difference
log(P(feature_X|positive)) - log(P(feature_X|negative)) as a feature importance,
because we are interested in the discriminative power of each feature_X (granted, Naive Bayes is a generative model). Extreme example: some feature_X1 has the same value across all + and - samples, so it has no discriminative power. The probability of this feature value is high for both + and - samples, but the difference of log probabilities is 0.
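That difference can be read directly off theta. A small numpy sketch (the theta values here are made up for illustration):

    import numpy as np

    # Hypothetical theta: rows = classes (0 = negative, 1 = positive),
    # columns = features, entries = log P(feature | class).
    theta = np.array([[-0.5, -1.2, -0.7],
                      [-0.5, -0.3, -2.0]])

    # Discriminative score per feature:
    # log P(feature|positive) - log P(feature|negative)
    diff = theta[1] - theta[0]
    print(diff)  # feature 0 scores 0.0 -> no discriminative power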