
Basically, sklearn has Gaussian Naive Bayes (GaussianNB), which can classify with numeric variables.

However, how do you deal with a data set that contains numeric variables and categorical variables together?

For example, given the dataset below, how can sklearn be used to train on the mixed data types together without discretizing the numeric variables?

+-------+--------+-----+-----------------+
| Index | Gender | Age | Product_Reviews |
+-------+--------+-----+-----------------+
| A     | Female |  20 | Good            |
| B     | Male   |  21 | Bad             |
| C     | Female |  25 | Bad             |
+-------+--------+-----+-----------------+

I mean, for Bayes classification, P(A|B)= P(B|A)*P(A)/P(B).

For categorical variables, P(B|A) is easy to estimate by counting, but for numeric variables it should follow a Gaussian distribution. Assume we have estimated P(B|A) with a Gaussian distribution.

Is there any package that can directly work with these together?

Please note: this question is not a duplicate of How can I use sklearn.naive_bayes with (multiple) categorical features? or Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn.

That is because this question does not want to do naive Bayes with dummy variables (1st question), and also does not want to do a model ensemble (2nd question, solution 2).

The mathematical algorithm is here: https://tom.host.cs.st-andrews.ac.uk/ID5059/L15-HsuPaper.pdf . It calculates the conditional probabilities of numeric variables with a Gaussian distribution instead of by counting, and then makes the classification from all conditional probabilities together, including the categorical variables (by counting) and the numeric variables (Gaussian distribution).
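
For reference, here is a minimal sketch (my own illustration, not from the paper) of what the Gaussian part looks like on the Age column above, using sklearn's GaussianNB; the label encoding Good=1 / Bad=0 is an assumption:

```python
# Sketch: GaussianNB fits a per-class mean and variance for each numeric
# column, so P(Age | class) is evaluated as a normal density rather than
# by counting values. Label encoding (Good=1, Bad=0) is assumed.
import numpy as np
from sklearn.naive_bayes import GaussianNB

age = np.array([[20.0], [21.0], [25.0]])   # Age column from the table
label = np.array([1, 0, 0])                # A=Good, B=Bad, C=Bad

gnb = GaussianNB().fit(age, label)
print(gnb.classes_)           # [0 1]
print(gnb.theta_)             # per-class mean of Age
print(gnb.var_)               # per-class variance (sigma_ in older sklearn)
print(gnb.predict([[22.0]]))  # classification via the Gaussian densities
```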


1 Answer


The answer comes directly from the mathematics of Naive Bayes:

  1. Categorical variables provide you with log P(a|cat) ~ SUM_i log P(cat_i|a) + log P(a) (I am omitting the division by P(cat), since the value returned by the NB implementation ignores it too)

  2. Continuous variables give you the same thing, log P(a|con) ~ SUM_i log P(con_i|a) + log P(a) (again omitting the division, this time by P(con), since the NB implementation ignores it as well)

and since in Naive Bayes the features are independent, we get, for an x which contains both categorical and continuous features,

log P(a|x) ~ SUM_i log P(x_i|a) + log P(a)
           = [SUM_i log P(cat_i|a) + log P(a)] + [SUM_i log P(con_i|a) + log P(a)] - log P(a)
           = log likelihood from the categorical model + log likelihood from the continuous model - log prior of class a

All of these elements can be read out from your two models, each fitted independently to its part of the data. Notice that this is not an ensemble: you simply fit two models and construct the combined one yourself, exploiting the specific independence assumption of Naive Bayes. In this way you work around the implementation limitation while still efficiently constructing a valid NB model on mixed distributions. Note that this works for any set of mixed distributions, so you could do the same with more than two NBs (using different distributions), as in the sketch below.
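
For completeness, a minimal sketch of this recipe on the toy data from the question. It is only an illustration under assumptions: CategoricalNB is used for the categorical part, and the integer encodings of Gender and of the Good/Bad labels are my own choices.

```python
# Combine two independently fitted Naive Bayes models, as derived above.
# Encodings are assumed: Gender Female=0/Male=1, label Bad=0/Good=1.
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

X_cat = np.array([[0], [1], [0]])           # Gender (integer-encoded)
X_num = np.array([[20.0], [21.0], [25.0]])  # Age
y = np.array([1, 0, 0])                     # Product_Reviews: Good=1, Bad=0

cat_nb = CategoricalNB().fit(X_cat, y)
num_nb = GaussianNB().fit(X_num, y)

# log P(a|cat) and log P(a|con); the normalizing constants do not depend
# on the class a, so they cancel in the argmax below.
log_post_cat = cat_nb.predict_log_proba(X_cat)
log_post_num = num_nb.predict_log_proba(X_num)

# log P(a), estimated from the training labels.
classes, counts = np.unique(y, return_counts=True)
log_prior = np.log(counts / counts.sum())

# log P(a|x) ~ log P(a|cat) + log P(a|con) - log P(a)
combined = log_post_cat + log_post_num - log_prior
print(classes[np.argmax(combined, axis=1)])
```

The same pattern extends to more than two models: sum the per-model class log-posteriors and subtract the log prior once per extra model.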