I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
4 Answers
You have at least two options:
Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a unique multinomial NB on that categorical representation of your data.
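As a rough illustration of this first option, here is a sketch on made-up data (all array names and values below are hypothetical): each continuous column is binned on its quintiles, everything is one-hot encoded, and a single multinomial NB is fitted on the result.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# made-up data: two continuous and two binary categorical columns
rng = np.random.RandomState(0)
X_cont = np.column_stack([rng.normal(35, 10, 200),      # e.g. age
                          rng.exponential(24, 200)])     # e.g. length of membership
X_cat = rng.randint(0, 2, size=(200, 2))                 # e.g. registered online, email opt-in
y = rng.randint(0, 2, size=200)                          # e.g. gender label

def quintile_bin(col):
    # bin edges at the 20/40/60/80th percentiles, so each bin holds ~20% of the samples
    edges = np.percentile(col, [20, 40, 60, 80])
    return np.digitize(col, edges)                       # integer codes 0..4

X_binned = np.column_stack([quintile_bin(X_cont[:, j]) for j in range(X_cont.shape[1])])

# every feature is now a categorical code: one-hot encode and fit one multinomial NB
X_all = OneHotEncoder().fit_transform(np.hstack([X_cat, X_binned]))
clf = MultinomialNB().fit(X_all, y)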
Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
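And a minimal sketch of this second option, again on hypothetical made-up data: fit the two NB models independently, stack their predict_proba outputs, and refit a second-stage model on those probabilities.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# made-up data: two continuous and two binary categorical columns
rng = np.random.RandomState(0)
X_cont = rng.normal(size=(200, 2))
X_cat = rng.randint(0, 2, size=(200, 2))
y = rng.randint(0, 2, size=200)

gaussian_nb = GaussianNB().fit(X_cont, y)
multinomial_nb = MultinomialNB().fit(X_cat, y)

gaussian_probas = gaussian_nb.predict_proba(X_cont)
multinomial_probas = multinomial_nb.predict_proba(X_cat)

# the class-assignment probabilities become the new feature matrix
X_new = np.hstack((multinomial_probas, gaussian_probas))
stacked_model = GaussianNB().fit(X_new, y)

In practice the probabilities fed to the second-stage model would ideally come from cross-validated predictions to avoid overfitting, but the structure is the same.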
The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the probability contributed by one specific feature without conditioning on the others. So the algorithm multiplies the probability from one feature by the probability from the next (and we totally ignore the denominator, since it is just a normalizer).
So the right answer is:
- Calculate the probability from the categorical variables.
- Calculate the probability from the continuous variables.
- Multiply 1. and 2.
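As a sketch of what that multiplication looks like with scikit-learn on made-up data (all names below are hypothetical): note that predict_proba already folds in the class prior P(y), so after multiplying the two outputs the prior is divided out once, leaving a score proportional to P(y) * P(x_cat | y) * P(x_cont | y).

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# made-up data: two continuous and two binary categorical columns
rng = np.random.RandomState(0)
X_cont = rng.normal(size=(200, 2))
X_cat = rng.randint(0, 2, size=(200, 2))
y = rng.randint(0, 2, size=200)

gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB().fit(X_cat, y)

prior = np.bincount(y) / len(y)                  # empirical class prior P(y)
scores = gnb.predict_proba(X_cont) * mnb.predict_proba(X_cat) / prior

# the denominator (evidence) is the same for every class,
# so it can be ignored when picking the most likely class
pred = scores.argmax(axis=1)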
Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When creating the classifier, just specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.
from mixed_naive_bayes import MixedNB

# first two columns are categorical codes, last two are continuous
X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]

clf = MixedNB(categorical_features=[0, 1])  # columns 0 and 1 are categorical
clf.fit(X, y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on usage is in the README.md file. Pull requests are greatly appreciated :)
@Yaron's approach needs an extra step (4. below):
- Calculate the probability from the categorical variables.
- Calculate the probability from the continuous variables.
- Multiply 1. and 2., and
- Divide 3. by the sum of the products of 1. and 2. over all classes. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.
Step 4 is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
finals = t * p * self.priors
elif self.gaussian_features.size != 0:
finals = t * self.priors
elif self.categorical_features.size != 0:
finals = p * self.priors
normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])
return normalised
The probabilities of the Gaussian and categorical models (t and p respectively) are multiplied together on line 269 (the second line in the extract above) and then normalized as in step 4 on line 275 (the fourth line from the bottom of the extract above).
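To make step 4 concrete, here is a tiny sketch with made-up numbers for a binary problem:

# hypothetical per-class values for a single sample
categorical_likelihood = [0.20, 0.05]   # step 1: P(x_cat | class 0), P(x_cat | class 1)
gaussian_likelihood = [0.10, 0.30]      # step 2: P(x_cont | class 0), P(x_cont | class 1)
prior = [0.5, 0.5]

# step 3: multiply, per class
unnormalised = [c * g * p for c, g, p in zip(categorical_likelihood, gaussian_likelihood, prior)]
# -> [0.01, 0.0075]

# step 4: divide by the sum over classes so the posteriors add up to 1
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]
# -> [0.571..., 0.428...]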