Scikit-learn: is semi-supervised Naive Bayes implementation available?

Question

I would like to use the Semi-supervised Naive Bayes (Bernoulli) implementation in scikit-learn. According to this link on GitHub, there was some work and discussion about it a year ago (class SemisupervisedNB). There also seems to be a different implementation (function fit_semi?), which appears to have been polished by another user afterwards. However, neither of them is available in the current stable release.

Could someone show me an example of how I could use one of these two implementations with the current release of scikit-learn to build a semi-supervised Naive Bayes classifier? Thanks.

P.S.: I am using scikit-learn classifiers from NLTK with the class SklearnClassifier

EDIT

I have tried the code of SemisupervisedNB in my project, changing the label for the unlabeled class from -1 to 2 (I am using SklearnClassifier from NLTK, and my unlabeled class gets the label 2). However, I am getting ValueError: array must not contain infs or NaNs when computing d (the difference between the current and previous parameters of the model), because the intercept arrays contain inf values... Any idea how to solve this?

You can try to check out the branch and work on that, but I'm not sure it is in a good state currently - also it is quite behind the current stable. You could try to rebase / merge the branch onto current master. But if you are not familiar with git / the project, you will probably have some issues. Or just wait on @larsmans to comment and tell you what to do ;) — Andreas Mueller
Thanks for your reply. Unfortunately I don't have much time for reviewing code now... I'll wait for @ogrisel as well :) — AM2
I'm sorry, but I really don't have time to fix this up or even instruct you how to do it. The semi-supervised NB should work (a colleague tried just a few months ago) but doesn't tie in with current scikit-learn at all. You could try rebasing it, as @amueller suggested. — Fred Foo
@AM2 Some months ago I opened an issue on GitHub about this topic. I found a way to get the implementation of SemiSupervisedNB working. However, I haven't tested so far whether the described changes to the master branch affect other classifiers or code. Try it with caution! — pemistahl
@Peter Stahl Thanks. I have tried the code of SemisupervisedNB in my project, changing the label for the unlabeled class from -1 to 2 (I am using SklearnClassifier from NLTK, and my unlabeled class gets the label 2). However, I am getting ValueError: array must not contain infs or NaNs when computing d (the difference between the current and previous parameters of the model), because the intercept arrays contain inf values... Any idea how to solve this? — AM2

pemistahl · Accepted Answer · 2013-01-31T11:29:24

Some months ago, I opened an issue on GitHub about this topic. It is possible to add the respective code to the current master branch of scikit-learn.

The user @larsmans added an experimental class SemisupervisedNB to the file sklearn/naive_bayes.py around a year ago. This code resides in the branch emnb of his forked scikit-learn repository and can be accessed here.

The essential code resides in two files:

The file naive_bayes.py in the current master branch has to be replaced by the older one from the emnb branch.
The class LabelBinarizer, which is found in the file sklearn/preprocessing.py in the master branch, also has to be edited. The entire class has to be replaced by its definition in @larsmans' emnb branch, where it resides in the file sklearn/preprocessing/__init__.py.

Even though the code for the Naive Bayes classifiers has not changed much in a year, some bug fixes have been added to it. It therefore makes sense to keep the current versions of the file naive_bayes.py and the class LabelBinarizer, and to give the experimental versions different names instead.

I've just created my own fork of the scikit-learn repository and added the experimental files on top of the current stable branch 0.13.X. This branch is called 0.13.X-emnb and can be accessed here. If you look at my three recent commits (1 and 2 and 3), you see which files I've changed and newly created.

Since SemisupervisedNB does not work together with the most recent versions of the other classifiers, I've added a new module next to naive_bayes.py called semisupervised_naive_bayes.py. In it, you will find the older versions of the classifiers under new names, e.g. SemiMultinomialNB instead of MultinomialNB, so that they don't clash with the most recent versions of the classifiers. Likewise, I've added a class SemisupervisedLabelBinarizer next to LabelBinarizer (the choice of the name is a bit unfortunate, but at least it's clear what it should be used for).

So, if you want to use the semisupervised versions of the classifiers, use the module sklearn.semisupervised_naive_bayes. For the current versions, use the module sklearn.naive_bayes.

But please keep in mind that this is highly experimental. It's just a setting for getting this old code working. I haven't searched for bugs.
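
If you only need the general idea and would rather not patch scikit-learn, the EM scheme behind semi-supervised Naive Bayes can be roughly approximated with the stable BernoulliNB alone. The following is a minimal sketch under stated assumptions, not the code from the emnb branch: the helper fit_semi_supervised_nb, its parameters and the toy data are hypothetical names made up for illustration. It relies only on BernoulliNB.fit (with sample_weight) and predict_proba, and it takes the unlabeled rows as a separate array instead of marking them with a special label such as -1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB


def fit_semi_supervised_nb(X_lab, y_lab, X_unl, n_iter=10, alpha=1.0):
    """EM-style semi-supervised Bernoulli NB.

    Hypothetical helper for illustration; not part of scikit-learn
    and not the SemisupervisedNB class from the emnb branch.
    """
    X_lab = np.asarray(X_lab)
    y_lab = np.asarray(y_lab)
    X_unl = np.asarray(X_unl)

    # Initialise the model from the labeled rows only.
    clf = BernoulliNB(alpha=alpha).fit(X_lab, y_lab)
    classes = clf.classes_
    n_unl = X_unl.shape[0]

    # Training set for the M-step: the labeled rows, followed by one
    # copy of the unlabeled rows per class (each copy tagged with that class).
    X_all = np.vstack([X_lab] + [X_unl] * len(classes))
    y_all = np.concatenate([y_lab] + [np.full(n_unl, c) for c in classes])

    for _ in range(n_iter):
        # E-step: class posteriors for the unlabeled rows,
        # shape (n_unl, n_classes), columns ordered like clf.classes_.
        proba = clf.predict_proba(X_unl)

        # M-step: refit with weight 1 for labeled rows and the posterior
        # as the weight of each (unlabeled row, class) copy.
        weights = np.concatenate(
            [np.ones(len(y_lab))] + [proba[:, k] for k in range(len(classes))]
        )
        clf = BernoulliNB(alpha=alpha).fit(X_all, y_all, sample_weight=weights)

    return clf


# Toy data (made up): class 0 tends to set the first two features,
# class 1 the last two.
X_labeled = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
y_labeled = np.array([0, 1])
X_unlabeled = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]])

model = fit_semi_supervised_nb(X_labeled, y_labeled, X_unlabeled)
print(model.predict(np.array([[1, 1, 0, 0], [0, 0, 1, 1]])))
```

In this sketch every class keeps at least its labeled rows at weight 1 and alpha is greater than zero, so class_log_prior_ and feature_log_prob_ stay finite. A fixed number of iterations is used instead of a convergence test on the parameters to keep the example short.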
