
Say we have used the TFIDF transform to encode documents into continuous-valued features.

How would we now use this as input to a Naive Bayes classifier?

Bernoulli Naive Bayes is out, because our features aren't binary anymore.
It seems we can't use Multinomial Naive Bayes either, because the values are continuous rather than counts.

As an alternative, would it be appropriate to use Gaussian Naive Bayes instead? Are TFIDF vectors likely to hold up well under the Gaussian-distribution assumption?

The scikit-learn documentation for MultinomialNB suggests the following:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Isn't it fundamentally impossible to use fractional values for MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete counts (since it involves counting/factorials).

How would TFIDF values even work with this formula?


Technically, you are right. The (traditional) Multinomial N.B. model considers a document D as a vocabulary-sized feature vector x, where each element x_i is the count of term i in document D. By definition, this vector x then follows a multinomial distribution, leading to the characteristic classification function of MNB.

When using TF-IDF weights instead of term counts, our feature vectors (most likely) no longer follow a multinomial distribution, so the classification function is no longer theoretically well-founded. In practice, however, it turns out that using tf-idf weights instead of counts often works (much) better.

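As for how TFIDF values work with the formula: in exactly the same way as counts, except that the feature vector x is now a vector of tf-idf weights and not counts. The factorial (multinomial coefficient) term depends only on the document, not on the class, so it is the same for every class and drops out when comparing class scores; what remains is a weighted sum of log-probabilities, which is well-defined for fractional x_i.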
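To make that concrete, here is a minimal pure-Python sketch of the MNB scoring rule, score(c) = log P(c) + sum_i x_i * log theta_ci, with Laplace smoothing. It is only an illustration, not scikit-learn's implementation; the function names train_mnb and predict_mnb, the three-term vocabulary, and the weight values are all made up for the example. Nothing in the arithmetic requires the x_i to be integers.

```python
import math
from collections import defaultdict


def train_mnb(X, y, alpha=1.0):
    """Estimate log-priors and smoothed log term-probabilities per class.

    X holds feature vectors whose entries may be counts or fractional
    weights such as tf-idf; they are simply summed per class.
    """
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for x, label in zip(X, y):
        class_counts[label] += 1
        for i, value in enumerate(x):
            feature_sums[label][i] += value

    model = {}
    for label, sums in feature_sums.items():
        total = sum(sums) + alpha * n_features  # Laplace-smoothed denominator
        log_prior = math.log(class_counts[label] / len(X))
        log_theta = [math.log((s + alpha) / total) for s in sums]
        model[label] = (log_prior, log_theta)
    return model


def predict_mnb(model, x):
    """Return the class with the highest score.

    The multinomial coefficient is omitted: it is identical for every
    class, so it cannot change which class wins.
    """
    scores = {
        label: log_prior + sum(v * lt for v, lt in zip(x, log_theta))
        for label, (log_prior, log_theta) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical tf-idf-like weights over a three-term vocabulary.
X = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [0.0, 0.7, 0.6], [0.1, 0.9, 0.4]]
y = ["sports", "sports", "politics", "politics"]
model = train_mnb(X, y)
first = predict_mnb(model, [0.7, 0.1, 0.0])
second = predict_mnb(model, [0.0, 0.5, 0.5])
```

With these made-up weights, the first query is dominated by the term that is heavy in the "sports" documents and the second by the terms that are heavy in the "politics" documents, so the two predictions differ accordingly.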

You can also check out the sublinear tf-idf weighting scheme, available in scikit-learn's TfidfVectorizer through the sublinear_tf option. In my own research I found that it performed even better: it uses a logarithmic version of the term frequency. The idea is that when a term occurs 20 times in document a and 1 time in document b, document a should (probably) not be considered 20 times as important, but closer to log(20) times as important.
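A rough sketch of how this might look with scikit-learn follows. Note that scikit-learn's sublinear_tf=True replaces tf with 1 + log(tf) (natural log), so the ratio is not exactly log(20). The toy documents and labels below are invented for illustration, and the first part switches off idf and normalisation only to make the sublinear scaling visible in isolation.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Part 1: isolate the sublinear scaling (no idf, no normalisation).
tf_only = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
weights = tf_only.fit_transform(["spam " * 20 + "ham"])
vocab = tf_only.vocabulary_
spam_w = weights[0, vocab["spam"]]  # term occurring 20 times: 1 + ln(20)
ham_w = weights[0, vocab["ham"]]    # term occurring once: 1 + ln(1) = 1

# Part 2: sublinear tf-idf features fed straight into MultinomialNB.
# Hypothetical toy corpus, far too small for any real conclusion.
docs = [
    "the match ended with a late goal",
    "the team won the match",
    "parliament passed the new bill",
    "the minister proposed a bill",
]
labels = ["sports", "sports", "politics", "politics"]

clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
clf.fit(docs, labels)
pred = clf.predict(["a late goal won the match"])
```

Whether sublinear scaling actually helps depends on the corpus, so it is worth comparing both settings with cross-validation rather than assuming one is better.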