
The Word2Vec class in gensim has a null_word parameter that isn't explained in the docs.

    class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)

What is the null_word parameter used for?

Checking the code at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L680, I see:

    if self.null_word:
        # create null pseudo-word for padding when using concatenative L1 (run-of-words)
        # this word is only ever input – never predicted – so count, huffman-point, etc doesn't matter
        word, v = '\0', Vocab(count=1, sample_int=0)
        v.index = len(self.wv.vocab)
        self.wv.index2word.append(word)
        self.wv.vocab[word] = v
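
Instantiating Word2Vec with null_word=1 directly does add a '\0' pseudo-word to the vocabulary (a quick check, assuming the same gensim version as above, where model.wv.vocab exists):

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "dog"]]

    # min_count=1 so every token in this tiny corpus is kept
    model = Word2Vec(sentences, min_count=1, null_word=1)

    print('\0' in model.wv.vocab)            # True -- the null pseudo-word was added
    print(model.wv.index2word[-1] == '\0')   # True -- appended after the real words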

What is "concatenative L1"?

1 Answer


The null_word is only used in the PV-DM with concatenation mode, enabled by the parameters dm=1, dm_concat=1 in Doc2Vec model initialization.

In this non-default mode, the doctag vector and the vectors of the words within window positions of a target word are concatenated into one very wide input layer, rather than being averaged together as in the more typical modes.
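
You can see both effects in a minimal sketch (assuming, as in the question, a gensim version where model.wv.vocab and model.layer1_size are available): enabling the mode turns on null_word automatically and widens the input layer.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=["a", "b", "c", "d", "e"], tags=[str(i)])
            for i in range(2)]

    # PV-DM with concatenation: 1 doctag vector + 2*window word vectors per example
    model = Doc2Vec(docs, dm=1, dm_concat=1, size=20, window=2, min_count=1)

    print('\0' in model.wv.vocab)  # True -- dm_concat=1 enables null_word automatically
    print(model.layer1_size)       # 100 == (1 + 2*2) * 20, vs. just 20 in averaging modes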

Models in this mode are much larger and slower than in other modes. For target words near the beginning or end of a text example, there might not be enough neighboring words to fill this input layer, but the model still requires values for those slots. So the null_word is essentially used as padding.
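
To make the padding concrete, here's an illustrative pure-Python sketch (not gensim's actual implementation, which works with word indexes in optimized code) of assembling the fixed-width context for each target position:

    NULL_WORD = '\0'

    def concat_input_words(words, target_pos, window):
        """Collect exactly 2*window context words around target_pos,
        padding with the null pseudo-word where the text runs out."""
        context = []
        for pos in range(target_pos - window, target_pos + window + 1):
            if pos == target_pos:
                continue                   # the target itself is predicted, never input
            if 0 <= pos < len(words):
                context.append(words[pos])
            else:
                context.append(NULL_WORD)  # pad: before the start or past the end
        return context

    words = ["the", "quick", "brown", "fox"]
    print(concat_input_words(words, 0, 2))  # ['\x00', '\x00', 'quick', 'brown']
    print(concat_input_words(words, 3, 2))  # ['quick', 'brown', '\x00', '\x00']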

While the original Paragraph Vectors paper mentions using this mode in some of its experiments, using this mode alone is not sufficient to reproduce its results. (No one I know of has been able to reproduce those results, and comments from one of the authors imply the original paper has some error or omission in its described process.)

Additionally, I haven't found cases where this mode offers a clear benefit that justifies the added time and memory. (It might require very large datasets or very long training times to show any benefit.)

So you shouldn't be too concerned about this model property unless you're doing advanced experiments with this less-common mode, in which case you can review the source for the fine details of how the null word is used as padding.