2 votes

I am relatively new to NLP and I am trying to create my own word embeddings, trained on a personal corpus of documents.

I am trying to use the following code to create my own word embeddings:

import gensim

model = gensim.models.Word2Vec(sentences)

with sentences being a list of sentences. Since I cannot pass thousands and thousands of sentences at once, I need an iterator:

# minibatch_dir is a directory containing the text files;
# MySentences is a class that iterates over sentences.
sentences = MySentences(minibatch_dir)  # a memory-friendly iterator

I found this solution by the creator of gensim:

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # stream one whitespace-tokenized sentence per line, file by file
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

It does not work for me. How can I create such an iterator, given that I already know how to get the list of sentences from every document?
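Extracting the sentences from a single document is not the problem; I already have a helper for that. Conceptually I want something like the sketch below (get_doc_sentences stands for my own function, a placeholder name):

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # get_doc_sentences is my own helper (placeholder name):
            # it returns a list of token lists for one document
            for sentence in get_doc_sentences(os.path.join(self.dirname, fname)):
                yield sentence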

And a second, closely related question: if I am aiming to compare document similarity within a particular corpus, is it always better to create word embeddings from scratch using all the documents of that corpus than to use pre-trained GloVe or word2vec vectors? The corpus contains around 40000 docs.

cheers


What do you mean by "does not work for me"? (Is there an error? Unsatisfactory results?) What is the initial format of your data? (Many examples assume, and work best when, you can create a plain-text file where each "document" is on its own line.) While using an iterator that works from giant files on disk is a good idea if you have a giant corpus, just 40K docs isn't very large – quite small for Word2Vec/Doc2Vec – and probably fits in memory as a list just fine. That is, you can "pass thousands and thousands of sentences". – gojomo
(For your second "is it better" question, it isn't completely clear which two options you're considering. If you have enough of your own data, the resulting word-embeddings will typically be better than those imported from elsewhere – you'll be sure to have vectors for all of your words, and in their domain-specific senses. But depending on the subtleties of your data and ultimate goals, vectors from elsewhere might help – there's no "generally better" for all situations.) – gojomo
a) Even if I could, the code should work for others as well, who might have around 250000 docs x 50 pages x 30 lines = 25x10^6 sentences – a list of some 26 million sentences. I will try it. As far as I understand, passing sentence by sentence would let me build the word embeddings bit by bit, but you are right that I might try brute force and pass everything in one chunk. Generating a list in memory with 26 million sentences might take a while (generating sentences out of the bunch of docs is quite time-consuming; sentence segmentation with NLTK and spaCy needs a lot of tweaks to avoid errors). – JFerro
b) The two options are: a) taking my 40000 docs and creating the word vectors (embeddings) myself, or b) using the pre-trained word vectors of GloVe or word2vec. – JFerro
Yes, if you have 26 million sentences, an iterator reading from disk files is the better solution, and it should work fine, as in the many online examples of doing it that way. So if you continue to have problems, clearly state what errors/failures you're encountering and the format of your source data – the problems should then be easy to resolve. – gojomo

1 Answer

3 votes

Your quoted MySentences class assumes one sentence per line in every file. That might not be the case for your data; if so, adapt the __iter__ method to do the sentence splitting itself, as in the sketch below.
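For instance, a variant that reads each file as running text and segments it with NLTK (the class itself is just a sketch, not part of gensim; sent_tokenize and word_tokenize are NLTK's standard tokenizers) might look like this:

import os
from nltk.tokenize import sent_tokenize, word_tokenize

class MyDocSentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                text = f.read()
            # split the raw document text into sentences, then tokenize each
            for sent in sent_tokenize(text):
                yield word_tokenize(sent)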

One thing to note: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general, iter+1 passes; the default is iter=5). The first pass collects the words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass) and you're able to initialize the vocabulary some other way:

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
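(A caveat: in more recent gensim releases – 1.0 through the 3.x series, if I recall correctly – train() no longer infers the corpus size or epoch count, so under that API the manual two-step needs explicit arguments, roughly like this:)

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # records model.corpus_count as a side effect
model.train(other_sentences, total_examples=model.corpus_count, epochs=model.iter)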

For example, if you are reading a dataset stored in a database, a generator function that streams text directly from the database will make Word2Vec throw a TypeError:

TypeError: You can't pass a generator as the sentences argument. Try an iterator.

A generator can be consumed only once, and then it's forgotten. So you can write a wrapper that has an iterator interface but uses the generator under the hood:

class SentencesIterator:
    def __init__(self, generator_function):
        # keep the zero-argument generator function so a fresh
        # generator can be created for every pass
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator at the start of each pass
        self.generator = self.generator_function()
        return self

    def __next__(self):
        # an exhausted generator raises StopIteration on its own;
        # the None check additionally stops on a None sentinel value
        result = next(self.generator)
        if result is None:
            raise StopIteration
        return result

The generator function itself is stored, so the iterator can be reset for each pass. It can then be used in gensim like this:

from gensim.models import FastText

sentences = SentencesIterator(tokens_generator)
model = FastText(sentences)
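Here tokens_generator stands for your own zero-argument generator function. As a purely hypothetical sketch (the corpus.db file and the documents table are placeholders), it could stream token lists straight from a database:

import sqlite3

def tokens_generator():
    # hypothetical corpus database; table and column names are placeholders
    conn = sqlite3.connect("corpus.db")
    for (text,) in conn.execute("SELECT text FROM documents"):
        yield text.split()
    conn.close()

Note that SentencesIterator receives the function itself (tokens_generator), not the result of calling it, so that it can re-create the generator for each training pass.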