I am relatively new to NLP and I am trying to create my own word embeddings, trained on a personal corpus of documents. I am trying to use the following call to build them:
model = gensim.models.Word2Vec(sentences)
where sentences is a list of sentences. Since I cannot pass thousands and thousands of sentences in memory at once, I need an iterator:
# with minibatch_dir a directory containing the text files
# MySentences is a class that iterates over sentences.
sentences = MySentences(minibatch_dir)  # a memory-friendly iterator
I found this solution by the creator of gensim:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()
It does not work for me. How can I create the iterator myself, given that I already know how to extract the list of sentences from every document?
And a second, closely related question: if my goal is to compare document similarity within a particular corpus, is it always better to train word embeddings from scratch on all the documents of that corpus than to use pretrained GloVe or word2vec vectors? The corpus contains around 40,000 documents.
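Whichever embeddings you end up with, a common baseline for document similarity is to average the word vectors of each document and compare the averages with cosine similarity. A minimal sketch, assuming you already have a word-to-vector mapping (e.g. model.wv from a trained Word2Vec model); the function names here are illustrative, not part of gensim:

```python
import numpy as np

def doc_vector(tokens, word_vectors, dim):
    # Average the vectors of in-vocabulary tokens;
    # fall back to a zero vector if none are known.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    # Standard cosine similarity, guarding against zero vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```

This is only a baseline; whether self-trained or pretrained vectors work better depends mainly on how domain-specific your vocabulary is.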
cheers