I'm trying to use gensim's (version 1.0.1) Doc2Vec to get the cosine similarities of documents. This should be relatively simple, but I'm having trouble retrieving the document vectors so that I can compute the cosine similarity. When I try to retrieve a document by the label I gave it in training, I get a KeyError.
For example,
    print(model.docvecs['4_99.txt'])
will tell me that there is no such key as 4_99.txt.
However, if I run print(model.docvecs.doctags), I see entries like this:
    '4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)
So it appears that, for every document, doc2vec is saving each sentence under a tag of the form "document name, underscore, number".
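To make the mismatch concrete, here is a minimal sketch that uses a plain dict as a stand-in for model.docvecs.doctags. The keys, the offsets other than 1644, and the helper name tags_for_file are all made up for illustration; this is not gensim code. It only shows why the bare filename is missing while the suffixed tags exist.

```python
# Hypothetical stand-in for model.docvecs.doctags: tag -> offset.
# Only '4_99.txt_3' -> 1644 comes from my real output; the rest is invented.
doctags = {
    "4_99.txt_0": 1641,
    "4_99.txt_1": 1642,
    "4_99.txt_2": 1643,
    "4_99.txt_3": 1644,
    "4_100.txt_0": 1645,
}


def tags_for_file(doctags, filename):
    """Return every tag of the form '<filename>_<n>', ordered by n."""
    prefix = filename + "_"
    matches = [
        tag for tag in doctags
        if tag.startswith(prefix) and tag[len(prefix):].isdigit()
    ]
    return sorted(matches, key=lambda tag: int(tag[len(prefix):]))


# The bare filename is not a key, which matches the error I get.
filename_is_a_tag = "4_99.txt" in doctags

# The per-line tags for that file are there, though.
line_tags = tags_for_file(doctags, "4_99.txt")
```

So the lookup by filename fails simply because no tag with that exact name was ever registered during training.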
So I'm either
A) training incorrectly, or
B) not understanding how to retrieve the doc vector so that I can do similarity(d1, d2).
Can anyone help me out here?
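For reference, this is roughly what I want to end up with once I can get at the vectors. It is a pure-Python sketch with toy numbers, not gensim code. Averaging the per-line vectors into one document vector is only my assumption about how they could be combined; I don't know whether that is a sound substitute for a real document vector. The names mean_vector and cosine_similarity are my own.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy per-line vectors for two documents (made-up values).
doc1_lines = [[1.0, 0.0], [3.0, 0.0]]
doc2_lines = [[0.0, 2.0], [0.0, 4.0]]

d1 = mean_vector(doc1_lines)
d2 = mean_vector(doc2_lines)
similarity = cosine_similarity(d1, d2)
```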
Here is how I train my doc2vec:
    import os
    from gensim.models import Doc2Vec

    # Obtain txt abstracts and txt patents
    filedir = os.path.abspath(os.path.dirname(__file__))
    files = os.listdir(filedir)

    # Doc2Vec takes [['a', 'sentence'], 'and label']
    docLabels = [f for f in files if f.endswith('.txt')]

    # Map each file name to its label prefix, e.g. {'2_139.txt': '2_139.txt'}
    sources = {}
    for label in docLabels:
        sources[label] = label
    sentences = LabeledLineSentence(sources)

    model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
    model.build_vocab(sentences.to_array())
    for epoch in range(10):
        model.train(sentences.sentences_perm())
    model.save('./a2v.d2v')
This uses the following class:
    from random import shuffle

    from gensim import utils
    from gensim.models.doc2vec import LabeledSentence

    class LabeledLineSentence(object):
        def __init__(self, sources):
            self.sources = sources
            flipped = {}
            # make sure that the prefixes (the dict values) are unique
            for key, value in sources.items():
                if value not in flipped:
                    flipped[value] = [key]
                else:
                    raise Exception('Non-unique prefix encountered')

        def __iter__(self):
            # Yield one LabeledSentence per line, tagged '<prefix>_<line number>'
            for source, prefix in self.sources.items():
                with utils.smart_open(source) as fin:
                    for item_no, line in enumerate(fin):
                        yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

        def to_array(self):
            # Same tagging as __iter__, but collected into a list
            self.sentences = []
            for source, prefix in self.sources.items():
                with utils.smart_open(source) as fin:
                    for item_no, line in enumerate(fin):
                        self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
            return self.sentences

        def sentences_perm(self):
            # Shuffle in place; to_array() must have been called first
            shuffle(self.sentences)
            return self.sentences
I got this class from a web tutorial (https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1) to help me get around Doc2Vec's unusual input format requirements, and to be honest I don't completely understand it. The class does appear to add the _n suffix for each sentence (that is, each line of a file), but in the tutorial they still seem to retrieve the document vector using just the filename. So what am I doing wrong here?
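My guess is that the difference comes down to where the tag gets assigned. Here is a rough sketch of the two options as I understand them. It uses a namedtuple as a stand-in for gensim's LabeledSentence and in-memory strings instead of files, so it runs without gensim; tag_per_line, tag_per_file and the sample text are hypothetical names and data of my own. I have not confirmed that the per-file variant is what the tutorial actually does.

```python
from collections import namedtuple

# Stand-in for gensim's LabeledSentence(words, tags); not the real class.
Labeled = namedtuple("Labeled", ["words", "tags"])


def tag_per_line(texts):
    """What my class seems to do: one tag per line, '<name>_<line number>'."""
    for name, text in texts.items():
        for item_no, line in enumerate(text.splitlines()):
            yield Labeled(line.split(), [name + "_%s" % item_no])


def tag_per_file(texts):
    """Alternative: the whole file becomes one item tagged with the bare name."""
    for name, text in texts.items():
        yield Labeled(text.split(), [name])


# Made-up file contents keyed by file name.
texts = {"4_99.txt": "first line here\nsecond line here"}

per_line_tags = [doc.tags[0] for doc in tag_per_line(texts)]
per_file_docs = list(tag_per_file(texts))
```

With the per-line version the only tags are '4_99.txt_0' and '4_99.txt_1', while the per-file version produces a single item tagged '4_99.txt', which is the key I was trying to look up.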