I'm trying to compare my implementation of Doc2Vec (via tf) and gensims implementation. It seems atleast visually that the gensim ones are performing better.
I ran the following code to train the gensim model and the one below that for tensorflow model. My questions are as follows:
- Is my tf implementation of Doc2Vec correct. Basically is it supposed to be concatenating the word vectors and the document vector to predict the middle word in a certain context?
- Does the
window=5
parameter in gensim mean that I am using two words on either side to predict the middle one? Or is it 5 on either side. Thing is there are quite a few documents that are smaller than length 10. - Any insights as to why Gensim is performing better? Is my model any different to how they implement it?
- Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinite solutions to this since its a rank deficient problem. <- This last question is simply a bonus.
Gensim
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
model.train(corpus)
TF
batch_size = 512
embedding_size = 100 # Dimension of the embedding vector.
num_sampled = 10 # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
# Input data.
train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size/context_window])
train_labels = tf.placeholder(tf.int32, shape=[batch_size/context_window, 1])
# The variables
word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))
doc_embeddings = tf.Variable(tf.random_uniform([len_docs,embedding_size],-1.0,1.0))
softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
stddev=1.0 / np.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
###########################
# Model.
###########################
# Look up embeddings for inputs and stack words side by side
embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
shape=[int(batch_size/context_window),-1])
embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
embed = tf.concat(1,[embed_words, embed_docs])
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
train_labels, num_sampled, vocabulary_size))
# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
Update:
Check out the jupyter notebook here (I have both models working and tested in here). It still feels like the gensim model is performing better in this initial analysis.
negative
ornum_sampled
? Couldn't quite get it – Clock Slavedm_concat
mode results in much-larger, slower-to-train models that probably require a lot more data (or training-passes) than the more-commonly-used PV-DBOW or PV-DM-with-context-window-averaging. I initially addeddm_concat
mode to gensim, to try to closely reproduce the 'Paragraph Vector' paper results said to use that mode. (I couldn't; nor has anyone else who's tried.) I haven't personally found any datasets/evaluations wheredm_concat
was worth the extra effort – but maybe they exist with really-big doc corpuses. – gojomo