I'm trying to compare my implementation of Doc2Vec (via TensorFlow) with gensim's implementation. It seems, at least visually, that the gensim vectors are performing better.
I ran the following code to train the gensim model, and the code below that for the TensorFlow model. My questions are as follows:
- Is my TF implementation of Doc2Vec correct? Basically, is it supposed to concatenate the word vectors and the document vector to predict the middle word in a certain context?
- Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one? Or is it 5 on either side? The thing is, quite a few documents are shorter than 10 words.
- Any insights as to why gensim is performing better? Is my model any different from how they implement it?
- Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinitely many solutions, since it's a rank-deficient problem. (This last question is simply a bonus.)
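To make the window question concrete, here is a minimal sketch in plain Python of the "w words on either side" reading. The helper name `context_words` is hypothetical and is not part of gensim. The gensim documentation describes `window` as the maximum distance between the current and predicted word, which corresponds to this reading rather than to "w words in total".

```python
def context_words(tokens, center, window):
    """Return up to `window` tokens on each side of position `center`."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


doc = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# window=2 around "jumps" (index 4): two words on each side.
print(context_words(doc, 4, 2))  # ['brown', 'fox', 'over', 'the']

# A document shorter than the window simply yields a smaller context here.
print(context_words(["short", "doc"], 0, 5))  # ['doc']
```

In this sketch a short document just produces fewer context words; how a particular library pads or truncates such contexts is not modelled here.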
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
epochs = 100
for i in range(epochs):
# TensorFlow model. context_window, vocabulary_size and len_docs are assumed to be defined earlier.
import numpy as np
import tensorflow as tf

batch_size = 512
embedding_size = 100  # Dimension of the embedding vector.
num_sampled = 10  # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # Input data: context_window word ids per example, plus one doc id and one label per example.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables
    word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window + 1) * embedding_size],
                                                      stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Model.
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                             shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                                     train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
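As a sanity check on the shapes in the graph above, here is a NumPy-only sketch of the forward pass for one batch. Each example looks up `context_window` word vectors, lays them side by side, and appends one document vector, giving a row of `(context_window + 1) * embedding_size` features that lines up with the second dimension of `softmax_weights`. The toy sizes below are assumptions for illustration, not values from the notebook, and a full set of logits stands in for the sampled softmax purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes (not taken from the notebook).
vocabulary_size, len_docs = 50, 7
embedding_size, context_window = 4, 2
batch_size = 6  # number of word ids fed per batch
n_examples = batch_size // context_window

word_embeddings = rng.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))
doc_embeddings = rng.uniform(-1.0, 1.0, (len_docs, embedding_size))
softmax_weights = rng.normal(
    0.0, 1.0 / np.sqrt(embedding_size),
    (vocabulary_size, (context_window + 1) * embedding_size))
softmax_biases = np.zeros(vocabulary_size)

word_ids = rng.integers(0, vocabulary_size, batch_size)
doc_ids = rng.integers(0, len_docs, n_examples)

# Stack each example's context word vectors side by side.
embed_words = word_embeddings[word_ids].reshape(n_examples, -1)
embed_docs = doc_embeddings[doc_ids]
# Append the document vector to the stacked word vectors.
embed = np.concatenate([embed_words, embed_docs], axis=1)

# One score per vocabulary word for every example.
logits = embed @ softmax_weights.T + softmax_biases
print(embed.shape, logits.shape)  # (3, 12) (3, 50)
```

If the concatenated width did not match `softmax_weights`, the matrix product would fail, which makes this a quick way to check the layout before training.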
Check out the Jupyter notebook here (I have both models working and tested in it). It still feels like the gensim model is performing better in this initial analysis.
? Couldn't quite get it – Clock Slave
dm_concat mode results in much-larger, slower-to-train models that probably require a lot more data (or training-passes) than the more-commonly-used PV-DBOW or PV-DM-with-context-window-averaging. I initially added dm_concat mode to gensim, to try to closely reproduce the 'Paragraph Vector' paper results said to use that mode. (I couldn't; nor has anyone else who's tried.) I haven't personally found any datasets/evaluations where dm_concat was worth the extra effort – but maybe they exist with really-big doc corpuses. – gojomo
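A rough back-of-the-envelope sketch of why concatenation gives a much larger model than averaging: with concatenation the hidden layer is as wide as all the context vectors plus the document vector laid side by side, whereas averaging keeps it at the width of a single vector. The function name and the vocabulary size of 10,000 are assumptions for illustration; the window and vector size match the Doc2Vec call in the question, and the count assumes one document vector and `window` words on each side.

```python
def output_layer_params(vocab_size, vector_size, window, concat):
    """Count output-layer weights for a PV-DM style model (illustrative only)."""
    if concat:
        # 2 * window word vectors plus 1 document vector, side by side.
        hidden_width = (2 * window + 1) * vector_size
    else:
        # Inputs are averaged into a single vector.
        hidden_width = vector_size
    return vocab_size * hidden_width


vocab_size = 10_000  # assumed vocabulary size
n_concat = output_layer_params(vocab_size, 100, 5, concat=True)
n_mean = output_layer_params(vocab_size, 100, 5, concat=False)
print(n_concat, n_mean, n_concat // n_mean)  # 11000000 1000000 11
```

Under these assumptions the concatenating variant has eleven times as many output weights to fit from the same corpus, which is consistent with the remark that it may need more data or more training passes.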