1 vote

I read Kaggle’s word2vec tutorial at https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors and I can’t understand why the model’s vocabulary length is different from the word-vector length.

Doesn’t every cell in a word vector represent its relation to another word from the vocabulary, so that each word has a relation to every other word? If not, what does each cell in the word vector represent?

Really appreciate any help.


3 Answers

1 vote

Word2Vec learns a distributed representation of a word, which essentially means that multiple neurons (cells) capture a single concept (a concept can be a word’s meaning, sentiment, part of speech, etc.), and a single neuron (cell) also contributes to multiple concepts.

These concepts are automatically learnt and not pre-defined, hence you can think of them as latent/hidden.

The more neurons (cells) there are, the more capacity your neural network has to represent these concepts, but the more data is required to train these vectors (as they are initialised randomly).

The size of a word vector is typically much smaller than the vocabulary size, since we want a compressed representation of each word. The cosine similarity between two word vectors indicates the similarity between the two words.

EDIT

For more clarity, think of each word as previously being represented by a one-hot encoded vector whose length is the vocabulary size, which is of the order of thousands or millions. The same word is now condensed into a 200- or 300-dimensional vector. To find the relation between two words, you calculate the cosine similarity between the vector representations of those two words.
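To make that concrete, here is a minimal NumPy sketch (the vocabulary size, the 300-dimensional vectors, the word index, and the random values are all made up for illustration) of the size difference and the cosine-similarity calculation:

```python
import numpy as np

vocab_size = 50_000   # assumed vocabulary size (order of tens of thousands)
embed_dim = 300       # assumed word-vector length

# One-hot encoding: one cell per vocabulary word, all zeros except a single 1.
one_hot_cat = np.zeros(vocab_size)
one_hot_cat[123] = 1.0            # index 123 is a made-up position for "cat"

# Dense word2vec-style vectors: only embed_dim cells, values learnt from data.
# Random vectors stand in here for trained embeddings.
rng = np.random.default_rng(0)
vec_cat = rng.normal(size=embed_dim)
vec_dog = rng.normal(size=embed_dim)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(one_hot_cat.shape)                    # (50000,) -> vocabulary length
print(vec_cat.shape)                        # (300,)   -> word-vector length
print(cosine_similarity(vec_cat, vec_dog))  # similarity between the two words
```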

1 vote

word2vec embeds words in a vector space whose dimension is user-defined. For computation and performance reasons, this dimension is usually rather small (typically between 50 and 1000).
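For example, here is a minimal sketch using the gensim library (assuming gensim ≥ 4.0, where the dimension is set via vector_size; older releases call it size) on a made-up toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus (made up); a real corpus would contain millions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# vector_size is the user-defined embedding dimension (50-1000 in practice).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

print(len(model.wv))               # vocabulary size: number of distinct words
print(model.wv["cat"].shape)       # (50,): vector length, independent of vocab size
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```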

In fact, this excellent paper by Levy and Goldberg shows that word2vec efficiently computes a factorization of a PMI matrix, which is similar to the one you have in mind. Therefore, each coordinate in a word embedding can be interpreted as quantifying some unknown linear relation to multiple (if not all) context-words, not just one.
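A rough sketch of that connection, using made-up co-occurrence counts and an explicit truncated SVD (word2vec itself factorizes a shifted PMI matrix implicitly rather than via SVD):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows = words, columns = context words).
# In practice this matrix is vocab_size x vocab_size and extremely sparse.
counts = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # joint probability P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)

# Positive PMI: log P(w,c) / (P(w) P(c)), clipped at zero for unseen pairs.
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)

# Explicit factorization with truncated SVD: each row of W is a low-dimensional
# word embedding whose coordinates mix information from all context words.
k = 2                                 # embedding dimension << vocabulary size
U, S, Vt = np.linalg.svd(ppmi)
W = U[:, :k] * np.sqrt(S[:k])         # word embeddings, shape (4, 2)
print(W)
```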

1 vote

Previous answers mention performance and computation costs as the reason for having vector sizes smaller than the vocabulary size. If the vector is not the relationship with all the other words in the vocabulary, then I wanted to know what it really is.

Some of the earlier algorithms did create full-size word vectors and then shrank them down using linear algebra. The condensed feature vectors were then fed into neural networks.

word2vec has condensed this process into one step and builds the word vectors in the hidden layer of its neural network. The size of the word vector corresponds to the number of nodes in the hidden layer.
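A minimal NumPy sketch of that idea (the vocabulary size, hidden-layer size, word index, and random weights are all made up): the input-to-hidden weight matrix has one row per vocabulary word, and multiplying a one-hot input by it simply selects that word’s row, which is its word vector.

```python
import numpy as np

vocab_size = 10_000   # assumed vocabulary size
hidden_size = 300     # number of hidden-layer nodes = word-vector length

# Input-to-hidden weights of a skip-gram-style network; rows are word vectors.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_size))

# A one-hot input vector for the word at index 42 (made-up index).
x = np.zeros(vocab_size)
x[42] = 1.0

hidden = x @ W_in           # hidden-layer activation for that word...
word_vector = W_in[42]      # ...which is just row 42 of the weight matrix

print(hidden.shape)                      # (300,)
print(np.allclose(hidden, word_vector))  # True
```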

A longer version of this answer, with sources, is available here.