1 vote

I read Kaggle’s word2vec tutorial at https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors and I can’t understand why the model’s vocabulary length is different from the word-vector length.

Doesn’t every cell in a word vector represent its relation to another word from the vocabulary, so that each word has a relation to every other word? If not, what does each cell in the word vector represent?

Really appreciate any help.


3 Answers

1 vote

Word2Vec learns a distributed representation of a word, which essentially means that multiple neurons (cells) capture a single concept (a concept can be a word’s meaning, sentiment, part of speech, etc.), and a single neuron (cell) also contributes to multiple concepts.

These concepts are automatically learnt and not pre-defined, hence you can think of them as latent/hidden.

The more neurons (cells) there are, the more capacity your neural network has to represent these concepts, but the more data is required to train these vectors (as they are initialised randomly).

The size of a word vector is typically much smaller than the vocabulary size, since we want a compressed representation of each word. The cosine similarity between two word vectors indicates the similarity between the two words.

EDIT

For more clarity, think of each word as previously being represented by a one-hot encoded vector whose length is the vocabulary size, which is of the order of thousands or millions. The same word is now condensed into a 200- or 300-dimensional vector. To find the relation between two words, you calculate the cosine similarity between the vector representations of those two words.
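To make that concrete, here is a minimal NumPy sketch (the vocabulary size, the 300-dimensional vectors, the word index, and the random values are all made up for illustration) of the size difference and the cosine-similarity calculation:

```python
import numpy as np

vocab_size = 50_000   # assumed vocabulary size (order of tens of thousands)
embed_dim = 300       # assumed word-vector length

# One-hot encoding: one cell per vocabulary word, all zeros except a single 1.
one_hot_cat = np.zeros(vocab_size)
one_hot_cat[123] = 1.0            # index 123 is a made-up position for "cat"

# Dense word2vec-style vectors: only embed_dim cells, values learnt from data.
# Random vectors stand in here for trained embeddings.
rng = np.random.default_rng(0)
vec_cat = rng.normal(size=embed_dim)
vec_dog = rng.normal(size=embed_dim)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(one_hot_cat.shape)                    # (50000,) -> vocabulary length
print(vec_cat.shape)                        # (300,)   -> word-vector length
print(cosine_similarity(vec_cat, vec_dog))  # similarity between the two words
```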

1 vote

word2vec embeds words in a vector space whose dimension is user-defined. For computation and performance reasons, this dimension is usually rather small (typically between 50 and 1000).
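For example, here is a minimal sketch using the gensim library (assuming gensim ≥ 4.0, where the dimension is set via vector_size; older releases call it size) on a made-up toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus (made up); a real corpus would contain millions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "animals"],
]

# vector_size is the user-defined embedding dimension (50-1000 in practice).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

print(len(model.wv))               # vocabulary size: number of distinct words
print(model.wv["cat"].shape)       # (50,): vector length, independent of vocab size
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```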

In fact, this excellent paper by Levy and Goldberg shows that word2vec efficiently computes a factorization of a PMI matrix, which is similar to the one you have in mind. Therefore, each coordinate in a word embedding can be interpreted as quantifying some unknown linear relation to multiple (if not all) context-words, not just one.
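A rough sketch of that connection, using made-up co-occurrence counts and an explicit truncated SVD (word2vec itself factorizes a shifted PMI matrix implicitly rather than via SVD):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows = words, columns = context words).
# In practice this matrix is vocab_size x vocab_size and extremely sparse.
counts = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # joint probability P(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)

# Positive PMI: log P(w,c) / (P(w) P(c)), clipped at zero for unseen pairs.
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)

# Explicit factorization with truncated SVD: each row of W is a low-dimensional
# word embedding whose coordinates mix information from all context words.
k = 2                                 # embedding dimension << vocabulary size
U, S, Vt = np.linalg.svd(ppmi)
W = U[:, :k] * np.sqrt(S[:k])         # word embeddings, shape (4, 2)
print(W)
```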

1 vote

Previous answers mention performance and computation costs as the reason for having vector sizes smaller than the vocabulary size. If the vector is not the relationship with all the other words in the vocabulary, then I wanted to know what it really is.

Some of the earlier algorithms did create full-size word vectors and then shrank them down using linear algebra. The condensed feature vectors were then fed into neural networks.

word2vec has condensed this process into one step and builds the word vectors in the hidden layer of its neural network. The size of the word vector corresponds to the number of nodes in the hidden layer.
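A minimal NumPy sketch of that idea (the vocabulary size, hidden-layer size, word index, and random weights are all made up): the input-to-hidden weight matrix has one row per vocabulary word, and multiplying a one-hot input by it simply selects that word’s row, which is its word vector.

```python
import numpy as np

vocab_size = 10_000   # assumed vocabulary size
hidden_size = 300     # number of hidden-layer nodes = word-vector length

# Input-to-hidden weights of a skip-gram-style network; rows are word vectors.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_size))

# A one-hot input vector for the word at index 42 (made-up index).
x = np.zeros(vocab_size)
x[42] = 1.0

hidden = x @ W_in           # hidden-layer activation for that word...
word_vector = W_in[42]      # ...which is just row 42 of the weight matrix

print(hidden.shape)                      # (300,)
print(np.allclose(hidden, word_vector))  # True
```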

A longer version of this answer, with sources, is available here.