My question concerns how to properly train a Word2Vec model for a unique and very specific use case. See Word2Vec details here
I am working on identifying noun-adjective relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words; the technique should work for all such relationships. I am taking a supervised approach for now, and then plan to work towards an unsupervised method.
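To make the goal concrete, here is a minimal sketch of what I mean by a "connecting" vector, assuming it is simply the offset from the adjective's embedding to the noun's embedding. The 3-dimensional vectors, the word list, and the helper names `offset` and `mean_offset` are invented for illustration; real embeddings would come from the trained model.

```python
# Toy 3-dimensional embeddings; the values are invented for illustration only.
embeddings = {
    "nice": [0.9, 0.1, 0.3],
    "car": [0.2, 0.8, 0.5],
    "fast": [0.7, 0.2, 0.1],
    "train": [0.1, 0.9, 0.4],
}

# Labelled (adjective, noun) pairs, as they would come from the annotated sentences.
labelled_pairs = [("nice", "car"), ("fast", "train")]


def offset(adjective, noun):
    """Return the vector pointing from the adjective to the noun."""
    return [n - a for a, n in zip(embeddings[adjective], embeddings[noun])]


def mean_offset(pairs):
    """Average the adjective-to-noun offsets over all labelled pairs."""
    offsets = [offset(adj, noun) for adj, noun in pairs]
    return [sum(component) / len(offsets) for component in zip(*offsets)]


connecting_vector = offset("nice", "car")
average_vector = mean_offset(labelled_pairs)
```

Whether a single averaged offset is a good representation of the relationship is exactly the kind of thing that depends on how the embeddings were trained.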
Now that you understand what I am trying to do, I will explain the problem. I know that word2vec needs to be trained on large amounts of data to learn the embeddings as accurately as possible, but I am hesitant to give it more data than my data-set of labelled sentences (500-700 sentences).
I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors, but the extra data will influence the positioning of my words, so that the word relationship I am looking for is biased by the extra training data. (E.g. what if there is also 'nice Apple' in the extra training data? Then the positioning of the word 'nice' could be compromised.)
Hopefully this makes sense and I am not making bad assumptions. I am stuck in a dilemma: either bad vectors because of too little training data, or good vectors whose positioning in the embedding space is compromised.
What would be the proper data to train on? As much training data as possible (billions of words), or just the labelled data-set (500-700 sentences)?
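To illustrate the worry, here is a small sketch that uses raw context counts as a stand-in for what the model sees during training. It is not Word2Vec itself; the two example sentences, the window size, and the helper name `context_counts` are assumptions made up for this example.

```python
from collections import Counter


def context_counts(sentences, target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Collect every neighbour in the window except the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


labelled = ["I bought a nice car"]   # stands in for the labelled sentences
extra = ["She ate a nice apple"]     # stands in for the extra corpus

small = context_counts(labelled, "nice")
large = context_counts(labelled + extra, "nice")
```

With only the labelled sentence, 'nice' is seen next to 'car'; once the extra sentence is added, it is also seen next to 'apple', so the contexts that shape its vector are no longer only the ones from my data-set.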
Thank you kindly for your time, and let me know if anything that I explained does not make sense.