0 votes

My question concerns the proper way to train a Word2Vec model for a rather unique and specific use case (see the Word2Vec materials for details on the model itself).

I am working on identifying noun-adjective relationships within the word embeddings.

(E.g. 'nice car' appears in a sentence of the data-set. Given the word embeddings of the corpus, with all nouns and adjectives labelled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
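To make this concrete, here is a minimal sketch of what I mean by the "connecting" vector, using gensim (the model path is just a placeholder for whatever embeddings I end up training):

```python
from gensim.models import Word2Vec

# Placeholder path: a Word2Vec model trained on my corpus.
model = Word2Vec.load("my_word2vec.model")

v_nice = model.wv["nice"]   # adjective / feature descriptor
v_car = model.wv["car"]     # noun / feature

# The "connecting" vector I am after is essentially the offset between the
# two embeddings; ideally such offsets behave consistently across all
# labelled (noun, adjective) pairs.
offset = v_car - v_nice
```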

Of course, I am not trying to connect only that pair of words; the technique should work for all such relationships. A supervised approach is being taken at the moment, with the aim of later working towards an unsupervised method.

Now that you understand what I am trying to do, I will explain the problem. I obviously know that word2vec needs to be trained on large amounts of data to learn the embeddings as accurately as possible, but I am afraid to give it more data than my labelled data-set (500-700 sentences).

I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors, but the extra data will also influence the positioning of my words, so the word relationships I care about would be biased by the extra training data. (E.g. if 'nice Apple' also appears in the extra training data, the positioning of the word 'nice' could be compromised.)

Hopefully this makes sense and I am not making bad assumptions; I am simply caught in the dilemma of having bad vectors because of too little training data, or having good vectors whose positioning is compromised by unrelated data.

What would be the proper way to train: on as much training data as possible (billions of words), or on just the labelled data-set (500-700 sentences)?
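For reference, these are roughly the two setups I am weighing, as a minimal gensim sketch (the corpora below are tiny placeholders standing in for my labelled sentences and for an extra corpus such as a Wikipedia dump):

```python
from gensim.models import Word2Vec

# Tiny placeholders for my real data: tokenized labelled sentences and a
# tokenized extra corpus (e.g. a Wikipedia dump).
labelled_sentences = [["nice", "car"], ["bad", "seat"], ["nice", "seat"]]
extra_sentences = [["the", "new", "apple", "phone", "is", "nice"]]

# Option A: only the labelled data-set; vectors are positioned purely by my
# domain, but are likely poor because 500-700 sentences is very little data.
model_small = Word2Vec(labelled_sentences, vector_size=100, window=5,
                       min_count=1, epochs=50)

# Option B: labelled data-set plus the extra corpus; generally better vectors,
# but their positioning is also shaped by out-of-domain co-occurrences.
model_large = Word2Vec(labelled_sentences + extra_sentences, vector_size=100,
                       window=5, min_count=1, epochs=50)
```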

Thank you kindly for your time, and let me know if anything that I explained does not make sense.

It's unclear what's unique/specific about your goal. What kind of relationship between 'nice' and 'car' are you expecting? Why is part-of-speech labeling important? Are you sure plain word2vec on part-of-speech-unlabeled text isn't sufficient? Note that 500-700 sentences is tiny for this sort of model – good results come from millions (or billions) of training words, especially to achieve word-vectors with hundreds of dimensions, and good vectors for less-common words. – gojomo
What I am looking at is opinion phrases. An opinion has a feature (e.g. 'car') and a feature descriptor (e.g. 'nice'). I did not go into the specifics, but I am trying to perform feature-based opinion mining (original paper: Hu, Minqing, and Bing Liu. "Mining opinion features in customer reviews." AAAI. Vol. 4. No. 4. 2004.). I have labelled features and feature descriptors as well as the original text data, and I know 500-700 sentences is not enough, but training on more data would introduce extra noise into the positioning of the feature and feature descriptor, which I am trying to avoid. – Uther Pendragon
@gojomo Please read sophros's answer to understand the dilemma: training on the labelled data-set only and getting bad vectors, vs. training on as much data as possible plus the labelled data-set and introducing noise that is unrelated to the labelled data-set's semantic meaning. – Uther Pendragon
You may want to look at FastText's classification options – where word-vecs are trained to be good at predicting classes, not just neighboring words. Still, you'll want a lot more data. Data of a similar domain (reviews), even without sentiment-labels, might still be helpful in fleshing out the words, in a way that doesn't bring in word-noise from different domains. – gojomo
@gojomo Thank you, that makes sense. – Uther Pendragon

1 Answer

1 vote

As always in similar situations, it is best to check...

I wonder whether you have tested the difference between training on the labelled dataset only vs. also including the Wikipedia dataset. Do the issues you are afraid of really appear?

I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
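A rough sketch of such a check, assuming gensim and using toy placeholder corpora: raw coordinates from two independent runs live in different coordinate systems, so it is more meaningful to compare relational quantities, e.g. the cosine similarity of each labelled (feature, descriptor) pair under both models.

```python
from gensim.models import Word2Vec
import numpy as np

# Toy placeholder corpora just to make the sketch runnable; substitute your
# labelled sentences and your labelled + Wikipedia mix here.
small_corpus = [["nice", "car"], ["bad", "seat"], ["nice", "seat"]]
large_corpus = small_corpus + [["nice", "apple"], ["red", "apple"], ["fast", "car"]]

model_small = Word2Vec(small_corpus, vector_size=50, min_count=1, epochs=50, seed=1)
model_large = Word2Vec(large_corpus, vector_size=50, min_count=1, epochs=50, seed=1)

# Your labelled (feature, descriptor) pairs.
pairs = [("car", "nice"), ("seat", "bad")]

def pair_similarities(model, pairs):
    # Cosine similarity for each pair whose words are in the model's vocabulary.
    return np.array([model.wv.similarity(noun, adj)
                     for noun, adj in pairs
                     if noun in model.wv and adj in model.wv])

sims_small = pair_similarities(model_small, pairs)
sims_large = pair_similarities(model_large, pairs)

print("small corpus:", sims_small)
print("large corpus:", sims_large)
```

If the pair similarities shift substantially between the two models, the bias you fear is real; if they stay roughly the same, the extra data is probably safe (and likely helpful) to use.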

I suspect that you may introduce some noise with a larger corpus, but more data may be beneficial with respect to vocabulary coverage (a larger corpus is more universal). It all depends on your expected use case. It is likely to be a trade-off between high precision with very low recall vs. so-so precision with relatively good recall.