8
votes

I have downloaded the Google News pretrained vector file, which was trained on about 100 billion words. On top of that, I am also training on my own 3 GB of data, producing another pretrained vector file. Both have 300 feature dimensions and are more than 1 GB in size.

How do I merge these two huge pretrained vector files? Or how do I train a new model and update its vectors on top of another? I see that the C-based word2vec does not support batch training.

I am looking to compute word analogies from these two models. I believe that vectors learned from these two sources will produce pretty good results.


2 Answers

13
votes

There's no straightforward way to merge the end-results of separate training sessions.

Even for the exact same data, slight randomization from initial seeding or thread scheduling jitter will result in diverse end states, making vectors only fully comparable within the same session.

This is because every session finds a useful configuration of vectors... but there are many equally useful configurations, rather than a single best.

For example, whatever final state you reach has many rotations/reflections that can be exactly as good on the training prediction task, or perform exactly as well on some other task (like analogy solving). But most of these possible alternatives will not have coordinates that can be mixed-and-matched for useful comparisons against each other.
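To make the rotation point concrete, here is a minimal sketch with made-up 2-D vectors (real word2vec vectors would be 300-D, and the words and numbers below are purely illustrative assumptions). A rotated copy of a vector set keeps every within-set similarity intact, yet comparing the same word across the two sets gives a meaningless score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Hypothetical toy vectors standing in for one training session.
session_a = {"king": [0.9, 0.1], "queen": [0.8, 0.3], "apple": [-0.2, 0.9]}

# A second "session": the same geometry rotated by 90 degrees.
# It is exactly as good for any similarity-based task.
session_b = {w: rotate(v, math.pi / 2) for w, v in session_a.items()}

within_a = cosine(session_a["king"], session_a["queen"])
within_b = cosine(session_b["king"], session_b["queen"])
# Same word, but coordinates taken from different sessions.
across = cosine(session_a["king"], session_b["king"])

print(f"king~queen in A: {within_a:.3f}")
print(f"king~queen in B: {within_b:.3f}")
print(f"king(A)~king(B): {across:.3f}")
```

The two within-session scores agree, while the cross-session score for the very same word is about zero, which is why naively averaging or concatenating vocabularies from separate runs does not work.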

Preloading your model with data from prior training runs might improve the results after more training with new data, but I'm not aware of any rigorous testing of this possibility. The effect likely depends on your specific goals, your parameter choices, and how much the new and old data are similar, or representative of the eventual data against which the vectors will be used.

For example, if the Google News corpus is unlike your own training data, or the text you'll be using the word-vectors to understand, using it as a starting point might just slow or bias your training. On the other hand, if you train on your new data long enough, eventually any influence of the preloaded values could be diluted to nothingness. (If you really wanted a 'blended' result, you might have to simultaneously train on the new data with an interleaved goal for nudging the vectors back towards the prior-dataset values.)
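The dilution effect and the 'nudge back towards the prior' idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: a simple quadratic loss stands in for the real word2vec objective, `blended_step` and `pull` are hypothetical names, and the vectors are made up. It is not how any word2vec implementation trains.

```python
def blended_step(vec, new_data_grad, prior, lr=0.1, pull=0.5):
    """One gradient step on: new-data loss + (pull / 2) * ||vec - prior||^2."""
    return [
        v - lr * (g + pull * (v - p))
        for v, g, p in zip(vec, new_data_grad, prior)
    ]

def toy_new_data_grad(vec, target):
    """Gradient of the stand-in loss 0.5 * ||vec - target||^2 (NOT the word2vec loss)."""
    return [v - t for v, t in zip(vec, target)]

prior = [1.0, 0.0]        # hypothetical vector from the earlier (e.g. Google News) run
new_optimum = [0.0, 1.0]  # hypothetical vector the new data alone would prefer

def train(pull, steps=500):
    vec = list(prior)  # preload with the prior value
    for _ in range(steps):
        grad = toy_new_data_grad(vec, new_optimum)
        vec = blended_step(vec, grad, prior, pull=pull)
    return vec

diluted = train(pull=0.0)  # no pull: the preloaded value is eventually forgotten
blended = train(pull=1.0)  # equal pull: settles between prior and new optimum

print("no pull:  ", [round(x, 3) for x in diluted])
print("with pull:", [round(x, 3) for x in blended])
```

With `pull=0.0` the preloaded start only affects the early steps; with a nonzero `pull` the result stays a compromise between the two sources no matter how long training runs.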

Ways to combine the results from independent sessions might make a good research project. Maybe the method used in the word2vec language-translation projects – learning a projection between vocabulary spaces – could also 'translate' between the different coordinates of different runs. Maybe locking some vectors in place, or training on the dual goals of 'predict the new text' and 'stay close to the old vectors' would give meaningfully improved combined results.

3
votes

Here are my methods:

  • Download the Google News corpus and merge it into your data, then train on the combined data!

  • Divide your data set into 2 equal-sized data sets, then train a model on each of them. Now you have 3 models, so you can use a blending method to predict.
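One possible reading of the blending suggestion, sketched under assumptions: the three models are taken to be the Google News model plus one model per half of your own data, and "blending" is taken to mean averaging each model's similarity score. The function name `blended_similarity` and the toy 2-D vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def blended_similarity(models, a, b):
    """Average the cosine similarity of (a, b) over the models that know both words."""
    scores = [cosine(m[a], m[b]) for m in models if a in m and b in m]
    if not scores:
        raise KeyError(f"no model contains both {a!r} and {b!r}")
    return sum(scores) / len(scores)

# Hypothetical word -> vector tables standing in for three trained models.
google_news = {"cat": [1.0, 0.0], "dog": [0.8, 0.6]}
own_half_1 = {"cat": [0.0, 1.0], "dog": [-0.6, 0.8], "gpu": [1.0, 1.0]}
own_half_2 = {"cat": [0.6, 0.8], "dog": [0.0, 1.0]}

score = blended_similarity([google_news, own_half_1, own_half_2], "cat", "dog")
print(f"blended cat~dog similarity: {score:.3f}")
```

Note that this averages scores, not raw vectors: each similarity is computed inside a single model's coordinate system, so the models never need to share coordinates.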

I hope these help!