2
votes

I get a memory error when I use GoogleNews-vectors-negative300.bin or try to train a model with Gensim on a Wikipedia dataset corpus (1 GB). I have 4 GB of RAM in my system. Is there any way to work around this?

Can we host it on a cloud service like AWS to get better speed?


3 Answers

6
votes

4GB is very tight for that vector set; you should have 8GB or more to load the full set. Alternatively you could use the optional limit argument to load_word2vec_format() to just load some of the vectors. For example, limit=500000 would load just the first 500,000 (instead of the full 3 million). As the file appears to put the more-frequently-appearing tokens first, that may be sufficient for many purposes.
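As a minimal sketch of this approach (the file path is a placeholder for wherever the downloaded .bin file lives, and the limit value is just an example):

    from gensim.models import KeyedVectors

    # Load only the first 500,000 vectors from the pretrained binary file
    # instead of all 3 million, keeping memory usage manageable.
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin",  # placeholder path
        binary=True,
        limit=500000,
    )

    # The truncated set supports the usual lookups and similarity queries.
    print(wv.most_similar("king", topn=5))

Since the file orders tokens roughly by frequency, the vectors you drop are the rarest ones, which many applications never need anyway.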

0
votes

There is no way to get away with 4 GB. I could load and compute with GoogleNews-vectors-negative300.bin on my 8 GB RAM MacBook Pro. However, when I loaded this gigantic pretrained vector set on AWS, I had to upgrade the instance to 16 GB of RAM because it was serving a web app at the same time. So if you want to use it in a web app with a safety margin, you will need 16 GB.

0
votes

It is really difficult to load the entire GoogleNews-vectors pre-trained model. I was able to load about 50,000 vectors (i.e. 1/60th of the total) on my 8 GB Ubuntu machine using a Jupyter Notebook. As expected, memory/resource usage went through the roof. So it is safest to use at least 16 GB to load the entire model; otherwise, pass limit=30000 as a parameter, as suggested by @gojomo.