H2O recently added word2vec to its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.

However, even greater possibilities come from using big data and big computers of the kind that software vendors like Google or H2O.ai may have access to, but that many end users of H2O do not, because of network bandwidth and compute power limitations.

Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors, built on a very large corpus, as infrastructure in specific applications. Using general purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to reusing the generic lowest layers of a deep learning computer vision model, which learn to detect edges in photographs, while higher layers detect specific kinds of objects composed from the edges found by the layers below them.

For example, Google provides some pretrained word vectors with their word2vec package. With unsupervised learning, it is often true that the more examples, the better. Further, it is sometimes practically difficult for an individual data scientist to download a giant corpus of text on which to train their own word vectors. And there is no good reason for every user to reinvent the wheel by training word vectors themselves on the same general purpose corpora, like Wikipedia.

Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.

Three questions:

1 - Does H2O currently provide any general purpose pretrained word embeddings (word vectors), for example trained on text found at legal or other publicly owned (government) websites, or Wikipedia or Twitter or Craigslist, or other free or Open Commons sources of human-written text?

2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpora, such as medicine and law?

3 - Can H2O import Google's pretrained word vectors from their word2vec package?

1 Answer

Thank you for your questions.

You are absolutely right, there are many situations when you don't need a custom model and a pre-trained model will work well. I assume people will mostly build their own models on smaller problems in their specific domain and use pre-trained models to complement the custom model.

You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.

To do that, import the model into a Frame (just like with any other dataset):

w2v.frame <- h2o.importFile("pretrained.glove.txt")

And then convert it to a regular H2O word2vec model:

w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)

Please note that you need to provide the size of the embeddings (vec_size), which should match the dimensionality of the vectors in the imported file.
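If you are not sure what embedding size a downloaded file uses, one way to check is to count the numeric fields on its first line. The following is a minimal standalone Python sketch, not part of H2O; the helper name infer_vec_size is hypothetical, and it assumes a GloVe-style text file (one token per line followed by space-separated numbers, no header line, no whitespace inside tokens).

```python
import io


def infer_vec_size(lines):
    """Return the embedding dimension of a GloVe-style text file.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: each line is "<token> <v1> <v2> ... <vN>" with no header.
    """
    for line in lines:
        fields = line.split()
        if fields:
            # The first field is the token; the rest are vector components.
            return len(fields) - 1
    raise ValueError("no vectors found in input")


# Small in-memory example standing in for a real file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
print(infer_vec_size(sample))

# With a real file (the path is only an example):
# with open("pretrained.glove.txt", encoding="utf-8") as f:
#     print(infer_vec_size(f))
```

The number it reports is the value you would pass as vec_size. If the file starts with a "vocabulary-size dimension" header line, as some word2vec text exports do, read the dimension from that header instead.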

H2O doesn't plan to provide a model exchange/model market for w2v models as far as I know. You can use models that are available online: https://github.com/3Top/word2vec-api

We currently do not support importing Google's binary format of word embeddings; however, support is on our roadmap, as it makes a lot of sense for our users.
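In the meantime, a possible workaround is to convert the binary file to a plain text file yourself and import that as a Frame. The following is a rough standalone Python sketch, not an H2O feature; the function names are hypothetical. It assumes the layout written by the original word2vec tool: a text header "vocab_size vec_size", then for each word the word itself, a space, and vec_size little-endian 32-bit floats. I have not checked how H2O's parser treats tokens containing quotes or separator characters, so inspect the resulting Frame after import.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Yield (word, vector) pairs from a word2vec binary stream."""
    header = stream.readline().split()
    vocab_size, vec_size = int(header[0]), int(header[1])
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                chars.extend(ch)
        word = chars.decode("utf-8", errors="replace")
        raw = stream.read(4 * vec_size)
        if len(raw) != 4 * vec_size:
            raise EOFError("unexpected end of file while reading a vector")
        yield word, struct.unpack("<%df" % vec_size, raw)


def convert_to_text(src, dst):
    """Write one "word v1 v2 ... vN" line per entry; return the entry count."""
    count = 0
    for word, vector in read_word2vec_binary(src):
        dst.write(word + " " + " ".join(repr(x) for x in vector) + "\n")
        count += 1
    return count


# Tiny in-memory example standing in for a real binary file.
demo = (b"2 3\n"
        + b"cat " + struct.pack("<3f", 0.5, -1.25, 2.0) + b"\n"
        + b"dog " + struct.pack("<3f", 1.0, 0.0, -0.75) + b"\n")
out = io.StringIO()
convert_to_text(io.BytesIO(demo), out)
print(out.getvalue(), end="")

# With real files (the paths are only examples):
# with open("vectors.bin", "rb") as src, \
#         open("vectors.txt", "w", encoding="utf-8") as dst:
#     convert_to_text(src, dst)
```

The converted text file can then be loaded the same way as the GloVe file above, passing the vec_size from the binary file's header.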