How do I build an H2O word2vec training_frame that distinguishes between different documents or sentences?

As far as I can tell from the very limited documentation I have found, you simply supply one long list of words, such as:

'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'

However, it would make sense to be able to distinguish between sentences, ideally with something like this:

Name   | ID
This   | 1
is     | 1
the    | 1
first  | 1
This   | 2
is     | 2
number | 2
two    | 2

Is that possible?

1 Answer

word2vec is a type of unsupervised learning: it turns string data into numbers. So to do classification you need a two-step process:

  • word2vec for strings to numbers
  • any supervised learning technique for numbers to categories
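
As a rough sketch of those two steps with the H2O Python client: the column names `text` and `label`, the hyperparameter values, and the choice of GBM as the classifier are my own assumptions for illustration, not taken from the linked examples. It expects an H2OFrame with one sentence per row and a running cluster (`h2o.init()`).

```python
def train_text_classifier(df, text_col="text", label_col="label"):
    """Sketch: df is an H2OFrame with a column of sentences and a label column.

    text_col / label_col are hypothetical column names.
    """
    # Imported inside the function so the sketch can be read and loaded
    # without a running H2O cluster.
    from h2o.estimators import (
        H2OGradientBoostingEstimator,
        H2OWord2vecEstimator,
    )

    # Step 1: strings to numbers.
    # tokenize() needs a string column; it returns a single column of tokens
    # with NA marking the sentence boundaries.
    words = df[text_col].ascharacter().tokenize(" ")
    w2v = H2OWord2vecEstimator(vec_size=50, epochs=5)  # illustrative values
    w2v.train(training_frame=words)
    # Average the word vectors within each sentence: one row per sentence.
    sentence_vecs = w2v.transform(words, aggregate_method="AVERAGE")

    # Step 2: numbers to categories, using any supervised learner (GBM here).
    data = sentence_vecs.cbind(df[label_col].asfactor())
    clf = H2OGradientBoostingEstimator()
    clf.train(x=sentence_vecs.names, y=label_col, training_frame=data)
    return w2v, clf
```

To classify new text you would tokenize it the same way, run it through `w2v.transform(...)` with the same `aggregate_method`, and pass the result to `clf.predict(...)`.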

The documentation contains links to a categorization example in both R and Python. This tutorial shows the same process on a different data set (and there should be an H2O World 2017 video that goes with it).

By the way, in your original example, you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will produce this format for you. So your example would actually be:

'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'
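
To make the layout concrete, here is a small plain-Python sketch (no H2O involved) that builds the same single-column sequence from a list of sentences, using `None` to stand in for NA. The helper name `to_token_column` is made up for this illustration, and it only mirrors the example above; it is not a reimplementation of what h2o.tokenize() does internally.

```python
def to_token_column(sentences):
    """Flatten sentences into one token list, with None (NA) between sentences."""
    tokens = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            tokens.append(None)  # sentence boundary marker, like NA in the frame
        tokens.extend(sentence.split())  # naive whitespace tokenization
    return tokens


tokens = to_token_column(["This is the first", "This is number two"])
print(tokens)
```

So the boundary marker plays the role of the ID column in your question: every run of tokens between two NAs belongs to one sentence.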