2
votes

I'm trying to solve the following with the IBM Watson Natural Language Classifier on IBM Bluemix:

I have N training documents D, each labeled with labels l_x_y drawn from different label sets S_1 to S_n, where x identifies the label set and y the actual label within that set. Each document can carry multiple labels (coming from different label sets).

Here is an example:

Label Set 1: S_1 = {a,b,c,d,e,f}

Label Set 2: S_2 = {1,2,3,4,5,6}

D_1 = "This is some text", {a,c,e,1,3,4}

D_2 = "This is some text2", {d,f,4}

If I understood correctly, the REST service can be trained with multiple classes. The naive approach would be to train a separate classifier for each label set.

But is there a better way to do this? For example, can I use the union of the labels of each set (as illustrated in D_1 and D_2)?

I ask because the API documentation says the following about the response:

An array [Classes] of up to ten class_name-confidence pairs that are sorted in descending order of confidence. If there are fewer than 10 classes, the sum of the confidence values is 100%.

So if the cardinality of the union of all label sets is greater than 10, the response might omit low-confidence classes. Is there any other issue with using the union of the label sets?

1 Answer

3
votes

The data format specifies that each column after the text is treated as a class label. If you send the training data like this (in your case):

"This is some text", "{a,c,e,1,3,4}"

"This is some text2", "{d,f,4}"

then the service assumes there are two unique classes in the training data: {a,c,e,1,3,4} and {d,f,4}.
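To see why the quoted form collapses into a single class per row, here is a minimal sketch using Python's standard `csv` module. This is only an illustration of general CSV quoting rules; I am assuming the service parses quoted fields the same way, and `skipinitialspace=True` is set only so the parser accepts the space after each comma in the rows above.

```python
import csv
import io

# The two training rows, with each label set quoted as a single field.
raw = (
    '"This is some text", "{a,c,e,1,3,4}"\n'
    '"This is some text2", "{d,f,4}"\n'
)

# skipinitialspace=True lets the parser accept the space after each comma.
rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

# Every column after the text counts as one class label.
classes = {label for row in rows for label in row[1:]}

for row in rows:
    print(len(row) - 1, "label column(s):", row[1:])
print("unique classes:", len(classes))
```

Each row has exactly one label column, so the whole braced string becomes one class name.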

However, you can try training on multiple labels by creating training data like this:

"This is some text", a,c,e,1,3,4

"This is some text2", d,f,4

in which case you are training on 8 unique classes (a, c, d, e, f, 1, 3 and 4). The classification output will then contain the confidence values for these classes. It is up to you to assign the resulting classes back to their respective label sets.
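A rough sketch of both steps in Python, in case it helps: writing the one-label-per-column training rows, and sorting the returned classes back into their label sets. The helper names `to_training_csv` and `group_by_label_set` are hypothetical, and the response is assumed to be a list of dicts with `class_name` and `confidence` keys (matching the pairs the documentation quote describes). The confidence values in the sample are made up for illustration.

```python
import csv
import io

# Label sets from the question.
LABEL_SETS = {
    "S_1": {"a", "b", "c", "d", "e", "f"},
    "S_2": {"1", "2", "3", "4", "5", "6"},
}

documents = [
    ("This is some text", ["a", "c", "e", "1", "3", "4"]),
    ("This is some text2", ["d", "f", "4"]),
]


def to_training_csv(docs):
    """Write one row per document: the text, then one column per label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for text, labels in docs:
        writer.writerow([text, *labels])
    return buffer.getvalue()


def group_by_label_set(classes, label_sets=LABEL_SETS):
    """Assign each returned class to the label set that contains its name."""
    grouped = {name: [] for name in label_sets}
    for entry in classes:
        for name, members in label_sets.items():
            if entry["class_name"] in members:
                grouped[name].append(entry)
    return grouped


training_csv = to_training_csv(documents)
unique_classes = {label for _, labels in documents for label in labels}

# Hypothetical response fragment (assumed shape, made-up confidences).
sample_classes = [
    {"class_name": "a", "confidence": 0.5},
    {"class_name": "4", "confidence": 0.3},
    {"class_name": "d", "confidence": 0.2},
]
grouped = group_by_label_set(sample_classes)

print(training_csv)
print("unique classes:", len(unique_classes))
print(grouped)
```

Note that this only works if label names are unique across the label sets. Also, since the confidences are spread over all classes in the union, the values within a single label set will generally not sum to 100%, so you may want to compare them only relative to each other.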