I'm trying to solve the following with the IBM Watson Natural Language Classifier on IBM Bluemix:
I have N training documents D, each labeled with labels l_x_y drawn from different label sets S_1 to S_n, where x identifies the label set and y the actual label within that set. Each document can be labeled with multiple labels (coming from different label sets).
Here is an example:
Label Set 1: S_1 = {a,b,c,d,e,f}
Label Set 2: S_2 = {1,2,3,4,5,6}
D_1 = "This is some text", {a,c,e,1,3,4}
D_2 = "This is some text2", {d,f,4}
If I understood correctly, the REST service can be trained with multiple classes. The naive approach would be to just train a separate classifier for each label set.
But is there a better way to do this? For example, can I use the union of the labels from all sets (as illustrated in D_1 and D_2)?
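To make the two options concrete, here is a minimal sketch of how I picture the training data for each of them. It assumes the classifier is trained from CSV rows of the form text,class1,class2,... (that is my understanding of the format, not something I have verified), and the helper names union_rows, per_set_rows and to_csv are made up for this illustration.

```python
import csv
import io

# Label sets and documents taken from the example in the question.
LABEL_SETS = {
    "S_1": {"a", "b", "c", "d", "e", "f"},
    "S_2": {"1", "2", "3", "4", "5", "6"},
}
DOCUMENTS = [
    ("This is some text", ["a", "c", "e", "1", "3", "4"]),
    ("This is some text2", ["d", "f", "4"]),
]


def union_rows(documents):
    """Union approach: one row per document carrying all of its labels."""
    return [[text, *labels] for text, labels in documents]


def per_set_rows(documents, label_sets):
    """Naive approach: one list of rows per label set (one classifier each)."""
    result = {}
    for name, members in label_sets.items():
        rows = []
        for text, labels in documents:
            kept = [label for label in labels if label in members]
            if kept:  # skip documents that have no label from this set
                rows.append([text, *kept])
        result[name] = rows
    return result


def to_csv(rows):
    """Serialize rows as CSV text (assumed format: text,class1,class2,...)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

With the union approach a single file holds both documents with all their labels; with the naive approach D_2 contributes {d,f} to the S_1 file and {4} to the S_2 file.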
I ask because the API documentation says the following about the response:
An array [Classes] of up to ten class_name-confidence pairs that are sorted in descending order of confidence. If there are fewer than 10 classes, the sum of the confidence values is 100%.
So this means that if the cardinality of the union of all label sets is greater than 10, low-confidence classes might be omitted. Is there any other issue with using the union of the label sets?
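To show what worries me, here is a rough sketch of how a response from a union-trained classifier would have to be split back into the individual label sets. The keys class_name and confidence follow the documentation quote; representing the classes as a list of dicts, the helper names, and the sample values are my own assumptions and are made up purely for illustration.

```python
LABEL_SETS = {
    "S_1": {"a", "b", "c", "d", "e", "f"},
    "S_2": {"1", "2", "3", "4", "5", "6"},
}

# Made-up response classes, sorted by descending confidence as the docs
# describe. These are NOT real classifier output.
SAMPLE_CLASSES = [
    {"class_name": "a", "confidence": 0.4},
    {"class_name": "4", "confidence": 0.3},
    {"class_name": "c", "confidence": 0.2},
    {"class_name": "1", "confidence": 0.1},
]


def split_by_label_set(classes, label_sets):
    """Group returned (class_name, confidence) pairs by their label set."""
    grouped = {name: [] for name in label_sets}
    for entry in classes:
        for name, members in label_sets.items():
            if entry["class_name"] in members:
                grouped[name].append((entry["class_name"], entry["confidence"]))
    return grouped


def missing_labels(classes, label_sets):
    """List, per label set, the labels that did not appear in the response."""
    returned = {entry["class_name"] for entry in classes}
    return {
        name: sorted(members - returned)
        for name, members in label_sets.items()
    }
```

Since the confidences are shared across the whole union, the labels of S_1 and S_2 compete with each other for the same ranked slots, and any label cut off by the limit shows up in missing_labels without a confidence value at all.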