How can I handle a JSON File in a multi label classification problem?

Question

I have a multi-label problem. I have read many tutorials and all work with CSV. But I have a JSON. An image can have one to three labels. This is what the JSON looks like: {"PIC_NAME": ["Label1"], "PIC_NAME": ["Label2", "Label6"], "PIC_NAME": ["Label20"], "PIC_NAME": ["Label4"], "PIC_NAME": ["Label5"], "PIC_NAME": ["Label1"], "PIC_NAME": ["Label15"], ...

The CSV files in the tutorials use binary labels, but I only have strings. There are 20 different labels. To match the tutorials, each picture should be marked with 20 binary numbers: a 1 for every label that applies (for example Label1) and a 0 for all the others. I work with Keras.

Does anyone have any idea how I can solve the problem with a JSON? This is an example of a tutorial I have read: https://www.analyticsvidhya.com/blog/2019/04/build-first-multi-label-image-classification-model-python/

As an example, suppose the possible labels are cat, dog and bird, and a picture shows a dog and a bird. Then its labels should look like this: 0 1 1. Because there is no cat in the picture, the first value is 0. I would like my labels to look like the ones in the tutorial above.

Baptiste Pouthier · Accepted Answer · 2019-07-30T12:28:50

If I understood your problem correctly, you want to replace ["Label1"] with [1 0 0 ...], i.e. one-hot encode your labels?

If so, you can for example look at this, which deals with a multi-label problem.

You can do something like this:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [("blue", "jeans"),("blue", "dress"),("red", "dress"),("red", "shirt"), 
         ("blue", "shirt"),("black", "jeans")]

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)

print(labels)

It prints:

[[0 1 0 1 0 0]
 [0 1 1 0 0 0]
 [0 0 1 0 1 0]
 [0 0 0 0 1 1]
 [0 1 0 0 0 1]
 [1 0 0 1 0 0]]

Then you have your labels one-hot encoded.

In your problem you'll have ["Label2", "Label6"] instead of clothing.
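Since the labels in the question come from a JSON file rather than a Python list, here is a minimal sketch of the loading step. It assumes every key in the JSON is a unique image name (the names `img_001.jpg` etc. are made up for illustration) and that the 20 labels are called Label1 to Label20. Passing `classes=` to `MultiLabelBinarizer` fixes the column order and keeps all 20 columns even if some label never occurs in the data.

```python
import json

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the contents of the JSON file. With a real file,
# use `with open(path) as f: annotations = json.load(f)` instead.
raw = (
    '{"img_001.jpg": ["Label1"], '
    '"img_002.jpg": ["Label2", "Label6"], '
    '"img_003.jpg": ["Label20"]}'
)
annotations = json.loads(raw)

# Keep image names and label lists in the same order.
image_names = list(annotations.keys())
label_lists = list(annotations.values())

# Assumption: the 20 labels are named Label1 ... Label20.
all_classes = [f"Label{i}" for i in range(1, 21)]

mlb = MultiLabelBinarizer(classes=all_classes)
y = mlb.fit_transform(label_lists)

print(y.shape)  # one row per image, one column per label
print(y[1])     # row for img_002.jpg: 1 in the Label2 and Label6 columns
```

Note that Python's `json` module keeps only the last value when a key is repeated, so the image names in the file need to be unique. Row `i` of `y` then belongs to `image_names[i]`.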

EDIT: it also works if a sample has only one label instead of two:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [("blue",),("blue", "dress"),("red", "dress"),("red", "shirt"), 
         ("blue", "shirt"),("black", "jeans")]

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)

print(labels)

To have an index of your classes, you can use:

print(mlb.classes_)
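As a small sketch of how `classes_` can be used: it gives the label behind each column, so it can be turned into a name-to-column lookup, and `inverse_transform` converts binary rows back into label names. The variable names `encoded`, `column_of` and `decoded` are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [("blue",), ("blue", "dress"), ("red", "shirt")]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)

# classes_ holds the label for each column, in sorted order.
column_of = {name: idx for idx, name in enumerate(mlb.classes_)}

# inverse_transform maps binary rows back to tuples of label names.
decoded = mlb.inverse_transform(encoded)

print(column_of)
print(decoded)
```

The same call can be applied to thresholded model predictions to get readable label names back.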

EDIT2:

For your example:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [("Label1",),("Label2",),("Label3",),("Label4","Label1"),        
         ("Label4","Label5")]

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(labels)

print(labels)

print(mlb.classes_)

EDIT3:

These will work:

labels = [["Label1"],["Label2"],["Label3"],["Label4","Label1"], 
         ["Label4","Label5"]]

labels = [("Label1",),("Label2",),("Label3",),("Label4","Label1"), 
         ("Label4","Label5")]

This won't, because without the trailing comma ("Label1") is just a string, not a one-element tuple:

labels = [("Label1"),("Label2"),("Label3"),("Label4","Label1"), 
         ("Label4","Label5")]
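The difference between the working and non-working forms can be checked in plain Python, without scikit-learn. This is only a sketch to show what the parentheses-only form evaluates to; the variable names are illustrative.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
no_comma = ("Label1")
with_comma = ("Label1",)

print(type(no_comma).__name__)    # str
print(type(with_comma).__name__)  # tuple

# Iterating a bare string yields its characters, which is what a
# label binarizer would treat as the "labels" of that sample.
print(list(no_comma))    # ['L', 'a', 'b', 'e', 'l', '1']
print(list(with_comma))  # ['Label1']
```

Lists such as ["Label1"], which is what `json.load` returns for the file in the question, do not have this pitfall.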
