I have a dataset that includes both images and text features. The labels for the training data are a 2-dimensional array of 1s and 0s with the same shape as the input images.

So basically, the training inputs are:

  • Input image with shape of (X,Y),
  • Additional feature set (i.e. text features) with shape (Z,).

And training labels have the shape of (X,Y).

I am trying to train a model on this data using TensorFlow/Keras. I know I can train a model whose input size is (X * Y) + Z, but I have read that this isn't the best way to mix image features with additional data.

So my questions are:

1) How would I set up my model to handle the mixed input types?

2) Since my output is the same size as my image, would I need to define an output layer of size (X * Y)? How would I specify the output layer so that it can take multiple values, that is, so that any number of locations in the output can be 1 or 0?

1 Answer

One way is to define two independent sub-models, one for the text data and one for the image data, and then merge the outputs of those sub-models to create the final model:

---------------        ---------------
- Input Image -        - Input Text  -
---------------        ---------------
       |                       |
       |                       |
       |                       |
---------------        ---------------------  
- Image Model -        -     Text Model    -
- (e.g. CNNs) -        - (e.g. Embeddings, -
---------------        -  LSTM, Conv1D)    -
       \               ---------------------
        \                     /
         \                   /
          \                 /
           \               /
            \             /
             \           /
              \         /
               \       /
           ----------------------
           -      Merge         -
           - (e.g. concatenate) -
           ----------------------
                     |
                     |
                     |
           ----------------------
           -      Upsample      -
           - (e.g. Dense layer, -
           -   transpose-conv)  -
           ----------------------
                     |
                     |
                     |
                -----------
                -  Output -
                -----------

Each box corresponds to one or more layers. There are different ways to implement them and set their parameters; the suggestions in each box are only examples.
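
As a rough illustration, the diagram could be sketched with the Keras functional API as below. This is only one possible reading of it, not a tuned architecture. It assumes the image has a single channel, that X and Y are divisible by 4, and that the text features are already a numeric vector of shape (Z,), so a Dense layer stands in for the text model. The function name `build_model`, the input names, and all layer sizes are made up for the example. The final layer uses a sigmoid per pixel, so each output location is predicted independently as a value between 0 and 1.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(x, y, z):
    """Two-branch model: image (x, y, 1) + feature vector (z,) -> mask (x, y).

    Assumes x and y are divisible by 4 (two 2x poolings, two 2x upsamplings).
    All layer sizes are arbitrary placeholders.
    """
    image_in = layers.Input(shape=(x, y, 1), name="image")
    feats_in = layers.Input(shape=(z,), name="features")

    # Image model: a small CNN that downsamples by 4 and ends in a vector.
    h = layers.Conv2D(16, 3, padding="same", activation="relu")(image_in)
    h = layers.MaxPooling2D(2)(h)
    h = layers.Conv2D(32, 3, padding="same", activation="relu")(h)
    h = layers.MaxPooling2D(2)(h)
    h = layers.Flatten()(h)
    h = layers.Dense(64, activation="relu")(h)

    # Text model: a Dense layer, since the features are assumed to be
    # a fixed-length numeric vector rather than token sequences.
    f = layers.Dense(32, activation="relu")(feats_in)

    # Merge: concatenate the two branch outputs.
    merged = layers.Concatenate()([h, f])

    # Upsample: Dense layer to a coarse grid, then transpose convolutions
    # (stride 2 with "same" padding doubles height and width each time).
    u = layers.Dense((x // 4) * (y // 4) * 8, activation="relu")(merged)
    u = layers.Reshape((x // 4, y // 4, 8))(u)
    u = layers.Conv2DTranspose(16, 3, strides=2, padding="same",
                               activation="relu")(u)
    u = layers.Conv2DTranspose(8, 3, strides=2, padding="same",
                               activation="relu")(u)

    # Output: one sigmoid unit per pixel, reshaped to match the (x, y) labels.
    out = layers.Conv2D(1, 1, activation="sigmoid")(u)
    out = layers.Reshape((x, y), name="mask")(out)

    model = keras.Model(inputs=[image_in, feats_in], outputs=out)
    # Binary cross-entropy treats every pixel as its own 0/1 prediction.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

With a model like this, training would look like `model.fit([images, features], labels, ...)`, where `images` has shape (N, X, Y, 1), `features` has shape (N, Z), and `labels` has shape (N, X, Y). Thresholding the predictions (for example at 0.5) gives the final 1/0 array.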