
I'm trying to concatenate two models into one. I have a BERT model and an EfficientNet model.

input_text = model_Bert.inputs[:2]
text = model_Bert(input_text)
input_img = model_EfNet.layers[0].input
img = model_EfNet(input_img)
concatenated = layers.concatenate([text, img], axis=1) #same dimension
dense = layers.Dense(512, activation='relu')(concatenated)
dense = layers.Dense(128, activation='relu')(dense)
dense = layers.Dropout(0.3)(dense)
outputs = layers.Dense(2, activation='softmax', name = 'real_output')(dense)

model_Multimodal = keras.models.Model(inputs=[input_text, input_img], outputs=outputs)

But I got this error:


ValueError                                Traceback (most recent call last)
 in 
      9 outputs = layers.Dense(2, activation='softmax', name = 'real_output')(dense)
     10 
---> 11 model_Multimodal = keras.models.Model(inputs=[input_text, input_img], outputs=outputs)

~/anaconda3/lib/python3.7/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89             warnings.warn('Update your ' + object_name + ' call to the ' +
     90                           'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/anaconda3/lib/python3.7/site-packages/keras/engine/network.py in __init__(self, *args, **kwargs)
     92                 'inputs' in kwargs and 'outputs' in kwargs):
     93             # Graph network
---> 94             self._init_graph_network(*args, **kwargs)
     95         else:
     96             # Subclassed network

~/anaconda3/lib/python3.7/site-packages/keras/engine/network.py in _init_graph_network(self, inputs, outputs, name, **kwargs)
    167                                      'must come from keras.layers.Input. '
    168                                      'Received: ' + str(x) +
--> 169                                      ' (missing previous layer metadata).')
    170             # Check that x is an input tensor.
    171             layer, node_index, tensor_index = x._keras_history

ValueError: Input tensors to a Model must come from `keras.layers.Input`. Received: [<tf.Tensor 'Input-Token_1:0' shape=(None, 128) dtype=float32>, <tf.Tensor 'Input-Segment_1:0' shape=(None, 128) dtype=float32>] (missing previous layer metadata).
I've tried modifying lines 2 and 4 like this: text = model_Bert.layers[105].output and img = model_EfNet.layers[11].output, but the error is the same. – Elia Fabbris

1 Answer


It is difficult to tell from your code without knowing which BERT implementation you're using. In particular, is it a headless model? My understanding (which could be incorrect) is that BERT without a head produces an output of shape (batch_size, seq_length, embedding_dim); in other words, the output shape varies with the input sequence length. As such, I don't see how your image embedding will always have the same shape as the output of your transformer.
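
For example, here is a minimal sketch of that shape mismatch. The sizes are made up, and mean-pooling is just one way to reduce the sequence output to a fixed-size vector, not what I do below:

    import tensorflow as tf

    batch_size, seq_length, hidden_dim = 8, 128, 768   # illustrative sizes only
    feature_dim = 1280                                  # e.g. a pooled CNN embedding

    text_out = tf.zeros((batch_size, seq_length, hidden_dim))  # headless BERT-style output
    img_out = tf.zeros((batch_size, feature_dim))               # fixed-size image embedding

    # A 3-D sequence output and a 2-D image embedding cannot be concatenated
    # directly; the transformer output first has to be reduced to a fixed-size
    # vector, e.g. by pooling over the sequence dimension:
    text_vec = tf.reduce_mean(text_out, axis=1)     # (batch_size, hidden_dim)
    fused = tf.concat([text_vec, img_out], axis=1)  # (batch_size, hidden_dim + feature_dim)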

I'm using the transformers library and a pre-trained CNN to create a bimodal language model in roughly the following fashion:

    # numpy and tensorflow imports added for completeness; `transformer` and
    # `mlm_head` are attributes of the enclosing tf.keras.Model (see below)
    import numpy as np
    import tensorflow as tf

    def call(encoded_sentences, img_embeddings, **transformer_kwargs):
        # run the tokenized sentences through the transformer
        outputs = transformer(encoded_sentences, **transformer_kwargs)
        last_hidden_state, *_ = outputs
        batch_size, batch_seq_len, last_hidden_dim = last_hidden_state.shape

        # reshape the image embeddings and repeat them once per token position
        batch_size, *img_emb_shape = img_embeddings.shape
        img_emb_flattened = tf.reshape(img_embeddings,
                                       (batch_size, np.prod(img_emb_shape)))
        emb_flattened_reps = tf.repeat(tf.expand_dims(img_emb_flattened, 1),
                                       batch_seq_len, axis=1)

        # concatenate the language and image embeddings along the feature axis
        embs_concat = tf.concat([last_hidden_state, emb_flattened_reps], 2)

        # generate MLM predictions over the input sequence
        training = transformer_kwargs.get('training', False)
        prediction_scores = mlm_head(embs_concat, training=training)

        # add hidden states and attentions if they are present
        outputs = (prediction_scores,) + outputs[2:]

        return outputs

Where, in my case, transformer is a TFAlbertModel and mlm_head is a TFAlbertMLMHead (if this code seems a bit confusing, it's because I've adapted it from a class method of a tf.keras.Model). Also, keep in mind that the encoded_sentences have been passed through a tokenizer and the img_embeddings are extracted from a pre-trained CNN.
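
For completeness, here's a rough sketch of how such inputs might be prepared; the model names, image size, and the EfficientNet feature extractor below are illustrative assumptions, not exactly what I use:

    import tensorflow as tf
    from transformers import AlbertTokenizer, TFAlbertModel

    tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
    transformer = TFAlbertModel.from_pretrained('albert-base-v2')

    # encoded_sentences: token ids produced by the tokenizer
    encoded = tokenizer(['a dog runs on the beach'], return_tensors='tf',
                        padding='max_length', max_length=16)
    encoded_sentences = encoded['input_ids']                     # shape (1, 16)

    # img_embeddings: pooled features from a pre-trained CNN
    cnn = tf.keras.applications.EfficientNetB0(include_top=False, pooling='avg')
    img_embeddings = cnn(tf.zeros((1, 224, 224, 3)))             # shape (1, 1280)

From there, the call function above can be invoked with encoded_sentences and img_embeddings (plus any transformer kwargs such as attention_mask), assuming mlm_head has been constructed separately.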

See this repository for my notebook exploring this method for image captioning.