4
votes

In Short:

How do I pass feature maps from a convolutional layer defined in Keras to a special function (a region proposer), whose output is then passed to other Keras layers (e.g. a softmax classifier)?

Long:

I'm trying to implement something like Fast R-CNN (not Faster R-CNN) in Keras, because I'm trying to build the custom architecture shown in the figure below:

from "TextMaps" by Tom Gogar

Here's the code for the figure above (excluding candidates input):

from keras.layers import Input, Dense, Conv2D, ZeroPadding2D, MaxPooling2D, BatchNormalization, concatenate
from keras.activations import relu, sigmoid, linear
from keras.initializers import RandomUniform, Constant, TruncatedNormal, RandomNormal, Zeros

#  Network 1, Layer 1
screenshot = Input(shape=(1280, 1280, 3),  # 3-channel RGB input
                   dtype='float32',
                   name='screenshot')
conv1 = Conv2D(filters=96,
               kernel_size=11,
               strides=(4, 4),
               activation=relu,
               padding='same')(screenshot)
pooling1 = MaxPooling2D(pool_size=(3, 3),
                        strides=(2, 2),
                        padding='same')(conv1)
normalized1 = BatchNormalization()(pooling1)  # https://stats.stackexchange.com/questions/145768/importance-of-local-response-normalization-in-cnn

# Network 1, Layer 2

conv2 = Conv2D(filters=256,
               kernel_size=5,
               activation=relu,
               padding='same')(normalized1)
normalized2 = BatchNormalization()(conv2)
conv3 = Conv2D(filters=384,
               kernel_size=3,
               activation=relu,
               padding='same',
               kernel_initializer=RandomNormal(stddev=0.01),
               bias_initializer=Constant(value=0.1))(normalized2)

# Network 2, Layer 1

textmaps = Input(shape=(160, 160, 128),
                 dtype='float32',
                 name='textmaps')
txt_conv1 = Conv2D(filters=48,
                   kernel_size=1,
                   activation=relu,
                   padding='same',
                   kernel_initializer=RandomNormal(stddev=0.01),
                   bias_initializer=Constant(value=0.1))(textmaps)

# (Network 1 + Network 2), Layer 1

merged = concatenate([conv3, txt_conv1], axis=-1)
merged_padding = ZeroPadding2D(padding=2, data_format=None)(merged)
merged_conv = Conv2D(filters=96,
                     kernel_size=5,
                     activation=relu, padding='same',
                     kernel_initializer=RandomNormal(stddev=0.01),
                     bias_initializer=Constant(value=0.1))(merged_padding)

As seen above, the final step of the network I'm trying to build is ROI pooling, which works like this in Fast R-CNN:

from main publication of Fast R-CNN on Arxiv
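To make the step concrete, here is a plain numpy sketch of what ROI max-pooling computes, assuming proposals come as hypothetical `(x, y, w, h)` boxes in feature-map coordinates (this is not the Keras layer itself, just an illustration of the operation):

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one region of interest to a fixed spatial size.

    feature_map: (H, W, C) array; roi: (x, y, w, h) in feature-map
    coordinates; output_size: (rows, cols) of the pooled grid.
    The region must be at least output_size in each dimension.
    """
    x, y, w, h = roi
    patch = feature_map[y:y + h, x:x + w, :]
    rows, cols = output_size
    # Split the patch into a rows x cols grid and take the max of each cell.
    r_edges = np.linspace(0, patch.shape[0], rows + 1).astype(int)
    c_edges = np.linspace(0, patch.shape[1], cols + 1).astype(int)
    pooled = np.zeros((rows, cols, patch.shape[2]), dtype=patch.dtype)
    for i in range(rows):
        for j in range(cols):
            cell = patch[r_edges[i]:r_edges[i + 1],
                         c_edges[j]:c_edges[j + 1], :]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

fmap = np.random.rand(164, 164, 96)       # shape of merged_conv's output
pooled = roi_max_pool(fmap, (10, 20, 40, 30))
print(pooled.shape)  # (7, 7, 96)
```

Every proposal, whatever its size, comes out as the same fixed `(7, 7, 96)` block, which is what makes a downstream fully-connected classifier possible.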

Now, there is code for an ROI pooling layer in Keras, but I need to pass region proposals to that layer. As you may already know, region proposals are usually produced by an algorithm called Selective Search, which is already implemented in Python.


Problem:

Selective Search can easily take a normal image and give us region proposals like this:

from selective search Github page

Now the problem is that instead of an image I should pass a feature map, from the layer merged_conv as seen in the code above:

merged_conv = Conv2D(filters=96,
                     kernel_size=5,
                     activation=relu, padding='same',
                     kernel_initializer=RandomNormal(stddev=0.01),
                     bias_initializer=Constant(value=0.1))(merged_padding)

The layer above is only a symbolic tensor (a reference to a shape), not actual data, so obviously it won't work with selectivesearch:

>>> import selectivesearch
>>> selectivesearch.selective_search(merged_conv, scale=500, sigma=0.9, min_size=10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/somepath/selectivesearch.py", line 262, in selective_search
    assert im_orig.shape[2] == 3, "3ch image is expected"
AssertionError: 3ch image is expected
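(If the goal were only to satisfy that 3-channel assertion for a single activation map, one workaround would be to rescale the map to uint8 and stack it three times; a numpy sketch, with the selectivesearch call shown only as a comment:)

```python
import numpy as np

def channel_to_rgb(feature_map_2d):
    """Scale a single (H, W) activation map to uint8 and stack it
    into a fake 3-channel image that passes the "3ch" assertion."""
    fmin, fmax = feature_map_2d.min(), feature_map_2d.max()
    scaled = (255 * (feature_map_2d - fmin) / (fmax - fmin + 1e-8))
    scaled = scaled.astype(np.uint8)
    return np.stack([scaled] * 3, axis=-1)  # shape (H, W, 3)

fake_rgb = channel_to_rgb(np.random.rand(164, 164))
print(fake_rgb.shape)  # (164, 164, 3)
# img_lbl, regions = selectivesearch.selective_search(
#     fake_rgb, scale=500, sigma=0.9, min_size=10)
```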

I guess I should do:

from keras.models import Model
import numpy as np
import cv2
import selectivesearch

img = cv2.imread('someimage.jpg')  # assumes a 1280x1280 screenshot
img = img.reshape(-1, 1280, 1280, 3)
textmaps_data = np.ones((1, 160, 160, 128))  # just for example; renamed so it
                                             # doesn't shadow the Input layer
model = Model(inputs=[screenshot, textmaps], outputs=merged_conv)
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
# predict() returns shape (1, 164, 164, 96); move channels first
feature_maps = model.predict([img, textmaps_data])[0].transpose(2, 0, 1)
feature_map_1 = feature_maps[0]  # a single (164, 164) activation map
img_lbl, regions = selectivesearch.selective_search(feature_map_1, scale=500, sigma=0.9, min_size=10)

But then what if I want to add, let's say, a softmax classifier which accepts the regions variable? (By the way, I am aware that there are a few problems with selective search taking anything other than 3-channel input, but this is not relevant to the question.)
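(For intuition, here is a numpy sketch of why ROI pooling matters for the classifier step: once every region is pooled to a fixed grid, each one flattens to a fixed-length vector that a softmax head can consume. The shapes and weights below are hypothetical toy values, not trained layers; in Keras the same role would be played by Dense layers with a softmax activation.)

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical setup: 5 regions already pooled to a fixed 7x7 grid
# over 96 channels, classified into 4 classes.
n_regions, pooled_h, pooled_w, channels, n_classes = 5, 7, 7, 96, 4
pooled_rois = rng.random((n_regions, pooled_h, pooled_w, channels))

# Toy classifier weights; in practice these would be trained layers.
W = rng.standard_normal((pooled_h * pooled_w * channels, n_classes)) * 0.01
b = np.zeros(n_classes)

flat = pooled_rois.reshape(n_regions, -1)  # one fixed-length row per region
scores = softmax(flat @ W + b)             # (n_regions, n_classes)
print(scores.shape)  # (5, 4)
```

The key point: without pooling to a fixed size, regions of different shapes could not share one weight matrix `W`.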

Question:

Region proposal (using selective search) is an important part of this neural network. How can I modify it so that it takes feature maps (activations) from the convolutional layer merged_conv?

Maybe I should create my own Keras layer?

1
You can try modifying the selectivesearch file to match the dims of your feature map. It was written for a 3-channel input image. After adapting it, you can easily pass the result through ROI pooling. - Ankish Bansal
@AnkishBansal Thank you for the response. Yes, that is a second problem. In fact, I'm not sure whether the feature map was extracted properly - you can see it in my last code block, where I take a single feature map as an example. Should I take each feature map (with shape (164, 164)) and pass it to selectivesearch? Or should I modify selectivesearch so that it accepts the full input of shape (164, 164, 96)? Thank you again. - ShellRox

1 Answer

1
votes

To the best of my understanding, selective search takes an input and returns n patches of different sizes (H, W). In your case the feature map has dims (164, 164, 96), so you can treat a (164, 164) slice as the input for selective search, and it will give you n patches, for example of sizes (H1, W1), (H2, W2), .... You can then append all the channels as-is to each patch, so they become of dims (H1, W1, 96), (H2, W2, 96), ....

Note: there is a downside to doing this, too. The Selective Search algorithm uses a strategy in which it breaks the image into a grid and then re-joins patches according to the heatmap of the object. You would not be able to do that on a feature map. But you can use a random search method on it instead, and that can be useful.
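The channel-appending step described above can be sketched in numpy as follows, assuming the proposals come back as hypothetical `(x, y, w, h)` boxes in the coordinates of the 2D slice:

```python
import numpy as np

def attach_channels(feature_map, boxes):
    """Given an (H, W, C) feature map and boxes (x, y, w, h) proposed
    on its 2D projection, crop all C channels for each box."""
    return [feature_map[y:y + h, x:x + w, :] for (x, y, w, h) in boxes]

fmap = np.random.rand(164, 164, 96)
# Boxes as selective search might return them (made-up example values).
boxes = [(0, 0, 20, 30), (50, 40, 64, 64)]
patches = attach_channels(fmap, boxes)
print([p.shape for p in patches])  # [(30, 20, 96), (64, 64, 96)]
```

Each patch keeps all 96 channels and can then be handed to an ROI pooling layer.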