How to map input image with neurons in first conv layer in CNN?

Question

I just completed ANN course and started learning CNN. I have basic understanding of padding and stride operation works in CNN.

But have difficultly in mapping input image with neurons in first conv layer but i have basic understanding of how input features are mapped to first hidden layer in ANN.

What is best way of understanding mapping between input image with neurons in first conv layer?

How can I clarify my doubts about the below code example? Code is taken from DL course in Coursera.

    def initialize_parameters():
        """
        Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                            W1 : [4, 4, 3, 8]
                            W2 : [2, 2, 8, 16]
        Returns:
        parameters -- a dictionary of tensors containing W1, W2
        """

        tf.set_random_seed(1)                              # so that your "random" numbers match ours

        ### START CODE HERE ### (approx. 2 lines of code)
        W1 = tf.get_variable("W1",[4,4,3,8],initializer = tf.contrib.layers.xavier_initializer(seed = 0))
        W2 = tf.get_variable("W2",[2,2,8,16],initializer = tf.contrib.layers.xavier_initializer(seed = 0))
        ### END CODE HERE ###

        parameters = {"W1": W1,
                      "W2": W2}

        return parameters


     def forward_propagation(X, parameters):
        """
        Implements the forward propagation for the model:
        CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

        Arguments:
        X -- input dataset placeholder, of shape (input size, number of examples)
        parameters -- python dictionary containing your parameters "W1", "W2"
                      the shapes are given in initialize_parameters

        Returns:
        Z3 -- the output of the last LINEAR unit
        """

        # Retrieve the parameters from the dictionary "parameters" 
        W1 = parameters['W1']
        W2 = parameters['W2']

        ### START CODE HERE ###
        # CONV2D: stride of 1, padding 'SAME'
        Z1 = tf.nn.conv2d(X,W1, strides = [1,1,1,1], padding = 'SAME')
        # RELU
        A1 = tf.nn.relu(Z1)
        # MAXPOOL: window 8x8, sride 8, padding 'SAME'
        P1 = tf.nn.max_pool(A1, ksize = [1,8,8,1], strides = [1,8,8,1], padding = 'SAME')
        # CONV2D: filters W2, stride 1, padding 'SAME'
        Z2 = tf.nn.conv2d(P1,W2, strides = [1,1,1,1], padding = 'SAME')
        # RELU
        A2 = tf.nn.relu(Z2)
        # MAXPOOL: window 4x4, stride 4, padding 'SAME'
        P2 = tf.nn.max_pool(A2, ksize = [1,4,4,1], strides = [1,4,4,1], padding = 'SAME')
        # FLATTEN
        P2 = tf.contrib.layers.flatten(P2)
        # FULLY-CONNECTED without non-linear activation function (not not call softmax).
        # 6 neurons in output layer. Hint: one of the arguments should be "activation_fn=None" 
        Z3 = tf.contrib.layers.fully_connected(P2, 6,activation_fn=None)
        ### END CODE HERE ###

        return Z3

    with tf.Session() as sess:
        np.random.seed(1)
        X, Y = create_placeholders(64, 64, 3, 6)
        parameters = initialize_parameters()
        Z3 = forward_propagation(X, parameters)
        init = tf.global_variables_initializer()
        sess.run(init)
        a = sess.run(Z3, {X: np.random.randn(1,64,64,3), Y: np.random.randn(1,6)})
        print("Z3 = " + str(a))

How is this input image of size 64*64*3 is processed by 8 filter of each size 4*4*3?

stride = 1, padding = same and batch_size = 1.

What I have understood till now is each neuron in first conv layer will have 8 filters and each of them having size 4*4*3. Each neuron in first convolution layer will take portion of the input image which is same as filter size (which is here 4*4*3) and apply the convolution operation and produces eight 64*64 features mapping.

If my understanding is correct then:

1> Why we need striding operation since kernel size and portion input image proceed by each neuron is same, If we apply stride = 1(or 2) then boundary of portion of input image is cross which is something we don't need right ?

2> How do we know which portion of input image (same as kernel size) is mapped which neuron in first conv layer?

If not then:

3> How input image is passed on neurons in first convolution layer, Is is complete input image is passed on to each neuron (Like in fully connected ANN, where all the input features are mapped to each neuron in first hidden layer)?

Or portion of input image ? How do we know which portion of input image is mapped which neuron in first conv layer?

4> Number of kernel specified above example (W1= [4, 4, 3, 8]) is per neuron or total number of kernel in fist conv layer ?

5> how do we know how may neurons used by above example in first convolution layer.

6> Is there any relationship between number of neurons and number of kernel first conv layer.

Jay Jay · Accepted Answer · 2020-03-16T06:39:46

I found relevant answers to my questions and posting same here.

First of all concept of neuron is exist in conv layer as well but it's indirectly. Basically each neuron in conv layer deals with portion of input image which is same as the size of the kernel used in that conv layer.

Each neuron will focus on only particular portion of input image (Where in fully-connected ANN each neuron focus on whole image) and each neuron use n number of filters/kernels to get more insight of particular portion of image.

These n filters/kernels shared by all the neurons in given conv layer. Because of these weight(kernel/filter) sharing nature conv layer will have less number of parameter to learn. Where as in fully connected ANN network each neuron as it's own weight matrix and hence number of parameter to learn is more.

Now the number of neurons in given conv layer 'L' is depends on input_size (output of previous layer L-1), Kernel_size used in layer L , Padding used in layer L and Stride used in layer L.

Now let answer each of the questions specified above.

1> How do we know which portion of input image (same as kernel size) is mapped which neuron in first conv layer?

    From above code example for conv layer 1:
    Batch size = 1
    Input image size = 64*64*3  
    Kernel size = 4*4*3 ==> Taken from W1
    Number of kernel = 8 ==> Taken from W1
    Padding = same
    stride = 1

    Stride = 1 means that you are sliding the kernel one pixel at a time. Let's consider x axis and number pixels 1, 2, 3 4 ... and 64. 

    The first neuron will see pixels 1 2,3 and 4, then the kernel is shifted by one pixel and the next neuron will see pixels 2 3, 4 and 5 and last neuron will see pixels 61, 62, 63 and 64 This happens if you use valid padding. 

    In case of same padding, first neuron will see pixels 0, 1, 2, and 3, the second neuron will see pixels 1, 2, 3 and 4, the last neuron will see pixels 62,63, 64 and (one zero padded). 

    In case the same padding case, you end up with the output of the same size as the image (64 x 64 x 8). In the case of valid padding, the output is (61 x 61 x 8).

    Where 8 in output represent the number of filters.

2> How input image is passed on neurons in first convolution layer, Is is complete input image is passed on to each neuron (Like in fully connected ANN, where all the input features are mapped to each neuron in first hidden layer)?

Neurons looks for only portion of input image, Please refer the first question answer you will be able map between input image and neuron.

3> Number of kernel specified above example (W1= [4, 4, 3, 8]) is per neuron or total number of kernel in fist conv layer ?

It's total number of kernels for that layer and all the neuron i that layer will share same kernel for learning different portion of input image. Hence in convnet number of parameter to be learn is less compare to fully-connected ANN.

4> How do we know how may neurons used by above example in first convolution layer ?

It depends on input_size (output of previous layer L-1), Kernel_size used in layer L , Padding used in layer L and Stride used in layer L. Please refer first question answer above for more clarification.

5> Is there any relationship between number of neurons and number of kernel first conv layer

There is no relationship with respect numbers, But each neuron uses n number of filters/kernel (these kernel are shared among all the neurons in particular layer)to learn more about particular portion of input image.

Below sample code will help us clarify the internal implementation of convolution operation.

def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function

    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"

    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward() function
    """

    # Retrieve dimensions from A_prev's shape (≈1 line)  
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve dimensions from W's shape (≈1 line)
    (f, f, n_C_prev, n_C) = W.shape

    # Retrieve information from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    pad = hparameters['pad']

    # Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
    n_H = int(np.floor((n_H_prev-f+2*pad)/stride)) + 1
    n_W = int(np.floor((n_W_prev-f+2*pad)/stride)) + 1

    # Initialize the output volume Z with zeros. (≈1 line)
    Z = np.zeros((m,n_H,n_W,n_C))

    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev,pad)

    for i in range(m):                               # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]                               # Select ith training example's padded activation
        for h in range(n_H):                           # loop over vertical axis of the output volume
            for w in range(n_W):                       # loop over horizontal axis of the output volume
                for c in range(n_C):                   # loop over channels (= #filters) of the output volume

                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h*stride
                    vert_end = vert_start+f
                    horiz_start = w*stride
                    horiz_end = horiz_start+f

                    # Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
                    a_slice_prev = a_prev_pad[vert_start:vert_end,horiz_start:horiz_end,:]

                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
                    Z[i, h, w, c] = conv_single_step(a_slice_prev,W[:,:,:,c],b[:,:,:,c])                                      

    return Z



A_prev = np.random.randn(1,64,64,3)
W = np.random.randn(4,4,3,8)
#Don't worry about bias , tensorflow will take care of this.
b = np.random.randn(1,1,1,8)
hparameters = {"pad" : 1,
               "stride": 1}

Z = conv_forward(A_prev, W, b, hparameters)

How to map input image with neurons in first conv layer in CNN?

1 Answers