
I apologize that I'm not good at English.

I'm trying to build my own fully convolutional network (FCN) using TensorFlow, but I'm having difficulty training the model on my own image data, whereas MNIST data works properly.

Here is my FCN model code (not using a pre-trained or pre-built model):

import tensorflow as tf
import numpy as np

Loading MNIST Data

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

images_flatten = tf.placeholder(tf.float32, shape=[None, 784])

images = tf.reshape(images_flatten, [-1,28,28,1]) # conv layers expect [batch, height, width, channels]
labels = tf.placeholder(tf.float32, shape=[None, 10])
keep_prob = tf.placeholder(tf.float32) # dropout keep probability

Convolutional Layers

# Conv. Layer #1
W1 = tf.Variable(tf.truncated_normal([3, 3, 1, 4], stddev = 0.1))
b1 = tf.Variable(tf.truncated_normal([4], stddev = 0.1))    
FMA = tf.nn.conv2d(images, W1, strides=[1,1,1,1], padding='SAME')
# FMA stands for fused multiply-add, i.e. the raw convolution output
RELU = tf.nn.relu(tf.add(FMA, b1))
POOL = tf.nn.max_pool(RELU, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# Conv. Layer #2
W2 = tf.Variable(tf.truncated_normal([3, 3, 4, 8], stddev = 0.1))
b2 = tf.Variable(tf.truncated_normal([8], stddev = 0.1))    
FMA = tf.nn.conv2d(POOL, W2, strides=[1,1,1,1], padding='SAME')
RELU = tf.nn.relu(tf.add(FMA, b2))
POOL = tf.nn.max_pool(RELU, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# Conv. Layer #3
W3 = tf.Variable(tf.truncated_normal([7, 7, 8, 16], stddev = 0.1))
b3 = tf.Variable(tf.truncated_normal([16], stddev = 0.1))   
FMA = tf.nn.conv2d(POOL, W3, strides=[1,1,1,1], padding='VALID')
RELU = tf.nn.relu(tf.add(FMA, b3))

# Dropout
Dropout = tf.nn.dropout(RELU, keep_prob)

# Conv. Layer #4
W4 = tf.Variable(tf.truncated_normal([1, 1, 16, 10], stddev = 0.1))
b4 = tf.Variable(tf.truncated_normal([10], stddev = 0.1))   
FMA = tf.nn.conv2d(Dropout, W4, strides=[1,1,1,1], padding='SAME')
LAST_RELU = tf.nn.relu(tf.add(FMA, b4))

Summary: [Conv-ReLU-Pool] - [Conv-ReLU-Pool] - [Conv-ReLU] - [Dropout] - [Conv-ReLU]

Define Loss, Accuracy

prediction = tf.squeeze(LAST_RELU) 
# Because FCN returns (1 x 1 x class_num) in training

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))

optimizer = tf.train.AdamOptimizer(0.001)    
train = optimizer.minimize(loss)

label_max = tf.argmax(labels, 1)
pred_max = tf.argmax(prediction, 1)
correct_pred = tf.equal(pred_max, label_max)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Training Model

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(10000):
    image_batch, label_batch = mnist.train.next_batch(100)
    sess.run(train, feed_dict={images_flatten: image_batch, labels: label_batch, keep_prob: 0.8})
    if i % 10 == 0:
        tr = sess.run([loss, accuracy], feed_dict={images_flatten: image_batch, labels: label_batch, keep_prob: 1.0})
        print("Step %d, Loss %g, Accuracy %g" % (i, tr[0], tr[1]))

Loss: 0.784 (Approximately)

Accuracy: 94.8% (Approximately)

The problem is that training this model on MNIST works very well, but with my own data the loss is always the same (0.6319) and the output layer is always 0.

There is almost no difference in the code except for the third convolutional layer's filter size. That filter must have the same width and height as its input, which has been shrunk by the two previous pooling layers; that's why the filter size in this layer is [7, 7] for MNIST. (The shape trace is sketched below.)
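
To make that shape reasoning concrete, here is a small sketch (plain Python, just for illustration; trace_shapes is not part of my model code) that traces the spatial size through the two 2x2 max-poolings and the final VALID convolution:

def trace_shapes(height, width, filt_h, filt_w):
    # two 2x2 poolings with stride 2 halve the spatial size twice
    h, w = height // 2, width // 2
    h, w = h // 2, w // 2
    # VALID convolution with a (filt_h x filt_w) filter and stride 1
    out_h, out_w = h - filt_h + 1, w - filt_w + 1
    print("%dx%d -> pooled %dx%d -> conv output %dx%d"
          % (height, width, h, w, out_h, out_w))

trace_shapes(28, 28, 7, 7)      # MNIST: 28x28 -> 7x7 -> 1x1
trace_shapes(128, 64, 32, 16)   # my data: 128x64 -> 32x16 -> 1x1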

What is wrong with my model?

The only code that differs between the two cases (MNIST vs. my own data) is:

Placeholder

My own data is (128 x 64 x 1) and the labels are 'eyes' and 'not_eyes':

images = tf.placeholder(tf.float32, [None, 128, 64, 1])
labels = tf.placeholder(tf.int32, [None, 2])

3rd Convolutional Layer

W3 = tf.Variable(tf.truncated_normal([32, 16, 8, 16], stddev = 0.1))

Feeding (Batch)

image_data, label_data = input_data.get_batch(TRAINING_FILE, 10)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

for i in range(10000):
    image_batch, label_batch = sess.run([image_data, label_data])
    sess.run(train, feed_dict={images: image_batch, labels: label_batch, keep_prob: 0.8})
    if i % 10 == 0: ... # Validation part is almost same, too...

coord.request_stop()
coord.join(threads)

Here "input_data" is an another python file in the same directory, and "get_batch(TRAINING_FILE, 10)" is the function that returns batch data. The code is:

import re
import tensorflow as tf

def get_input_queue(txtfile_name):
    images = []
    labels = [] 

    for line in open(txtfile_name, 'r'): # each line of the txt file: image path, label name, label number
        cols = re.split(',|\n', line)
        labels.append(int(cols[2]))
        images.append(tf.image.decode_jpeg(tf.read_file(cols[0]), channels = 1)) 

    input_queue = tf.train.slice_input_producer([images, labels], shuffle = True)
    return input_queue

def get_batch(txtfile_name, batch_size):
    input_queue = get_input_queue(txtfile_name)
    image = input_queue[0]
    label = input_queue[1]

    image = tf.reshape(image, [128, 64, 1])

    batch_image, batch_label = tf.train.batch([image, label], batch_size)
    batch_label_one_hot = tf.one_hot(tf.to_int64(batch_label), 2, on_value=1.0, off_value=0.0)
    return batch_image, batch_label_one_hot

It doesn't seem to have any problem... :( Please help me!

Comments:

vijay m: Try scaling your inputs (-1 to 1) and set the stddev of your weights to a lower value like 0.01. The output always zero points to dead ReLUs.
James: What does 'scaling inputs (-1 to 1)' mean?

2 Answers

1 vote

Are your inputs scaled appropriately? The JPEGs are in the [0, 255] range and need to be scaled to [-1, 1]. You can try:

 image = tf.reshape(image, [128, 64, 1])
 image = tf.cast(image, tf.float32)      # decode_jpeg returns uint8, so cast before scaling
 image = tf.scalar_mul((1.0/255), image)
 image = tf.subtract(image, 0.5)
 image = tf.multiply(image, 2.0)
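
Once the scaling is wired into get_batch, a quick way to sanity-check it is to pull one batch and look at its range. This is only a sketch that reuses the queue-runner setup from the question:

batch_image, batch_label = get_batch(TRAINING_FILE, 10)

sess = tf.Session()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

imgs = sess.run(batch_image)
print(imgs.min(), imgs.max())   # should be close to -1.0 and 1.0 after scaling

coord.request_stop()
coord.join(threads)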
0 votes

What accuracy are you getting with your model on MNIST? It would be helpful if you posted the code. Are you using the trained model to evaluate the output on your own data?

A general suggestion on setting up a convolutional model is provided here. Here is the model structure suggested by the article:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

Having more than one CONV->RELU pair before pooling improves the learning of complex features. Try N=2 instead of 1; a sketch of such a block is given below.
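
For example, the first block of your model could be expanded like this. This is only a sketch in the same TF1 style as the question; the names W1a, conv1a, etc. are illustrative, and the stddev of 0.01 follows the comment above:

# [CONV -> RELU]*2 -> POOL, i.e. N = 2
W1a = tf.Variable(tf.truncated_normal([3, 3, 1, 4], stddev=0.01))
b1a = tf.Variable(tf.zeros([4]))
conv1a = tf.nn.relu(tf.nn.conv2d(images, W1a, strides=[1, 1, 1, 1], padding='SAME') + b1a)

W1b = tf.Variable(tf.truncated_normal([3, 3, 4, 4], stddev=0.01))
b1b = tf.Variable(tf.zeros([4]))
conv1b = tf.nn.relu(tf.nn.conv2d(conv1a, W1b, strides=[1, 1, 1, 1], padding='SAME') + b1b)

pool1 = tf.nn.max_pool(conv1b, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')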

Some other suggestions:

  1. While you are preparing your data, reduce it to a smaller size than 128x64. Try the same size as the MNIST data:

    image = tf.reshape(image, [28, 28, 1])

  2. If your eye/no-eye images are in color, convert them to grayscale and normalize the values (zero mean, unit std). You can do this using numpy or tf; here is how using numpy (a combined usage sketch follows below):

Grayscale:

  img = np.dot(np.array(img, dtype='float32'), [[0.2989],[0.5870],[0.1140]])

Normalize:

 mean = np.mean(img, dtype='float32')
 std = np.std(img, dtype='float32', ddof=1)
 if std < 1e-4: std = 1.
 img = (img - mean) / std
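
Putting the two snippets together, a usage sketch might look like the following. It assumes the image is loaded as an H x W x 3 RGB array; PIL is just one way to do that, and the file name is hypothetical:

import numpy as np
from PIL import Image

# hypothetical path; any H x W x 3 RGB array works here
img = np.array(Image.open('eye_0001.jpg').convert('RGB'), dtype='float32')

# grayscale: (H, W, 3) -> (H, W, 1)
img = np.dot(img, [[0.2989], [0.5870], [0.1140]])

# normalize to zero mean, unit std
mean = np.mean(img, dtype='float32')
std = np.std(img, dtype='float32', ddof=1)
if std < 1e-4: std = 1.
img = (img - mean) / std

print(img.shape, img.mean(), img.std())   # (H, W, 1), mean ~0, std ~1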