I'm using TensorFlow for a multi-target regression problem. Specifically, it is a convolutional network with pixel-wise labeling: the input is an image and the label is a "heat-map" in which each pixel has a float value. The ground-truth value for each pixel is lower bounded by zero and, while technically having no upper bound, usually gets no larger than 1e-2.
Without batch normalization, the network is able to give a reasonable heat-map prediction. With batch normalization, the network takes much longer to reach a reasonable loss value, and the best it does is predict the average value for every pixel. This is using the tf.contrib.layers conv2d and batch_norm methods, with batch_norm passed as conv2d's normalizer_fn (or not passed, in the case of no batch normalization). I had briefly tried batch normalization on another (single-value) regression network and had trouble then as well, though I hadn't tested that as extensively. Is there a problem using batch normalization on regression problems in general? Is there a common solution?
If not, what could cause batch normalization to fail on such an application? I've attempted a variety of initializations, learning rates, etc. I would expect that the final layer (which of course does not use batch normalization) could use its weights to scale the output of the penultimate layer to the appropriate regression values. Failing that, I removed batch norm from that layer, but with no improvement. I've attempted a small classification problem using batch normalization and saw no problem there, so it seems reasonable that the failure is somehow due to the nature of the regression problem, but I don't know how that could cause such a drastic difference. Is batch normalization known to have trouble on regression problems?
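To make my expectation about the final layer concrete, here is a minimal pure-Python sketch (not my actual TensorFlow code). It assumes training-mode batch statistics with no learned scale or shift. The activations and the final-layer `weight` and `bias` are made-up values chosen by hand for illustration, not learned ones.

```python
import random
import statistics


def batch_norm(values, eps=1e-5):
    """Normalize a batch to zero mean and (approximately) unit variance.

    Uses batch statistics only, i.e. training-mode behaviour without the
    learned scale/shift parameters.
    """
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def final_layer(values, weight, bias):
    """A single linear unit without batch normalization."""
    return [weight * v + bias for v in values]


random.seed(0)

# Hypothetical penultimate-layer activations for one channel over a batch.
activations = [random.gauss(3.0, 2.0) for _ in range(1000)]
normalized = batch_norm(activations)

# Hypothetical final-layer parameters, picked by hand (assumption) so the
# output lands in the small range of the heat-map labels.
outputs = final_layer(normalized, weight=2e-3, bias=5e-3)

out_mean = statistics.fmean(outputs)
out_std = statistics.pstdev(outputs)
print(out_mean, out_std)
```

In this sketch the output mean is set by the bias and the output spread by the weight, regardless of the scale of the incoming activations. That is why I would expect the final layer to be able to map normalized activations onto targets on the order of 1e-2.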