8
votes

I understand the role of the bias node in neural nets, and why it is important for shifting the activation function in small networks. My question is this: is the bias still important in very large networks (more specifically, a convolutional neural network for image recognition using the ReLU activation function, 3 convolutional layers, 2 hidden layers, and over 100,000 connections), or does its effect get lost in the sheer number of activations occurring?

The reason I ask is that in the past I have built networks in which I forgot to implement a bias node, yet upon adding one I saw a negligible difference in performance. Could this have been down to chance, in that the specific dataset did not require a bias? Do I need to initialise the bias with a larger value in large networks? Any other advice would be much appreciated.


2 Answers

8
votes

The bias node/term is there only to ensure the predicted output will be unbiased. If your input has a dynamic range that goes from -1 to +1 and your output is simply a translation of the input by +3, a neural net with a bias term will simply have the bias neuron with a non-zero weight while the others will be zero. If you do not have a bias neuron in that situation, all the activation functions and weights will be optimized to mimic, at best, a simple addition, using sigmoids/tangents and multiplications.

If both your inputs and outputs have the same range, say from -1 to +1, then the bias term will probably not be useful.
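
To make the translation example concrete, here is a minimal numpy sketch (not from the answer; the data and variable names are illustrative): a bias-free linear fit cannot reproduce y = x + 3 on inputs in [-1, +1], while the same fit with a bias column recovers the offset almost exactly.

```python
# Minimal sketch: fitting the shifted target y = x + 3 on inputs in [-1, 1],
# with and without a bias term.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = x + 3.0                           # output is just the input translated by +3

# Without bias: least-squares fit of y ~ w * x
w_no_bias, *_ = np.linalg.lstsq(x, y, rcond=None)
pred_no_bias = x @ w_no_bias

# With bias: append a constant-1 column, whose weight acts as the bias
x_aug = np.hstack([x, np.ones_like(x)])
w_bias, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
pred_bias = x_aug @ w_bias

print("MSE without bias:", np.mean((pred_no_bias - y) ** 2))  # large residual error
print("MSE with bias:   ", np.mean((pred_bias - y) ** 2))     # essentially zero
print("learned bias weight:", w_bias[1])                       # close to 3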

You could have a look at the weight of the bias node in the experiment you mention. Either it is very low, which probably means the inputs and outputs are centered already; or it is significant, and I would bet that the variance of the other weights is reduced, leading to a more stable (and less prone to overfitting) neural net.
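
As a rough way to run that check, here is a hedged PyTorch sketch (assuming the network is built from standard `nn.Linear`/`nn.Conv2d` layers; the function name is made up) that prints each layer's average bias magnitude next to the spread of its other weights. A near-zero bias magnitude would suggest the inputs and outputs were already centred.

```python
# Rough sketch (assumes standard nn.Linear / nn.Conv2d layers): compare each
# layer's bias magnitude with the spread of its other weights.
import torch.nn as nn

def report_bias(model: nn.Module) -> None:
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)) and module.bias is not None:
            bias_mag = module.bias.abs().mean().item()   # average |bias|
            weight_std = module.weight.std().item()      # spread of the other weights
            print(f"{name}: mean |bias| = {bias_mag:.4f}, weight std = {weight_std:.4f}")
```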

3
votes

Bias is equivalent to adding a constant input, like 1, to every layer. The weight on that constant input is then equivalent to your bias. It's really simple to add.
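
A small numpy sketch of that equivalence (illustrative only): a layer with an explicit bias vector computes exactly the same output as a bias-free layer whose input has a constant 1 appended and whose weight matrix has the bias folded in as an extra column.

```python
# Sketch: a layer with an explicit bias b equals a bias-free layer acting on
# an input with a constant 1 appended, with b stored as the extra weight column.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # 3 inputs -> 4 outputs
b = rng.normal(size=(4,))            # explicit bias vector
x = rng.normal(size=(3,))

out_with_bias = W @ x + b

W_aug = np.hstack([W, b[:, None]])   # fold the bias in as an extra weight column
x_aug = np.append(x, 1.0)            # append the constant-1 "bias input"
out_augmented = W_aug @ x_aug

print(np.allclose(out_with_bias, out_augmented))  # True
```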

Theoretically it isn't necessary, since the network can "learn" to create its own bias node on every layer. One of the neurons can set its incoming weights very high so that it always outputs 1, or to 0 so that it always outputs a constant 0.5 (for sigmoid units). This requires at least 2 layers though.
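
As a toy illustration of that last point (a sketch, not part of the answer): a hidden sigmoid unit with all-zero incoming weights outputs a constant 0.5 regardless of the input, and the next layer's weight on that unit can scale it into whatever offset is needed, which is why at least two layers are required.

```python
# Toy sketch of the "learned bias": a zero-weight sigmoid unit outputs a
# constant 0.5, and the next layer's weight on it (6.0 here) turns that into +3.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -0.7, 1.3])       # arbitrary input

w_hidden = np.zeros(3)               # zero incoming weights -> pre-activation is 0
h = sigmoid(w_hidden @ x)            # always 0.5, independent of x

w_out = 6.0                          # output-layer weight scales the constant unit
print(w_out * h)                     # 3.0: behaves exactly like a bias of +3
```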