I posted this question yesterday asking whether my neural network (which I'm training via backpropagation using stochastic gradient descent) was getting stuck in a local minimum. The following papers discuss the problem of local minima in an XOR neural network. The first one says that there is no local minimum problem, whereas the next paper (written a year later) says that there is a local minimum in a 2-3-1 XOR network (as an aside, I'm using a 3-3-1 network, i.e., with bias units on the input and hidden layers). Both of these are abstracts (I don't have access to the full papers, so I'm unable to read them):
- XOR has no local minima: A case study in neural network error surface analysis. by Hamey LG. Department of Computing, Macquarie University, Sydney, Australia
- A local minimum for the 2-3-1 XOR network. by Sprinkhuizen-Kuyper IG, Boers EW.
There is also another paper [PDF] that says there is no local minimum for the simplest XOR network, but it doesn't seem to be talking about a 2-3-1 network.
Now onto my actual question: I couldn't find anything that discusses how the choice of activation function and initial weights affects whether the neural network will get stuck in a local minimum. The reason I'm asking is that in my code I have tried both the standard sigmoid activation function and the hyperbolic tangent activation function. I noticed that with the former I get stuck only around 20% of the time, whereas with the latter I tend to get stuck far more often. I also randomize the weights whenever I first initialize the network, so I'm wondering whether certain sets of random initial weights are more prone to getting the network "stuck".
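For context, the comparison I'm running looks roughly like this (a simplified sketch, not my actual code; the learning rate, epoch count, weight range and the "stuck" threshold below are just illustrative choices):

```python
# Sketch of the experiment: a 3-3-1 XOR network (bias units on the input and
# hidden layers) trained with plain stochastic gradient descent, repeated over
# many random initializations, counting how often training fails to converge.
import numpy as np

def train_xor(activation, d_activation, lr=0.5, epochs=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([0, 1, 1, 0], dtype=float)
    # Random initial weights in [-1, 1]:
    # hidden layer (2 inputs + bias -> 3 units), output layer (3 hidden + bias -> 1 unit)
    W1 = rng.uniform(-1, 1, size=(3, 3))
    W2 = rng.uniform(-1, 1, size=(4,))
    for _ in range(epochs):
        for i in rng.permutation(4):            # stochastic: one pattern at a time
            x = np.append(X[i], 1.0)            # inputs + bias
            h_in = W1 @ x
            h = np.append(activation(h_in), 1.0)  # hidden activations + bias
            y = activation(W2 @ h)
            # Backpropagate the squared error 0.5 * (y - t)^2
            delta_out = (y - T[i]) * d_activation(W2 @ h)
            delta_hid = delta_out * W2[:3] * d_activation(h_in)
            W2 -= lr * delta_out * h
            W1 -= lr * np.outer(delta_hid, x)
    # Final mean squared error over the four XOR patterns
    err = 0.0
    for i in range(4):
        x = np.append(X[i], 1.0)
        h = np.append(activation(W1 @ x), 1.0)
        err += (activation(W2 @ h) - T[i]) ** 2
    return err / 4

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
tanh, d_tanh = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
for name, f, df in [("sigmoid", sigmoid, d_sigmoid), ("tanh", tanh, d_tanh)]:
    stuck = sum(train_xor(f, df, rng=rng) > 0.05 for _ in range(50))
    print(f"{name}: stuck {stuck}/50 runs")
```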
As far as the activation function is concerned, since the error is ultimately a function of the output produced by the activation, my intuition is that the choice does have an effect (i.e., it changes the shape of the error surface). However, this is based purely on intuition, and I'd prefer a concrete answer on both points: the initial weights and the choice of activation function.
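To make that intuition slightly more concrete: the backpropagated error at each layer is scaled by the derivative of the activation function, and the two functions have quite different derivative profiles over the same net inputs (a quick illustrative check below; the sample points are arbitrary):

```python
# Compare the derivatives that scale the backpropagated error:
# sigmoid'(z) peaks at 0.25, tanh'(z) peaks at 1.0, and both saturate for large |z|.
import numpy as np

z = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
sigmoid = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sigmoid * (1.0 - sigmoid)
d_tanh = 1.0 - np.tanh(z) ** 2

print("z        :", z)
print("sigmoid' :", np.round(d_sigmoid, 4))
print("tanh'    :", np.round(d_tanh, 4))
```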