Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f
such that y=f(x)
, from a set of observations {(x_i,y_i)}
. The following problems arise in practice:
- You know nothing about
f
. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
- The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you will use will determine a family of function h_w(x)
, and each value of w
will represent one function in this family. Note that because there is usually an infinite number of possible w
, the family of functions h_w(x)
are often infinitely large.
The goal of learning will then be to determine which w
is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w
, and thus select a particular model h(x)
.
The problem is, the actual f
function you are trying to approximate may not be part of the family h_w
you decided to pick, and so you are .
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single layer model ANN will not help it learn a non-linear function. Indeed a single layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into Logistic Regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1]
, so that it can be interpreted as a class membership probability. In short, the sigmoid acts a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break eg. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i)
. This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of function you're optimising over consist only of linear functions on their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1]
, so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1
and x2
to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that using a single hidden layer is enough to approximate anything as long as make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require less neurons overall, even though it makes training more complex.