2 votes

I have a simple cost function, which I want to optimize using scipy.optimize.minimize function.

opt_solution = scipy.optimize.minimize(costFunction, theta, args = (training_data,), method = 'L-BFGS-B', jac = True, options = {'maxiter': 100})

where costFunction is the function to be optimized and theta holds the parameters to be optimized. Inside costFunction I print the current value of the cost function. But the maxiter parameter seems to have no effect: whether I set it to 10 or to 100000, the running time is the same. I was also expecting the cost value to be printed maxiter times, so it looks to me as if maxiter has no effect. What might be the problem? The cost function is

def costFunction(self, theta, input):

    """ Extract weights and biases from 'theta' input """

    W1 = theta[self.limit0 : self.limit1].reshape(self.hidden_size, self.visible_size)
    W2 = theta[self.limit1 : self.limit2].reshape(self.visible_size, self.hidden_size)
    b1 = theta[self.limit2 : self.limit3].reshape(self.hidden_size, 1)
    b2 = theta[self.limit3 : self.limit4].reshape(self.visible_size, 1)

    """ Compute output layers by performing a feedforward pass
        Computation is done for all the training inputs simultaneously """

    hidden_layer = self.sigmoid(numpy.dot(W1, input) + b1)
    output_layer = self.sigmoid(numpy.dot(W2, hidden_layer) + b2)

    """ Compute intermediate difference values using Backpropagation algorithm """

    diff = output_layer - input
    sum_of_squares_error = 0.5 * numpy.sum(numpy.multiply(diff, diff)) / input.shape[1]
    weight_decay         = 0.5 * self.lamda * (numpy.sum(numpy.multiply(W1, W1)) + numpy.sum(numpy.multiply(W2, W2)))
    cost                 = sum_of_squares_error + weight_decay

    # Deltas (error terms) for the sigmoid output and hidden layers,
    # used by the gradient computations below
    del_out = numpy.multiply(diff, numpy.multiply(output_layer, 1 - output_layer))
    del_hid = numpy.multiply(numpy.dot(numpy.transpose(W2), del_out),
                             numpy.multiply(hidden_layer, 1 - hidden_layer))

    """ Compute the gradient values by averaging partial derivatives
        Partial derivatives are averaged over all training examples """

    W1_grad = numpy.dot(del_hid, numpy.transpose(input))
    W2_grad = numpy.dot(del_out, numpy.transpose(hidden_layer))
    b1_grad = numpy.sum(del_hid, axis = 1)
    b2_grad = numpy.sum(del_out, axis = 1)

    W1_grad = W1_grad / input.shape[1] + self.lamda * W1
    W2_grad = W2_grad / input.shape[1] + self.lamda * W2
    b1_grad = b1_grad / input.shape[1]
    b2_grad = b2_grad / input.shape[1]

    """ Transform numpy matrices into arrays """

    W1_grad = numpy.array(W1_grad)
    W2_grad = numpy.array(W2_grad)
    b1_grad = numpy.array(b1_grad)
    b2_grad = numpy.array(b2_grad)

    """ Unroll the gradient values and return as 'theta' gradient """

    theta_grad = numpy.concatenate((W1_grad.flatten(), W2_grad.flatten(),
                                    b1_grad.flatten(), b2_grad.flatten()))
    # Update counter value
    self.counter += 1                                
    print "Index ", self.counter, "cost ", cost
    return [cost, theta_grad]
What's your cost function? – Ken Wei
The cost function is a simple mean squared error, like (x - x')^2. – Shyamkkhadka

1 Answer

1 vote

maxiter gives the maximum number of iterations that scipy will try before giving up on improving the solution. But it may very well be satisfied with a solution and stop earlier.

If you look at the docs for minimize when using the 'L-BFGS-B' method, you will notice the tolerance options ftol and gtol (the older fmin_l_bfgs_b interface exposes the related factr parameter instead), which can also cause the iteration to stop as soon as the improvement per step becomes small enough.
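For example, as a small sketch (the tolerance values below are only illustrative), tightening those stopping criteria makes maxiter the binding limit:

opt_solution = scipy.optimize.minimize(costFunction, theta, args = (training_data,),
                                       method = 'L-BFGS-B', jac = True,
                                       options = {'maxiter': 100, 'ftol': 1e-12, 'gtol': 1e-10})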

In simple cases like yours, especially when the cost function also provides the gradient (as indicated by jac=True in your call), convergence typically happens within the first few iterations, and hence well before the maxiter limit is reached. Also note that L-BFGS-B evaluates the cost function several times per iteration during its line search, so the print inside costFunction counts function evaluations, not iterations, and will not match maxiter either.
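To see what actually happened, you can inspect the OptimizeResult returned by minimize, or pass a callback that is invoked once per iteration. A minimal sketch, reusing the names from your call:

print "iterations       ", opt_solution.nit      # iterations actually performed
print "cost evaluations ", opt_solution.nfev     # calls to costFunction, including line-search steps
print "stop reason      ", opt_solution.message
print "final cost       ", opt_solution.fun

# Count iterations yourself: minimize calls the callback once per iteration,
# unlike the print inside costFunction, which fires on every evaluation
iteration = [0]
def count_iterations(xk):
    iteration[0] += 1

opt_solution = scipy.optimize.minimize(costFunction, theta, args = (training_data,),
                                       method = 'L-BFGS-B', jac = True,
                                       callback = count_iterations,
                                       options = {'maxiter': 100})
print "iterations via callback", iteration[0]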