2
votes

I have a pretty complicated model with many parameters that I need to solve. Even though the model is complicated, the functional form at each step is not irregular.

I'm seeing some strange behavior with start values. If I start at the standard start values (all 0s), the solver converges with "Locally optimal solution found", 0 CG iterations, in 673s.

If I start at values that I know are close to the solutions, the solver converges with "Primal feasible solution estimate cannot be improved.", 493 CG iterations, in 1718s.

Note that in both cases, the final values are the same (or very similar).

2 questions:

  • What exactly is the number of conjugate gradient iterations, as in, when does the solver need to compute conjugate gradient steps? In one case I see 0 CG iterations, and in the other case 493 CG iterations. What does that imply? (Note that I do know what the CG method is; I'm just not sure why there is such a huge difference here, with 0 in one case.)
  • What are all the possible explanations that 'better' initial values can slow optimization convergence significantly?

3 Answers

5
votes

From your first question, we learn that you're employing a "smart" solver, i.e. one that dynamically adjusts the algorithm to converge as well as possible. The conjugate gradient method is a good way to cover "long-range" distance toward the optimum, but it is slow to converge when you're close to a shallow optimum.

As with all "smart" code, there are situations where the heuristic fails, and you've encountered one. I assume that your optimum is rather shallow, so that the objective function (i.e. the actual criterion you're trying to optimize) varies little if your parameters change a bit. There is then no way for the solver to know that the parameters are already very close to the optimum; for all it knows, it could be very far from the solution in an area where the objective function is simply flat. After some initial tests, it thus defaults to the conjugate gradient method, which is a slow but safe way to approach the optimum. However, since after a lot of searching it doesn't actually get very far, all it can tell you is: if you were lucky, you started close to the optimum; if you were unlucky, your solution is far, far away from the optimal solution.
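To make the CG-iteration count concrete: Newton-type solvers compute each step by solving a linear system involving the (approximate) Hessian. Zero CG iterations typically means the step came from a direct factorization; a nonzero count means an iterative CG solve was used, and the count reflects how hard that system was. A toy numerical sketch (NumPy/SciPy, not the asker's solver; `H` is a made-up stand-in for a Hessian):

```python
import numpy as np
from scipy.sparse.linalg import cg

# Toy Newton step: solve H @ step = -g two ways.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)   # symmetric positive definite "Hessian"
g = rng.standard_normal(n)    # "gradient"

# Direct factorization (analogous to a report of 0 CG iterations):
step_direct = np.linalg.solve(H, -g)

# Iterative CG solve, counting iterations via a callback:
iters = []
step_cg, info = cg(H, -g, callback=lambda xk: iters.append(1))
print(info == 0, len(iters))
print(np.allclose(step_direct, step_cg, atol=1e-4))
```

Both routes reach the same step; the difference is purely in how the linear algebra is done, which is why the final solutions in the question agree even though the iteration counts differ so much.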

If you know that your initial guess will be pretty good, you may thus want to check whether your solver allows specifying which algorithms should/should not be used.
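As an illustrative sketch of pinning the algorithm rather than letting a heuristic choose (using SciPy here, not Knitro, whose option names differ; the start value is made up):

```python
from scipy.optimize import minimize, rosen, rosen_der

# Start close to the Rosenbrock optimum at (1, 1) and compare a fixed
# CG algorithm against a fixed quasi-Newton (BFGS) algorithm.
x0 = [0.9, 0.81]

res_cg = minimize(rosen, x0, jac=rosen_der, method="CG")
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print(res_cg.nit, res_bfgs.nit)  # iteration counts of the two methods
```

With a good initial guess, a quasi-Newton method is often the better fixed choice; the point is simply that the choice can be taken away from the solver's heuristic.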

5
votes

Conjugate gradient is a gradient-type optimization algorithm (a refinement of steepest descent), and gradient methods can be prone to slow convergence in some cases, even if you are close to the optimum.

On Wikipedia, you can find figures that illustrate this behavior; I adapted one of them here:

[Figure: contour plot illustrating optimization convergence problems]

What you see are the iso-cost (or iso-objective) lines on the contour plot. Let's imagine we start from point 1, which is quite a good start value. The red lines show the path taken to reach the optimum. We see that it zig-zags towards the optimum, which takes a lot of function evaluations and therefore time.

If we compare that to the performance when we select point A as the starting value, we get faster convergence (or at least, that's what I expect in this case). Let's assume it takes just one iteration in that case.

Now take a look at point 5: it is clearly close to the optimum, but it takes a lot of iterations to get there. As you are approaching through a narrow valley, the algorithm will jump from one side to the other, making only very small progress towards the optimum on the way. When you approach from the wider side of the valley, the gradient points more directly towards the optimum, which gives faster convergence.
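The narrow-valley zig-zag is easy to reproduce with a few lines of exact-line-search steepest descent on an ill-conditioned quadratic (a toy sketch, not the asker's model; the matrix and start points are made up):

```python
import numpy as np

# f(x) = 0.5 * x^T H x with H = diag(1, 25): a long, narrow valley
# along the first axis.
H = np.diag([1.0, 25.0])

def steepest_descent(x, tol=1e-8, max_iter=10_000):
    """Gradient descent with the exact line-search step for a quadratic."""
    for k in range(max_iter):
        g = H @ x                        # gradient of 0.5 * x^T H x
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = (g @ g) / (g @ (H @ g))  # exact minimizing step length
        x = x - alpha * g
    return x, max_iter

# Off-axis start: the iterates zig-zag across the valley walls.
_, iters_valley = steepest_descent(np.array([5.0, 1.0]))
# Start on the wide axis: the first step goes straight to the optimum.
_, iters_axis = steepest_descent(np.array([5.0, 0.0]))
print(iters_valley, iters_axis)
```

Both starts are the same distance along the valley, yet the off-axis one needs orders of magnitude more iterations, which is exactly the point-5-versus-point-A contrast described above.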

In your case, it might be that your initial value is somewhat like point 5 above, while the generic starting value is comparable to point A. That is under the assumption that your starting value converges to the true value, which might not be the case: if your starting value is close but there is a peak between it and the global optimum, you will not converge to the right value, as in the following figure.

[Figure: contour plot with a peak separating the starting point from the global optimum]
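A one-dimensional toy version of that trap (the function and start values are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# A double-well objective: global minimum near x = -1, local minimum
# near x = +1, with a peak in between.
def f(x):
    x = x[0]
    return (x**2 - 1)**2 + 0.3 * x

res_bad = minimize(f, [0.8])    # starts on the wrong side of the peak
res_good = minimize(f, [-0.5])  # starts in the global basin
print(res_bad.x, res_good.x)    # two different "solutions"
```

The start at 0.8 looks close to a solution but is separated from the global optimum by the peak, so a local solver happily converges to the wrong well.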

When Knitro switches to CG or to one of its other algorithms is something that should either be mentioned in the documentation or is known only to the developers of Knitro.

-1
votes

Well, this is because the start values are just an approximation of the solution of the problem. Since an iterative solver tries to converge towards the final values, the nearer the start values are to the answer, the fewer iterations you should need to converge. The convergence threshold also matters.

This is the classic informed vs. uninformed problem: if you start with arbitrary values, the result is harder to find than if you start with well-informed values.