Before sketching a possible, simple implementation of a quasi-Newton optimization routine in CUDA, a few words on how a quasi-Newton optimizer works.
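Consider a function $f$ of $N$ real variables $\mathbf{x}$ and make a second-order expansion around a certain point $\mathbf{x}_i$:

$$f(\mathbf{x}) \approx f(\mathbf{x}_i) + (\mathbf{x}-\mathbf{x}_i)\cdot\nabla f(\mathbf{x}_i) + \frac{1}{2}(\mathbf{x}-\mathbf{x}_i)\cdot\mathbf{A}\cdot(\mathbf{x}-\mathbf{x}_i)$$

where $\mathbf{A}$ is the Hessian matrix.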
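To find a minimum starting from a point $\mathbf{x}_i$, Newton's method consists of forcing

$$\nabla f(\mathbf{x}) = \nabla f(\mathbf{x}_i) + \mathbf{A}\cdot(\mathbf{x}-\mathbf{x}_i) = \mathbf{0}$$

which entails

$$\mathbf{x}-\mathbf{x}_i = -\mathbf{A}^{-1}\cdot\nabla f(\mathbf{x}_i)$$

and which, in turn, requires knowing the inverse of the Hessian. Furthermore, to ensure the function decreases, the update direction $\mathbf{x}-\mathbf{x}_i$ should be such that

$$\nabla f(\mathbf{x}_i)\cdot(\mathbf{x}-\mathbf{x}_i) < 0$$

which implies that

$$(\mathbf{x}-\mathbf{x}_i)\cdot\mathbf{A}\cdot(\mathbf{x}-\mathbf{x}_i) > 0$$

According to the above inequality, the Hessian matrix should be positive definite. Unfortunately, the Hessian matrix is not necessarily positive definite, especially far from a minimum of $f$, so using the inverse of the Hessian, besides being computationally expensive, can also be harmful, pushing the procedure even farther from the minimum towards regions of increasing values of $f$.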
Generally speaking, it is more convenient to use a quasi-Newton method, i.e., to work with an approximation of the inverse of the Hessian that stays positive definite and is updated iteration after iteration, converging to the inverse of the Hessian itself.
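To see concretely why an indefinite Hessian is a problem, here is a minimal C++ sketch for the two-variable case. The names `newton_step` and `slope` are made up for this illustration. It computes the Newton step from a 2x2 Hessian and a gradient, and then the directional derivative along that step, which must be negative for the function to decrease.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Newton step d = -A^{-1} g for a 2x2 Hessian A and gradient g
// (hypothetical helper, explicit 2x2 inverse).
Vec2 newton_step(const Mat2& A, const Vec2& g) {
    const double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    assert(std::fabs(det) > 1e-14 && "Hessian must be invertible");
    // Apply inverse(A) = 1/det * [[A11, -A01], [-A10, A00]] to g.
    const double ix = ( A[1][1] * g[0] - A[0][1] * g[1]) / det;
    const double iy = (-A[1][0] * g[0] + A[0][0] * g[1]) / det;
    return {-ix, -iy};
}

// Directional derivative g . d; a negative value means d is a descent direction.
double slope(const Vec2& g, const Vec2& d) {
    return g[0] * d[0] + g[1] * d[1];
}
```

For the bowl f = x² + y² at the point (1, 2), the gradient is (2, 4), the Hessian is diag(2, 2), and the Newton step has a negative slope. For the saddle f = x² − y² at the same point, the gradient is (2, −4), the Hessian is diag(2, −2), and the slope along the Newton step is positive, so the step heads uphill.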
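A rough justification of a quasi-Newton method is the following. Consider

$$\mathbf{x}-\mathbf{x}_i = -\mathbf{A}^{-1}\cdot\nabla f(\mathbf{x}_i)$$

and

$$\mathbf{x}-\mathbf{x}_{i+1} = -\mathbf{A}^{-1}\cdot\nabla f(\mathbf{x}_{i+1})$$

Subtracting the two equations, we have the update rule for the Newton procedure

$$\mathbf{x}_{i+1}-\mathbf{x}_i = \mathbf{A}^{-1}\cdot\left(\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)\right)$$

The updating rule for the quasi-Newton procedure is the following

$$\mathbf{x}_{i+1}-\mathbf{x}_i = \mathbf{H}_{i+1}\cdot\left(\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)\right)$$

where $\mathbf{H}_{i+1}$ is the mentioned matrix approximating the inverse of the Hessian, updated step after step.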
There are several rules for updating $\mathbf{H}_{i+1}$, and I'm not going into the details of this point. A very common one is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula, but in many cases the Polak-Ribière scheme, which is a conjugate gradient update and stores no matrix at all, is effective enough.
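As an illustration of one such rule, here is a minimal host-side C++ sketch of the standard BFGS update of the inverse-Hessian approximation. It is not taken from any particular library, and the names `dot`, `matvec` and `bfgs_update` are made up for this example. With s the difference between successive points and y the difference between successive gradients, the update is H ← (I − ρ s yᵀ) H (I − ρ y sᵀ) + ρ s sᵀ, where ρ = 1/(y·s). The sketch assumes H is symmetric and stored as a dense row-major array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// y = H * x for a dense row-major n x n matrix H.
Vec matvec(const Vec& H, const Vec& x) {
    const std::size_t n = x.size();
    assert(H.size() == n * n);
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) y[i] += H[i * n + j] * x[j];
    return y;
}

// In-place BFGS update of the inverse-Hessian approximation H (assumed symmetric).
// s = x_{i+1} - x_i, y = grad f(x_{i+1}) - grad f(x_i).
// Returns false and leaves H untouched when the curvature condition y.s > 0 fails,
// since the update would then not preserve positive definiteness.
bool bfgs_update(Vec& H, const Vec& s, const Vec& y) {
    const std::size_t n = s.size();
    assert(y.size() == n && H.size() == n * n);
    const double ys = dot(y, s);
    if (ys <= 0.0) return false;
    const double rho = 1.0 / ys;
    const Vec Hy = matvec(H, y);  // computed before H is modified
    const double c = rho * rho * dot(y, Hy) + rho;
    // Expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T.
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            H[j * n + k] += -rho * (s[j] * Hy[k] + Hy[j] * s[k]) + c * s[j] * s[k];
    return true;
}
```

After a successful update, multiplying the new H by y gives back s, which is exactly the quasi-Newton updating rule. In a GPU version the same dot products, matrix-vector product and rank-two update would map onto Thrust or cuBLAS calls.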
The CUDA implementation can follow the same steps as the classical Numerical Recipes approach, but taking into account that:
1) Vector and matrix operations can be effectively accomplished by CUDA Thrust or cuBLAS;
2) The control logic can be performed by the CPU;
3) Line minimization, involving root bracketing and root finding, can be performed on the CPU, accelerating only the cost functional and gradient evaluations on the GPU.
With the above scheme, unknowns, gradients and Hessian can be kept on the device without any need to move them back and forth between host and device.
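The division of labour described above can be sketched in plain C++ as follows. This is only a host-side illustration under several assumptions:

- The callables `Cost` and `Grad` are hypothetical stand-ins for the GPU cost functional and gradient evaluations.
- The vectors are ordinary `std::vector`s rather than device arrays such as `thrust::device_vector`.
- The line minimization uses a simple doubling bracket followed by a golden-section search, swapped in here for the Numerical Recipes bracketing and Brent routines.
- The search direction is updated with the Polak-Ribière formula, clipped at zero, with a restart whenever the direction is not a descent direction.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Cost = std::function<double(const Vec&)>;  // stand-in for a GPU cost evaluation
using Grad = std::function<Vec(const Vec&)>;     // stand-in for a GPU gradient evaluation

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Returns x + t * d.
Vec step(const Vec& x, double t, const Vec& d) {
    assert(x.size() == d.size());
    Vec out(x);
    for (std::size_t i = 0; i < x.size(); ++i) out[i] += t * d[i];
    return out;
}

// Host-side minimization of phi(t), t >= 0, assuming phi is unimodal:
// bracket by doubling, then shrink the bracket by golden-section search.
double line_minimize(const std::function<double(double)>& phi) {
    double hi = 1.0;
    while (phi(2.0 * hi) < phi(hi) && hi < 1e6) hi *= 2.0;
    const double r = (std::sqrt(5.0) - 1.0) / 2.0;
    double lo = 0.0, up = 2.0 * hi;
    double t1 = up - r * (up - lo), t2 = lo + r * (up - lo);
    double f1 = phi(t1), f2 = phi(t2);
    for (int k = 0; k < 100; ++k) {
        if (f1 < f2) { up = t2; t2 = t1; f2 = f1; t1 = up - r * (up - lo); f1 = phi(t1); }
        else         { lo = t1; t1 = t2; f1 = f2; t2 = lo + r * (up - lo); f2 = phi(t2); }
    }
    return 0.5 * (lo + up);
}

// CPU control loop: only f and grad would run on the GPU in a CUDA version.
Vec minimize(const Cost& f, const Grad& grad, Vec x, int max_iter = 100, double tol = 1e-8) {
    Vec g = grad(x);
    Vec d(g.size());
    for (std::size_t i = 0; i < g.size(); ++i) d[i] = -g[i];
    for (int it = 0; it < max_iter && std::sqrt(dot(g, g)) > tol; ++it) {
        if (dot(g, d) >= 0.0)  // not a descent direction: restart from steepest descent
            for (std::size_t i = 0; i < g.size(); ++i) d[i] = -g[i];
        const double t = line_minimize([&](double s) { return f(step(x, s, d)); });
        const Vec x_new = step(x, t, d);
        if (f(x_new) >= f(x)) break;  // no further progress possible
        const Vec g_new = grad(x_new);
        // Polak-Ribiere coefficient, clipped at zero.
        double num = 0.0;
        for (std::size_t i = 0; i < g.size(); ++i) num += g_new[i] * (g_new[i] - g[i]);
        const double beta = std::max(0.0, num / dot(g, g));
        for (std::size_t i = 0; i < g.size(); ++i) d[i] = -g_new[i] + beta * d[i];
        x = x_new;
        g = g_new;
    }
    return x;
}
```

Calling `minimize` with the cost (x−1)² + 10(y+2)² and its gradient, starting from the origin, should converge to approximately (1, −2). The point of the structure is that the loop and the one-dimensional search are cheap scalar logic, while every call to `f` and `grad` is where a CUDA kernel launch would go.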
Note also that approaches attempting to parallelize the line minimization itself have been proposed in the literature; see
Y. Fei, G. Rong, B. Wang, W. Wang, "Parallel L-BFGS-B algorithm on GPU", Computers & Graphics, vol. 40, 2014, pp. 1-9.
At this GitHub page, a full CUDA implementation is available, generalizing the Numerical Recipes approach employing linmin, mkbrak and dbrent to the GPU parallel case. That approach implements the Polak-Ribière scheme, but can be easily generalized to quasi-Newton schemes.