Nonlinear optimization: steepest descent method, Newton's method, Gauss-Newton method, Levenberg-Marquardt method
2022-07-02 10:19:00 【Jason. Li_ 0012】
Nonlinear optimization
For a nonlinear least-squares minimization problem:
$$\min_x F(x) = \frac{1}{2}\bigl\|f(x)\bigr\|_2^2$$
the goal is to find the minimum of half the squared L2 norm of the function $f(x)$; the L2 norm is the square root of the sum of squares of the components.
To minimize the objective function, one could look for the value of $x$ at which its derivative is zero:
$$\frac{dF}{dx} = 0$$
In practice this is usually solved iteratively: starting from an initial value, the variable is updated step by step so that the objective function decreases:
1. Given an initial value $x_0$.
2. For the $k$-th iteration, find an increment $\Delta x_k$ such that $\|f(x_k+\Delta x_k)\|_2^2$ reaches a minimum.
3. If $\Delta x_k$ is small enough, the stopping condition is met and the iteration stops.
4. Otherwise, set $x_{k+1} = x_k + \Delta x_k$ and continue with the next iteration.
Common ways to solve for the increment $\Delta x_k$ include the steepest descent method, Newton's method, the Gauss-Newton method, and the Levenberg-Marquardt method.
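A minimal Python/NumPy sketch of this outer loop, assuming a hypothetical `solve_increment` callback that stands in for any of the methods below:

```python
import numpy as np

def iterate(f, solve_increment, x0, tol=1e-8, max_iter=100):
    """Generic outer loop: repeatedly solve for an increment and update x.

    solve_increment(f, x) is a hypothetical callback standing in for any of
    the methods below (steepest descent, Newton, Gauss-Newton, LM).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        dx = solve_increment(f, x)       # step 2: find the increment
        if np.linalg.norm(dx) < tol:     # step 3: stop if the step is tiny
            break
        x = x + dx                       # step 4: update and start the next round
    return x
```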
Steepest descent method
Consider the $k$-th iteration and take the first-order Taylor expansion of the objective function:
$$F(x_k+\Delta x_k) \approx F(x_k) + J(x_k)^T\Delta x_k$$
The matrix $J(x_k)$ is the first derivative of $F(x)$ with respect to $x$, called the Jacobian matrix.
The increment can then be chosen as:
$$\Delta x^* = \arg\min\Bigl(F(x_k) + J(x_k)^T\Delta x_k\Bigr)$$
Differentiating the right-hand side with respect to $\Delta x_k$ gives the gradient $J(x_k)$; the objective decreases fastest along the negative gradient, so the increment is taken as:
$$\Delta x^* = -J(x_k)$$
Usually a step size $\lambda$ also needs to be specified, so that the update moves a controlled distance along the negative gradient and the objective decreases under the first-order approximation. The steepest descent method is too greedy: its descent path tends to zigzag, which increases the number of iterations.
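A minimal steepest-descent sketch under the formulas above, assuming the gradient $J(x)$ is available as a hypothetical callable `grad` and using a fixed step size `lam` (in practice a line search is usually preferred):

```python
import numpy as np

def steepest_descent(grad, x0, lam=0.01, tol=1e-8, max_iter=10000):
    """Steepest descent: dx = -lam * J(x), i.e. step along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = -lam * grad(x)              # negative gradient scaled by the step size
        if np.linalg.norm(dx) < tol:
            break
        x = x + dx
    return x

# Toy example: F(x) = 0.5 * ||x - [1, 2]||^2, so grad F(x) = x - [1, 2].
x_min = steepest_descent(lambda x: x - np.array([1.0, 2.0]), x0=[0.0, 0.0])
```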
Newton's method
Consider the $k$-th iteration and take the second-order Taylor expansion of the objective function:
$$F(x_k+\Delta x_k) \approx F(x_k) + J(x_k)^T\Delta x_k + \frac{1}{2}\Delta x_k^T H(x_k)\Delta x_k$$
The matrix $H(x_k)$ is the second derivative of $F(x)$ with respect to $x$, called the Hessian matrix. The increment is then chosen as:
$$\Delta x^* = \arg\min\Bigl(F(x_k) + J(x_k)^T\Delta x_k + \frac{1}{2}\Delta x_k^T H(x_k)\Delta x_k\Bigr)$$
Again, setting the derivative of the right-hand side with respect to $\Delta x_k$ to zero gives:
$$H(x_k)\Delta x_k = -J(x_k)$$
Solving this linear system yields the increment. However, Newton's method requires computing the Hessian $H$, which is expensive, and this should be avoided.
Quasi-Newton methods, which approximate the Hessian instead of computing it directly, are commonly used when solving least-squares problems this way.
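A minimal sketch of the Newton update $H(x_k)\Delta x_k = -J(x_k)$, assuming hypothetical callables `grad` and `hess` that return the gradient and Hessian:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve H(x) dx = -J(x) at each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = np.linalg.solve(hess(x), -grad(x))   # Newton increment
        if np.linalg.norm(dx) < tol:
            break
        x = x + dx
    return x
```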
Gauss-Newton method
For the objective function $F(x)$:
$$\min_x F(x) = \frac{1}{2}\bigl\|f(x)\bigr\|_2^2$$
the Gauss-Newton method Taylor-expands $f(x)$ instead of the objective $F(x)$, which improves efficiency:
$$f(x_k+\Delta x_k) \approx f(x_k) + J(x_k)^T\Delta x_k$$
The increment is then chosen so that $\|f(x_k+\Delta x_k)\|^2$ is minimized:
$$\Delta x^* = \arg\min_{\Delta x_k}\frac{1}{2}\bigl\|f(x_k)+J(x_k)^T\Delta x_k\bigr\|^2$$
Expanding the expression:
$$\begin{aligned}\frac{1}{2}\bigl\|f(x_k)+J(x_k)^T\Delta x_k\bigr\|^2 &= \frac{1}{2}\Bigl(f(x_k)+J(x_k)^T\Delta x_k\Bigr)^T\Bigl(f(x_k)+J(x_k)^T\Delta x_k\Bigr)\\ &= \frac{1}{2}\Bigl(\bigl\|f(x_k)\bigr\|_2^2 + 2 f(x_k)^T J(x_k)^T\Delta x_k + \Delta x_k^T J(x_k)J(x_k)^T\Delta x_k\Bigr)\end{aligned}$$
To minimize this, set the derivative with respect to $\Delta x_k$ to zero:
$$\begin{aligned}J(x_k)f(x_k) + J(x_k)J^T(x_k)\Delta x_k &= 0\\ J(x_k)J^T(x_k)\Delta x_k &= -J(x_k)f(x_k)\end{aligned}$$
Writing $H(x_k)=J(x_k)J^T(x_k)$ and $g(x_k)=-J(x_k)f(x_k)$, this becomes a linear system in $\Delta x_k$:
$$H(x_k)\Delta x_k = g(x_k)$$
This is called the incremental equation, also known as the Gauss-Newton equation or the normal equation. The Gauss-Newton method uses $JJ^T$ to approximate the second-order Hessian matrix $H$ of Newton's method, thereby avoiding a large amount of computation. The algorithm proceeds as follows:
1. Given an initial value $x_0$.
2. For the $k$-th iteration, compute the Jacobian matrix $J(x_k)$ and the error $f(x_k)$.
3. Solve the incremental equation $H\Delta x_k = g$.
4. If $\Delta x_k$ is small enough, the stopping condition is met and the iteration stops.
5. Otherwise, set $x_{k+1} = x_k + \Delta x_k$ and continue with the next iteration.
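A minimal Gauss-Newton sketch following the steps above, assuming hypothetical callables `f(x)` (residual vector of shape `(m,)`) and `jac(x)` returning $J(x)$ of shape `(n, m)`, matching the $H = JJ^T$ convention used here (the transpose of the more common $m \times n$ layout):

```python
import numpy as np

def gauss_newton(f, jac, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton iteration.

    f(x)   : residual vector, shape (m,)
    jac(x) : J(x), shape (n, m), so that H = J J^T and g = -J f as above
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac(x)                       # step 2: Jacobian and error
        r = f(x)
        H = J @ J.T                      # approximate Hessian
        g = -J @ r
        dx = np.linalg.solve(H, g)       # step 3: solve H dx = g (fails if H is singular)
        if np.linalg.norm(dx) < tol:     # step 4: stop if the step is tiny
            break
        x = x + dx                       # step 5: update and start the next round
    return x
```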
Solving the incremental equation requires $H$ to be invertible, but $JJ^T$ is only positive semi-definite; it may be singular or ill-conditioned, in which case the algorithm fails to converge. The Levenberg-Marquardt method corrects this problem to a certain extent.
Levenberg-Marquardt method
The Levenberg-Marquardt method converges more slowly than the Gauss-Newton method but is more robust. It is also known as the damped Newton method.
The Gauss-Newton method relies on a Taylor-expansion linearization, which is only a good approximation near the expansion point. A bounded region is therefore imposed on $\Delta x_k$, called the trust region: the approximation is considered valid inside this region and invalid outside it.
The difference between the approximate model and the actual function is used to determine the size of the trust region:
$$\rho = \frac{f(x_k+\Delta x_k)-f(x_k)}{J(x_k)^T\Delta x_k}$$
The indicator $\rho$ describes how good the approximation is: the numerator is the actual decrease of the function, and the denominator is the decrease predicted by the approximate model.
If $\rho$ is close to 1, the approximation is good and the trust region can be enlarged; if $\rho$ is small, the approximation is poor and the trust region should be shrunk. Based on this, the Levenberg-Marquardt algorithm is built as follows:
1. Given an initial value $x_0$ and an initial trust-region radius $\mu$.
2. For the $k$-th iteration, add the trust-region constraint to the Gauss-Newton problem:
$$\min_{\Delta x_k}\frac{1}{2}\bigl\|f(x_k)+J(x_k)^T\Delta x_k\bigr\|^2,\quad \text{s.t.}\quad \bigl\|D\Delta x_k\bigr\|^2\leq\mu,$$
where $\mu$ is the trust-region radius and $D$ is a coefficient matrix.
3. Compute the indicator $\rho = \dfrac{f(x_k+\Delta x_k)-f(x_k)}{J(x_k)^T\Delta x_k}$.
4. If $\rho > 0.75$, set $\mu = 2\mu$.
5. If $\rho < 0.25$, set $\mu = 0.5\mu$.
6. If $\rho$ is above a certain threshold, the approximation is considered valid and $x_{k+1} = x_k + \Delta x_k$.
7. Check for convergence; if converged, stop, otherwise return to step 2.
Here the increment is confined to a ball of radius $\mu$ ($\|\Delta x_k\|^2\leq\mu$); with the coefficient matrix $D$, the constraint region becomes an ellipsoid ($\|D\Delta x_k\|^2\leq\mu$).
Levenberg takes $D=I$, i.e. the increment is constrained to a ball. Marquardt instead takes $D$ to be a non-negative diagonal matrix (in practice, usually the square root of the diagonal elements of $J^TJ$), so that dimensions with small gradients are given a larger constraint range.
Constructing the Lagrangian moves the constraint into the objective function:
$$L(\Delta x_k, \lambda) = \frac{1}{2}\bigl\|f(x_k)+J(x_k)^T\Delta x_k\bigr\|^2 + \frac{\lambda}{2}\Bigl(\bigl\|D\Delta x_k\bigr\|^2-\mu\Bigr)$$
where $\lambda$ is the Lagrange multiplier. Setting the derivative with respect to $\Delta x_k$ to zero gives:
$$(H + \lambda D^TD)\Delta x_k = g$$
When $\lambda$ is small, $H$ dominates, the quadratic approximation is good in this range, and the Levenberg-Marquardt method behaves like the Gauss-Newton method. When $\lambda$ is large, $\lambda D^TD$ dominates, the quadratic approximation is poor in this range, and the method behaves like the steepest descent method. To a certain extent, the Levenberg-Marquardt method avoids a singular or ill-conditioned coefficient matrix in the linear system and provides a more stable and more accurate increment $\Delta x_k$.
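A compact Levenberg-Marquardt sketch based on the damped equation $(H+\lambda D^TD)\Delta x_k = g$, using the Marquardt scaling $D^TD = \operatorname{diag}(JJ^T)$ and a simple $\lambda$ update in place of the explicit trust-region radius and $\rho$ thresholds above; `f` and `jac` follow the same hypothetical conventions as in the Gauss-Newton sketch:

```python
import numpy as np

def levenberg_marquardt(f, jac, x0, lam=1e-3, tol=1e-10, max_iter=100):
    """Levenberg-Marquardt with Marquardt scaling D^T D = diag(J J^T).

    The damping factor lam is decreased when a step lowers the cost (the
    model is trusted, behaviour close to Gauss-Newton) and increased when it
    does not (behaviour close to steepest descent).
    """
    x = np.asarray(x0, dtype=float)
    cost = 0.5 * np.sum(f(x) ** 2)
    for _ in range(max_iter):
        J = jac(x)
        r = f(x)
        H = J @ J.T
        g = -J @ r
        DtD = np.diag(np.diag(H))                 # Marquardt scaling matrix
        dx = np.linalg.solve(H + lam * DtD, g)    # damped incremental equation
        if np.linalg.norm(dx) < tol:
            break
        new_cost = 0.5 * np.sum(f(x + dx) ** 2)
        if new_cost < cost:                       # step accepted: relax the damping
            x, cost = x + dx, new_cost
            lam *= 0.5
        else:                                     # step rejected: increase the damping
            lam *= 2.0
    return x
```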