Statistical learning notes: gradient descent, the least squares method, and Newton's method
2022-06-12 07:30:00 【Turned_ MZ】
Gradient descent
Gradient descent is widely used in machine learning. Whether in linear regression or logistic regression, its main purpose is to find the minimum of the objective function through iteration, or to converge toward that minimum.

Gradient descent has an intuitive explanation: it can be understood as walking down a mountain. At each step you choose the steepest downhill direction, so you reach the lowest point as quickly as possible (for a loss function, this corresponds to minimizing the loss).
The mathematical formula:

Θ¹ = Θ⁰ − α·∇J(Θ⁰)

(the gradient is evaluated at the current point Θ⁰, and the step is taken in the direction opposite to it)
The meaning of this formula: J is a function of Θ; our current position is the point Θ⁰, and from this point we want to walk to the minimum of J, i.e. the bottom of the mountain. We first determine the direction to move, which is the opposite of the gradient direction, and then take a step of length α; after this step we arrive at the point Θ¹.
The algorithm procedure: start from an initial point Θ⁰, repeatedly take a step of size α in the direction of the negative gradient, and stop when the decrease in the loss (or the change in Θ) falls below a tolerance.
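As a minimal Python sketch of this procedure (an illustration under assumed conditions, not the original author's code), the update rule above is applied to a squared loss with hypothetical data and a fixed step size α:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, tol=1e-8, max_iter=1000):
    """Minimize the squared loss J(theta) = ||X @ theta - y||^2 / (2m) by gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)              # initial point Theta^0
    prev_loss = np.inf
    for _ in range(max_iter):
        residual = X @ theta - y     # theoretical values minus observations
        grad = X.T @ residual / m    # gradient of J at the current Theta
        theta -= alpha * grad        # step of length alpha in the negative gradient direction
        loss = (residual ** 2).sum() / (2 * m)
        if abs(prev_loss - loss) < tol:   # stop when the loss no longer decreases noticeably
            break
        prev_loss = loss
    return theta

# Hypothetical data: y = 2*x + noise, with a bias column added to X.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2 * x + 0.1 * rng.normal(size=50)
print(gradient_descent(X, y, alpha=0.5))
```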
Algorithm tuning:
1. Choice of step size: if the step size is too large, each iteration moves too far and may even skip over the optimal solution; if it is too small, iteration is slow and it may take a long time to reach the optimum.
2. Choice of initial parameter values: as the figure above suggests, gradient descent does not necessarily find the global optimum; it may stop at a local optimum, so the choice of initial values affects the final result. If the loss function is convex, the result is guaranteed to be the global optimum; otherwise, run the algorithm several times with different initial values and keep the run that minimizes the loss.
3. Normalization: because different features have very different value ranges, iteration can be slow. To reduce the influence of feature scale, the feature data can be standardized, i.e. transformed to have mean 0 and standard deviation 1 (see the sketch after this list).
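A minimal sketch of the standardization mentioned in item 3, assuming the features are stored in a plain NumPy array (the data values are hypothetical):

```python
import numpy as np

def standardize(X):
    """Scale each feature column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant features
    return (X - mean) / std, mean, std

# Hypothetical features with very different value ranges.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # approximately 0 and 1
```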
The gradient descent family (a sketch comparing the three variants follows this list):
Batch gradient descent (BGD): use all samples to update the parameters;
Stochastic gradient descent (SGD): use only a single sample to compute the gradient;
Mini-batch gradient descent (MBGD): a compromise between BGD and SGD, i.e. randomly select a subset of samples for each update.
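To make the three variants concrete, here is a minimal sketch (a hypothetical squared-loss setup, not taken from the referenced articles) showing how each variant chooses the samples used to compute the gradient for one parameter update:

```python
import numpy as np

def grad(X, y, theta):
    """Gradient of the mean squared loss on the given samples."""
    return X.T @ (X @ theta - y) / len(y)

def bgd_step(X, y, theta, alpha):
    # Batch gradient descent: use all samples for one update.
    return theta - alpha * grad(X, y, theta)

def sgd_step(X, y, theta, alpha, rng):
    # Stochastic gradient descent: use a single randomly chosen sample.
    i = rng.integers(len(y))
    return theta - alpha * grad(X[i:i + 1], y[i:i + 1], theta)

def mbgd_step(X, y, theta, alpha, rng, batch_size=16):
    # Mini-batch gradient descent: a compromise, use a random subset of samples.
    idx = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
    return theta - alpha * grad(X[idx], y[idx], theta)
```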
The least squares method
The objective function is as follows:

J(θ) = Σᵢ (yᵢ − f(xᵢ; θ))²

Here the observations yᵢ are our sample values, and the theoretical values f(xᵢ; θ) come from the fitting function we assume; in machine learning the objective function is also called the loss function. The least squares method looks for a set of parameters θ that makes the sum of squared residuals J(θ) as small as possible, i.e. it seeks θ = argmin J(θ).
The least squares solution: writing the model in matrix form as Xθ ≈ Y and setting the gradient of J(θ) to zero gives the normal equation, whose solution is

θ = (XᵀX)⁻¹ XᵀY
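A minimal NumPy sketch of this closed-form solution on hypothetical data; np.linalg.lstsq is shown alongside the explicit inverse because it is the numerically safer way to compute the same solution:

```python
import numpy as np

# Hypothetical data: y = 1 + 2*x + noise, with a bias column in X.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.1 * rng.normal(size=100)

# Normal equation: theta = (X^T X)^{-1} X^T Y
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The numerically preferred equivalent: solve the least squares problem directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal, theta_lstsq)
```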
Newton's method
The basic idea of Newton's method is to take a second-order Taylor expansion of f(x) around the current estimate of the minimum point and use it to compute the next estimate. Let xₖ be the current estimate of the minimum point; then:

φ(x) = f(xₖ) + f′(xₖ)(x − xₖ) + ½ f″(xₖ)(x − xₖ)²

To find the extremum, set φ′(x) = 0, i.e. f′(xₖ) + f″(xₖ)(x − xₖ) = 0, which gives the next estimate:

xₖ₊₁ = xₖ − f′(xₖ) / f″(xₖ)

(In the multivariate case the second derivative is the Hessian matrix, and the update uses its inverse.)
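A minimal one-dimensional sketch of this iteration; the test function f(x) = x⁴ − 3x³ + 2 and the starting point are hypothetical choices for illustration:

```python
def newton_minimize(f1, f2, x0, tol=1e-10, max_iter=50):
    """Minimize a 1-D function given its first (f1) and second (f2) derivatives."""
    x = x0
    for _ in range(max_iter):
        step = f1(x) / f2(x)      # x_{k+1} = x_k - f'(x_k) / f''(x_k)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical example: f(x) = x^4 - 3x^3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
f1 = lambda x: 4 * x ** 3 - 9 * x ** 2
f2 = lambda x: 12 * x ** 2 - 18 * x
print(newton_minimize(f1, f2, x0=3.0))   # converges to the minimizer x = 9/4
```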
Advantages of Newton's method:
Newton's method converges quadratically. For non-quadratic functions, if the function is close to quadratic, or the iterate has already entered a neighborhood of the minimum, convergence is very fast.
Disadvantages of Newton's method:
Newton's method has no step-size factor; each iteration takes a fixed-length step. For non-quadratic objective functions this can sometimes make the function value increase, which shows that Newton's method cannot guarantee a stable decrease of the function value.
Differences between gradient descent, the least squares method, and Newton's method
In common: all are unconstrained optimization algorithms.
Differences:
Gradient descent versus least squares: gradient descent requires choosing a step size, while least squares does not. Gradient descent solves the problem iteratively, while least squares computes an analytical solution. If the sample size is not large and an analytical solution exists, least squares has the advantage and is very fast. But if the sample size is very large, least squares must invert a very large matrix; computing the analytical solution then becomes difficult or slow, and the iterative gradient descent method has the advantage.
Gradient descent versus Newton's method / quasi-Newton methods: both solve the problem iteratively, but gradient descent uses only the gradient, while Newton's method / quasi-Newton methods use the inverse (or pseudo-inverse) of the second-order Hessian matrix. Newton's method / quasi-Newton methods converge in fewer iterations, but each iteration costs more than a gradient descent step.
References:
Gradient descent: https://www.cnblogs.com/pinard/p/5970503.html
Notes on Newton's method and quasi-Newton methods: https://blog.csdn.net/itplus/article/details/21896453
How to understand the least squares method: https://blog.csdn.net/ccnt_2012/article/details/81127117
The least squares method: https://zhuanlan.zhihu.com/p/38128785