
Introduction to gradient descent methods - Dark Horse Programmer machine learning handout

2022-06-24 04:36:00 Dark horse programmer official

Learning goals

  • Know the principle of the full gradient descent algorithm
  • Know the principle of the stochastic gradient descent algorithm
  • Know the principle of the stochastic average gradient descent algorithm
  • Know the principle of the mini-batch gradient descent algorithm

In the previous section we introduced the basic implementation process of the gradient descent method. The commonly used gradient descent algorithms are:

  • Full gradient descent algorithm (FG)
  • Stochastic gradient descent algorithm (SG)
  • Mini-batch gradient descent algorithm (mini-batch)
  • Stochastic average gradient descent algorithm (SAG)

All of them aim to adjust the weight vector correctly: a gradient is computed for each weight and used to update it, so that the objective function is minimized as far as possible. The difference lies in how the training samples are used.

1 Full gradient descent algorithm (FG)

It computes the errors of all samples in the training set, sums them, and takes the average as the objective function.

The weight vector is moved in the direction opposite to its gradient, which is the direction that reduces the current objective function the most.

Because every update requires computing the gradients over the entire data set, batch gradient descent is very slow, and it cannot handle data sets that exceed the memory capacity.

Batch gradient descent also cannot update the model online, that is, new samples cannot be added while it is running.

It computes the gradient of the loss function with respect to the parameters θ over the whole training data set:

θ = θ − η · ∇J(θ), where η is the learning rate.
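As an illustration, the following is a minimal NumPy sketch of full gradient descent on a linear regression model with a mean squared error objective. The synthetic data, learning rate, and iteration count are assumptions made for this example, not part of the handout.

```python
import numpy as np

# Illustrative synthetic data: 100 samples, 3 features (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # weight vector theta
eta = 0.1         # learning rate (illustrative)

for epoch in range(200):
    # Gradient of the mean squared error computed over ALL samples.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    # Move the weights in the direction opposite to the gradient.
    w -= eta * grad

print(w)  # approaches true_w
```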

2 Stochastic gradient descent algorithm (SG)

Because FG has to compute the errors of all samples at every iteration to update the weights, and practical problems often involve hundreds of millions of training samples, it is inefficient and easily trapped in a local optimum. The stochastic gradient descent algorithm was therefore proposed.

The objective function computed in each round is no longer the error over all samples but the error of a single sample: the gradient of one sample's objective function is computed to update the weights, then another sample is drawn and the process is repeated, until the loss value stops falling or drops below a tolerable threshold.

This process is simple and efficient, and it can generally avoid the update iterations converging to a local optimum. Its iterative form is

θ = θ − η · ∇J(θ; x(i), y(i))

where x(i) represents the feature values of a training sample and y(i) represents its label value.

However, because SG uses only one sample per iteration, it easily falls into a local optimum when the data are noisy.
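A minimal sketch of the same illustrative regression problem trained with stochastic gradient descent, where each update uses the gradient of a single randomly drawn sample; the data, learning rate, and fixed step count (used here instead of a loss-based stopping threshold) are assumptions for the example.

```python
import numpy as np

# Same illustrative synthetic data as in the FG sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.01  # learning rate (illustrative)

for step in range(5000):
    i = rng.integers(len(X))     # draw one sample x(i), y(i) at random
    err = X[i] @ w - y[i]
    grad = 2.0 * err * X[i]      # gradient of this single sample's squared error
    w -= eta * grad

print(w)  # approximately recovers true_w
```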

3 Mini-batch gradient descent algorithm (mini-batch)

The mini-batch gradient descent algorithm is a compromise between FG and SG that, to some extent, combines the advantages of both methods.

Each time, a small set of samples is drawn at random from the training set, and FG is run on this extracted subset to iteratively update the weights.

The number of samples in the extracted subset is called batch_size; it is usually set to a power of 2, which is more favorable for GPU-accelerated processing.

In particular, if batch_size=1 it becomes SG, and if batch_size=n (the size of the training set) it becomes FG. Its iterative form is

θ = θ − η · ∇J(θ; x(i:i+batch_size), y(i:i+batch_size))
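A minimal sketch of mini-batch gradient descent on the same illustrative problem; the batch_size of 16 (a power of 2, as suggested above), the learning rate, and the step count are assumptions chosen for the example.

```python
import numpy as np

# Same illustrative synthetic data as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.05
batch_size = 16  # batch_size=1 would reduce to SG, batch_size=len(X) to FG

for step in range(1000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)            # gradient on the mini-batch only
    w -= eta * grad

print(w)  # approximately recovers true_w
```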

4 Stochastic average gradient descent algorithm (SAG)

Although the SG method avoids the problem of high computational cost, its performance on large-scale training data is often unsatisfactory, because each round of gradient updating is completely independent of the data and gradients of previous rounds.

The stochastic average gradient algorithm overcomes this problem. It keeps an old gradient in memory for every sample; at each step it randomly selects the i-th sample and updates that sample's stored gradient while leaving the gradients of the other samples unchanged, then takes the average of all stored gradients and uses it to update the parameters.

In this way, each round of updating only needs to compute the gradient of a single sample, so the computational cost is comparable to SG, but the convergence is much faster.
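A minimal sketch of the stochastic average gradient idea described above: one stored gradient per sample is refreshed at each step, and the weights move along the average of all stored gradients. The data and step size are assumptions for the example.

```python
import numpy as np

# Same illustrative synthetic data as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

n, d = X.shape
w = np.zeros(d)
eta = 0.01
grad_table = np.zeros((n, d))  # one stored ("old") gradient per sample
grad_sum = np.zeros(d)         # running sum of the stored gradients

for step in range(20000):
    i = rng.integers(n)                      # refresh only sample i's gradient
    new_grad = 2.0 * (X[i] @ w - y[i]) * X[i]
    grad_sum += new_grad - grad_table[i]     # keep the sum of stored gradients up to date
    grad_table[i] = new_grad
    w -= eta * grad_sum / n                  # step with the average of all stored gradients

print(w)  # approximately recovers true_w
```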

5 Summary

  • Full gradient descent algorithm (FG)【know】
    • In each calculation, the average error of all samples is used as the objective function
  • Stochastic gradient descent algorithm (SG)【know】
    • Only one sample is selected at a time for the gradient calculation
  • Mini-batch gradient descent algorithm (mini-batch)【know】
    • A small subset of samples is selected for each update
  • Stochastic average gradient descent algorithm (SAG)【know】
    • An old gradient is maintained in memory for each sample, and later updates use the average of these stored gradients