Li Hongyi Machine Learning (2021 Edition)_P7-9: Training Tips
2022-07-27 01:13:00 【Although Beihai is on credit, Fuyao can take it】
Catalog
- Related information
- 1、Optimization (adaptive learning rate)
- 2、Classification
- 3、Batch Normalization

Related information
Video link: https://www.bilibili.com/video/BV1JA411c7VT
Original course page: https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.html
LeeML-Notes open-source notes: https://github.com/datawhalechina/leeml-notes
1、Optimization (adaptive learning rate)

During training, the loss often stops decreasing even though the gradient has not actually fallen to a small value.
One reason is that the parameters oscillate between the two walls of a valley on the loss surface and stop making progress:
1.1、Gradients change at different rates in different directions (convex case)
1.1.1、Applicable scenario

Assume the loss function is convex. In the figure below, $w$ and $b$ are the two parameters of the loss; the gradient along $b$ changes slowly while the gradient along $w$ changes quickly.
With a single shared learning rate it is hard to optimize well in both dimensions at once, so different parameters need different learning rates.
1.1.2、Scaling the learning rate per parameter
The vanilla parameter update:

$$\theta_{i}^{t+1} = \theta_{i}^{t} - \eta\, g_{i}^{t}$$
The adjusted update scales the learning rate per parameter:

$$\theta_{i}^{t+1} = \theta_{i}^{t} - \frac{\eta}{\sigma_{i}^{t}}\, g_{i}^{t}$$

The term $\sigma_{i}^{t}$ adjusts the learning rate for parameter $i$; it can be computed in the following ways.
1.1.3、Root Mean Square

$$\sigma_{i}^{t} = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\left(g_{i}^{k}\right)^{2}}$$

(the sum runs over the gradients of parameter $i$ from step 0 to step $t$)
The root mean square takes into account all gradients computed so far, so the step taken at the current iteration reflects how large the gradients have been up to now.
Adagrad computes $\sigma_{i}^{t}$ this way.
The effect of optimizing with this RMS scaling:
in dimensions where the gradient is large, $\sigma_{i}^{t}$ is large and the effective step is small, so the parameter moves cautiously; in dimensions where the gradient is small, $\sigma_{i}^{t}$ is small and the effective step is large, so the parameter moves faster and gradually approaches the optimum.
1.2、The gradient along one direction changes across the loss surface

1.2.1、RMSProp Method
For other, non-convex losses, the gradient along a single direction can be small in one region and large in another. With the RMS method the effective learning rate changes only slowly, so RMSProp instead uses an exponentially weighted average, which adapts faster, with $0 < \alpha < 1$:
$$\sigma_{i}^{t} = \sqrt{\alpha\left(\sigma_{i}^{t-1}\right)^{2} + (1-\alpha)\left(g_{i}^{t}\right)^{2}}$$
In RMSProp, recent gradients have the larger influence on the effective learning rate, while gradients from the distant past matter less.
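A corresponding RMSProp-style sketch; the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(theta, grad, sigma_sq, lr=0.001, alpha=0.9, eps=1e-8):
    """One update: sigma^2 is an exponential moving average of squared gradients,
    so recent gradients dominate the effective learning rate (0 < alpha < 1)."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * np.square(grad)
    theta = theta - lr / (np.sqrt(sigma_sq) + eps) * grad
    return theta, sigma_sq

# usage: initialise sigma_sq to zeros with the same shape as theta
theta, sigma_sq = np.zeros(2), np.zeros(2)
grad = np.array([0.3, -0.7])          # illustrative gradient
theta, sigma_sq = rmsprop_step(theta, grad, sigma_sq)
```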
1.2.2、Adam: RMSProp + Momentum

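The lecture only names Adam at this point; below is a hedged sketch of the usual formulation, combining a momentum term with RMSProp-style scaling and the bias correction from the original Adam paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment m) + RMSProp-style scaling (second moment v),
    with bias correction for the first few steps; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad             # moving average of gradients
    v = beta2 * v + (1 - beta2) * np.square(grad)  # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```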
1.3、 Directly adjust the learning rate
Besides the per-parameter scaling term $\sigma_{i}^{t}$, the global learning rate $\eta^{t}$ itself can be modified directly for different stages of training (different epochs), a coarse, schedule-level control:

$$\theta_{i}^{t+1} = \theta_{i}^{t} - \frac{\eta^{t}}{\sigma_{i}^{t}}\, g_{i}^{t}$$
- Learning Rate Decay: as training progresses and the parameters approach the optimum, gradually reduce the learning rate.
- Warm Up: often combined with a decay schedule such as cosine annealing.

With warm up, the learning rate first increases and then decreases.
At the very start of training, the statistic $\sigma_{i}^{t}$ is estimated from only a few gradients and has large variance, i.e. it is unreliable, so the learning rate is kept relatively small while the optimizer explores and collects statistics about the loss surface; once the statistics stabilize after a while, the learning rate is raised and then decayed.
1.4、Summary

$$\theta_{i}^{t+1} = \theta_{i}^{t} - \frac{\eta^{t}}{\sigma_{i}^{t}}\, m_{i}^{t}$$
The update can be tuned in three places (a combined sketch follows this list):
- Adjust $\sigma_{i}^{t}$: a parameter-dependent scale built from the magnitudes of past gradients, which fine-tunes the shared learning rate $\eta$ for each parameter.
- Adjust $\eta^{t}$: schedule the global learning rate directly over training, based on training experience (coarse control).
- Adjust $m_{i}^{t}$: introduce momentum $m_{i}^{t}$, a weighted sum of past gradients that keeps both their direction and their magnitude, helping the update escape local minima.
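Putting the three pieces together, a minimal sketch of the combined update $\theta \leftarrow \theta - \frac{\eta^{t}}{\sigma}\, m$; the particular moving averages and the decay schedule are illustrative choices, not the lecture's:

```python
import numpy as np

def combined_step(theta, grad, m, sigma_sq, step,
                  base_lr=0.001, beta=0.9, alpha=0.999, eps=1e-8):
    """theta <- theta - (eta^t / sigma) * m
    m       : momentum, a moving average of past gradients (direction + size)
    sigma   : root of a moving average of squared gradients (per-parameter scale)
    eta^t   : a scheduled global learning rate (here a simple illustrative decay)
    """
    m = beta * m + (1 - beta) * grad
    sigma_sq = alpha * sigma_sq + (1 - alpha) * np.square(grad)
    eta_t = base_lr / np.sqrt(step + 1)
    theta = theta - eta_t / (np.sqrt(sigma_sq) + eps) * m
    return theta, m, sigma_sq
```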
2、Classification
2.1、 Definition
Classification takes an input and outputs one of a fixed set of categories.
2.2、 Network structure

For a classification network, the same input is processed to produce as many outputs as there are classes; comparing these outputs gives the final predicted class.
Unlike regression, classification passes the raw outputs $y$ through Softmax to obtain the normalized values $y^{\prime}$.
Softmax maps the outputs into the range 0 to 1 (and makes them sum to 1).
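A minimal softmax sketch in NumPy (subtracting the maximum is a standard numerical-stability trick, not something the lecture dwells on):

```python
import numpy as np

def softmax(y):
    """Map raw network outputs y to values in (0, 1) that sum to 1."""
    y = y - np.max(y)              # subtract the max for numerical stability
    exp_y = np.exp(y)
    return exp_y / np.sum(exp_y)

print(softmax(np.array([3.0, 1.0, -2.0])))   # -> roughly [0.876, 0.118, 0.006]
```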
2.3、 Loss function
The total loss of the model is the average of the per-example losses; common choices for the per-example loss are the mean square error and the cross-entropy.
- Mean Square Error (MSE)

$$e = \sum_{i}\left(\widehat{y}_{i} - y_{i}^{\prime}\right)^{2}$$

- Cross-entropy

$$e = -\sum_{i}\widehat{y}_{i}\ln y_{i}^{\prime}$$
Of these two loss functions, cross-entropy is better suited to classification. As an example, consider a three-class problem:
computing the loss with mean square error and with cross-entropy gives the loss surfaces shown below.
The mean-square-error surface is very flat over large regions, which makes training difficult;
the cross-entropy surface keeps a noticeable slope even far from the optimum, which makes training easier.
Changing the loss function can therefore change the difficulty of the optimization (the lecture's joke about using "Shinra Tensei" to flatten a rugged loss surface).
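A small sketch of the three-class example, computing both losses for the same prediction; the numbers are made up for illustration:

```python
import numpy as np

def softmax(y):
    y = y - np.max(y)
    e = np.exp(y)
    return e / e.sum()

y_hat = np.array([1.0, 0.0, 0.0])                 # one-hot target: class 1 of 3
y_prime = softmax(np.array([2.0, 1.0, -1.0]))     # network output after softmax

mse = np.sum((y_hat - y_prime) ** 2)              # e = sum_i (y_hat_i - y'_i)^2
cross_entropy = -np.sum(y_hat * np.log(y_prime))  # e = -sum_i y_hat_i ln y'_i

print(mse, cross_entropy)
```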
3、Batch Normalization
3.1、 Concept
When a function has several input features whose value ranges differ greatly, it is advisable to rescale them so that the different inputs share a similar range, e.g. for
$$y = b + w_{1}x_{1} + w_{2}x_{2}$$
3.2、Reason
Suppose $x_1$ takes small values and $x_2$ takes large values. Then a change in $w_1$ has little effect on $y$ and hence on the loss, so the derivative with respect to $w_1$ is small and the loss surface is smooth along the $w_1$ direction; by the same argument the surface is steep along the $w_2$ direction.
The elongated loss surface on the left is exactly the situation discussed above, which is hard to handle without an adaptive method such as Adagrad.
- The two directions need different learning rates, which a single shared learning rate cannot provide; in the normalized case on the right, updating the parameters is much easier.
- On the left, gradient descent does not move toward the lowest point but along the normal of the contour lines; on the right (green), the updates head toward the center, i.e. the lowest point, so parameter updates are more efficient.
3.3、How to scale
Batch normalization rescales the features, here by standardizing them to zero mean and unit variance.
In the figure above, each column is one example, containing one feature vector.
For each dimension $i$ (the green box), compute the mean, written $m_i$, and the standard deviation, written $\sigma_i$.
Then, for the $i$-th dimension of example $r$, subtract the mean $m_i$ and divide by the standard deviation $\sigma_i$; after this, every dimension has mean 0 and variance 1 (a standard-normal-like distribution).
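A sketch of this standardization step on a batch of feature vectors, assuming the examples are stored as rows of a NumPy array (the lecture's figure stacks them as columns, but the per-dimension computation is the same):

```python
import numpy as np

def feature_normalize(X, eps=1e-8):
    """Standardize each feature dimension to mean 0 and variance 1.
    X has shape (num_examples, num_features)."""
    m = X.mean(axis=0)            # m_i: mean of dimension i over the examples
    sigma = X.std(axis=0)         # sigma_i: standard deviation of dimension i
    return (X - m) / (sigma + eps)

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
print(feature_normalize(X))       # every column now has mean 0 and variance 1
```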
In deep learning, BN can generally be applied either before or after the activation function; when the activation is sigmoid, it is usually applied before the activation.
After the BN operation, because the examples in a batch share the same $\mu$ and $\sigma$, their normalized outputs are no longer independent of one another; they are coupled through these shared statistics.
Sometimes, so that the mean and scale can be adjusted, the BN output is further transformed with learnable parameters:

$$\tilde{z}^{i} = \frac{z^{i} - \mu}{\sigma}$$

$$\widehat{z}^{i} = \gamma \cdot \tilde{z}^{i} + \beta$$
At test time there may not be enough data to form a full batch, so the $\mu$ and $\sigma$ statistics accumulated during training (as moving averages) are used instead.
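A hedged sketch of a batch-norm layer with the learnable $\gamma$, $\beta$ and the running statistics used at test time; this is a simplified stand-in for framework implementations such as `torch.nn.BatchNorm1d`, not the lecture's code:

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch normalization over a (batch, features) array."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)      # learnable scale
        self.beta = np.zeros(num_features)      # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training=True):
        if training:
            mu, var = z.mean(axis=0), z.var(axis=0)
            # keep moving averages for test time, when a full batch is unavailable
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        z_tilde = (z - mu) / np.sqrt(var + self.eps)     # standardize
        return self.gamma * z_tilde + self.beta          # z_hat = gamma * z_tilde + beta
```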

3.4、 Training results

Using BN speeds up training, and some models that are otherwise hard to train can be trained successfully with it.