Li Hongyi machine learning study group check-in, day 05: network design techniques

2022-07-27 05:27:00 【Charleslc's blog】

Preface

I signed up for a study group. This session covers network design techniques, corresponding to videos P5-P9 of Li Hongyi's deep learning course.

Reference video: https://www.bilibili.com/video/av59538266
Reference notes: https://github.com/datawhalechina/leeml-notes

Local minimum and saddle point

During gradient descent, optimization sometimes fails because it reaches a point where the gradient is zero. Such a point is not necessarily a local minimum (local minima); it may also be a saddle point.

So how do we tell a saddle point from a local minimum?

Use the Taylor expansion. At a critical point (where the first derivative is zero), the first-order term vanishes, so we only need to examine the second-order term.

If all eigenvalues of the Hessian matrix H are greater than zero, the point is a local minimum; if all eigenvalues are less than zero, it is a local maximum; if some eigenvalues are greater than zero and some are less than zero, it is a saddle point.

If training is stuck at a saddle point, move along an eigenvector whose eigenvalue is negative: along such a direction the second-order term is negative, so the loss decreases.
For example: for the example matrix, one eigenvalue works out to -2 with eigenvector [1;1], so moving along the direction of that eigenvector escapes the saddle point.
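The eigenvalue test above can be sketched with NumPy. The matrix `H` below is a hypothetical Hessian, chosen only so that it has the eigenvalue -2 with eigenvector [1;1] mentioned above; the function name and the tolerance `tol` are also assumptions made for illustration.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    eigenvalues, eigenvectors = np.linalg.eigh(hessian)
    if np.all(eigenvalues > tol):
        kind = "local minimum"
    elif np.all(eigenvalues < -tol):
        kind = "local maximum"
    elif eigenvalues[0] < -tol and eigenvalues[-1] > tol:
        kind = "saddle point"
    else:
        kind = "inconclusive"  # some eigenvalue is numerically zero
    return kind, eigenvalues, eigenvectors

# Hypothetical Hessian (an assumption, not taken from the original post).
H = np.array([[0.0, -2.0],
              [-2.0, 0.0]])
kind, eigenvalues, eigenvectors = classify_critical_point(H)

# Eigenvector of the most negative eigenvalue: a direction that lowers the loss.
escape_direction = eigenvectors[:, 0]
print(kind, eigenvalues, escape_direction)
```

The returned direction is normalized and its sign is arbitrary, so it is a multiple of [1;1] rather than [1;1] itself.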

However, this computation is too expensive, so it is rarely used in practice.

Batch and momentum

Batch

We can optimize with batches: the samples are split into small batches, and one gradient descent update is performed per batch.
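A minimal sketch of per-batch updates, assuming a toy one-parameter model `y_hat = w * x` with mean squared error. The data, the batch size, the learning rate, and the helper name `minibatches` are all illustrative assumptions.

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, one list per batch."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

# Toy data for illustration only: y = 3 * x.
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]

rng = random.Random(0)
w, learning_rate = 0.0, 0.1
for epoch in range(50):
    # One epoch = one pass over all batches; the data is reshuffled each epoch.
    for batch in minibatches(len(xs), batch_size=5, rng=rng):
        # Gradient of the mean squared error over this batch only.
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= learning_rate * grad

print(w)
```

With 20 samples and a batch size of 5, each epoch performs 4 parameter updates instead of 1.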

Small batch vs. large batch

Thanks to GPU parallel computation, one update with a batch size of 100 takes about the same time as one update with a batch size of 1.

In general, a smaller batch takes more time to finish one epoch, because it needs more updates.

Small batches tend to give better results.

Summary

Momentum

Momentum keeps part of the previous update's direction: the previous movement influences the current update, together with the current gradient.
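As a rough sketch of this idea, the update below keeps a running `movement` that mixes the previous movement with the current gradient. The quadratic loss, the learning rate, and the momentum coefficient 0.9 are assumptions chosen for illustration.

```python
def momentum_descent(grad_fn, theta, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent where each step also carries over the previous movement."""
    movement = 0.0
    for _ in range(steps):
        grad = grad_fn(theta)
        # New movement = fraction of the last movement minus the gradient step.
        movement = momentum * movement - learning_rate * grad
        theta += movement
    return theta

# Illustrative loss (theta - 1)^2, whose gradient is 2 * (theta - 1).
theta_final = momentum_descent(lambda t: 2.0 * (t - 1.0), theta=5.0)
print(theta_final)
```

Setting `momentum=0.0` recovers plain gradient descent, which makes the two easy to compare.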

Automatically adjust the learning rate

RMSProp

RMSProp update rule:
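A minimal sketch of one commonly used form of the RMSProp update, for a single scalar parameter. The decay rate 0.9, the learning rate, the small constant `eps`, and the toy loss are assumptions; the exact form shown in the lecture may differ in details such as initialization.

```python
import math

def rmsprop_step(theta, grad, mean_square, learning_rate=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: scale the step by a running RMS of recent gradients."""
    # Exponential moving average of squared gradients.
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    theta = theta - learning_rate * grad / (math.sqrt(mean_square) + eps)
    return theta, mean_square

# Illustrative loss theta^2, whose gradient is 2 * theta.
theta, mean_square = 1.0, 0.0
for _ in range(2000):
    theta, mean_square = rmsprop_step(theta, 2.0 * theta, mean_square)

print(theta)
```

Because the step is divided by the recent gradient magnitude, its size stays on the order of the learning rate, so with a constant learning rate the parameter ends up hovering near the minimum rather than settling on it exactly.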

Adam

Adam: RMSProp + Momentum
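To make the combination concrete, here is a sketch of one Adam step for a scalar parameter, using the widely used default hyperparameters (0.9, 0.999, 1e-8) as assumptions. `m` plays the role of momentum and `v` the role of the RMSProp running average.

```python
import math

def adam_step(theta, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum-like average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # RMSProp-like average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # correct the bias from starting at zero
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative loss theta^2 with gradient 2 * theta, starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
theta_after_first_step, m, v = adam_step(theta, 2.0 * theta, m, v, t=1)
print(theta_after_first_step)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate.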

Summary

The loss function also has an impact

Cross-entropy is used more often than mean squared error for classification.
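One way to see why is to compare the gradients of the two losses with respect to the logits of a softmax classifier when the prediction is badly wrong. The sketch below does this in plain Python; the two-class logits `[10, -10]` and the function names are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log(p[target]) with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def mse_grad(logits, target):
    """Gradient of sum_i (p_i - y_i)^2 with respect to the logits."""
    probs = softmax(logits)
    errors = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    n = len(probs)
    # Chain rule through softmax: dp_i/dz_k = p_i * (delta_ik - p_k).
    return [
        sum(2.0 * errors[i] * probs[i] * ((1.0 if i == k else 0.0) - probs[k])
            for i in range(n))
        for k in range(n)
    ]

# The model is confidently wrong: it favors class 0, but the target is class 1.
logits, target = [10.0, -10.0], 1
ce = cross_entropy_grad(logits, target)
mse = mse_grad(logits, target)
print(ce, mse)
```

In this setting the cross-entropy gradient has magnitude close to 1, while the mean squared error gradient is nearly zero, so gradient descent with mean squared error can get stuck far from the solution.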

Example:

Batch Normalization

Feature Normalization

The role of feature normalization is to put every $x_i$ on the same range.
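A minimal sketch of one common way to do this, standardizing each feature dimension to zero mean and unit standard deviation with NumPy. The toy data, the function name, and the small constant `eps` are assumptions for illustration.

```python
import numpy as np

def standardize(features, eps=1e-8):
    """Normalize each column (feature dimension) to mean 0 and std 1."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)  # eps guards against zero variance

# Rows are samples, columns are feature dimensions with very different ranges.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
X_tilde = standardize(X)
print(X_tilde.mean(axis=0), X_tilde.std(axis=0))
```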

Considering Deep Learning

Deep learning has many layers, and the output of one layer is the input of the next. Even if the input to the previous layer has been normalized, the operations in the next layer can again produce values with very different ranges, so those values should be normalized in the same way.

However, when $z^i$ changes, $\mu$ and $\sigma$ change too, and the value of $z^{(i+1)}$ changes as well. Every computation depends on the others, so the whole network would have to hold a very large number of intermediate results. We therefore compute the statistics over a batch only, which is Batch Normalization.

After obtaining $\tilde{z}^i$, multiply it by $\gamma$ and add $\beta$. $\beta$ and $\gamma$ are trained separately as parameters of the network. They are there because forcing the mean of $\tilde{z}^i$ to be 0 may have a negative impact on the neural network.
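The training-time forward pass described above can be sketched with NumPy as follows. The function name, the constant `eps`, and the random test data are assumptions; inference-time behavior (running averages of the statistics) is not covered here.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mu = z.mean(axis=0)                        # per-feature mean over the batch
    var = z.var(axis=0)                        # per-feature variance over the batch
    z_tilde = (z - mu) / np.sqrt(var + eps)    # mean 0, std about 1
    return gamma * z_tilde + beta              # learned scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a batch of 64 samples, 4 features

out_default = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
out_scaled = batch_norm(z, gamma=np.full(4, 2.0), beta=np.full(4, 3.0))
print(out_default.mean(axis=0), out_scaled.mean(axis=0))
```

With `gamma` of ones and `beta` of zeros the output is just the normalized value; other values let the network move the mean and spread away from 0 and 1 when that helps.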