
Teacher Li Hongyi's Machine Learning Notes - 4.2 Batch Normalization

2022-06-10 06:10:00 Ning Meng, Julie

Note: this article contains my notes from Teacher Li Hongyi's Machine Learning course, 2021/2022 editions (see the course website in the references). The pictures in the text come from the course PPT. Comments and suggestions are welcome, thank you!

Lecture 4.2 training tip: Batch Normalization

This lesson introduces a training tip for deep neural networks, one that is also used when training CNNs.

Let's first look back at the situation shown on the left of the figure below: the loss curves along two parameters look very different, one steep and one gentle. For such an error surface it is hard to choose a suitable learning rate, and hard to reach the lowest point of the loss. The previous lesson introduced adaptive learning rates as one solution: since different parameters have different gradients, the learning rate of each parameter is adjusted accordingly.

[Figure: error surface with one steep and one gentle direction, and the learning-rate difficulty it causes (from the course PPT)]

This problem can also be solved by changing the landscape of the error surface. Let's first analyze why such an error surface appears, so that we know how to adjust it.

As shown in the figure above, the loss curves along $w_1$ and $w_2$ differ so much because the magnitudes of their inputs $x_1$ and $x_2$ differ greatly. If $w_1$ and $w_2$ each change by the same amount $\Delta w$, then because $x_1$ is small, the change in $y$ is small, and so the effect on $L$ is also small; therefore the gradient in the $w_1$ direction is small. In contrast, $x_2$ is two orders of magnitude larger than $x_1$, so the same change causes a large change in $y$, and accordingly the gradient in the $w_2$ direction is large.

This analysis shows that to make the gradients in the $w_1$ and $w_2$ directions comparable (as shown on the right of the figure above), the corresponding inputs $x_1$ and $x_2$ must lie in the same range of values. The method for achieving this is feature normalization.
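As a minimal numeric sketch (my own illustration, not from the course), consider a single linear neuron $y = w_1 x_1 + w_2 x_2$ with a squared-error loss: by the chain rule, $\partial L / \partial w_i = (\partial L / \partial y)\,x_i$, so the gradient of each weight scales directly with its input.

```python
# Minimal sketch: the gradient of each weight scales with its input.
x1, x2 = 0.01, 1.0          # x2 is two orders of magnitude larger than x1
w1, w2, target = 1.0, 1.0, 0.0

y = w1 * x1 + w2 * x2       # a single linear neuron
dL_dy = 2 * (y - target)    # derivative of the squared error L = (y - target)**2 w.r.t. y

dL_dw1 = dL_dy * x1         # chain rule: dL/dw_i = dL/dy * x_i
dL_dw2 = dL_dy * x2
print(dL_dw1, dL_dw2)       # the gradient for w2 is 100x the gradient for w1
```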

Here is one way to do feature normalization. Suppose $x^1, x^2, \dots, x^R$ are a set of input samples. In each dimension (feature), subtract the corresponding mean and divide by the corresponding standard deviation, as shown in the formula in the figure below. After normalization, the input samples have mean 0 and variance 1 in every dimension.

[Figure: feature normalization formula, one mean and standard deviation per dimension (from the course PPT)]
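The following is a small sketch of this formula with assumed shapes (my own example, not course code): compute one mean and one standard deviation per dimension over the $R$ samples, then normalize.

```python
# Feature normalization sketch: per-dimension mean 0, variance 1.
import numpy as np

X = np.random.rand(100, 3) * np.array([1.0, 50.0, 1000.0])  # R=100 samples, 3 features on very different scales

mean = X.mean(axis=0)        # m_i: one mean per dimension
std = X.std(axis=0)          # sigma_i: one standard deviation per dimension
X_norm = (X - mean) / std

print(X_norm.mean(axis=0))   # approximately [0, 0, 0]
print(X_norm.std(axis=0))    # approximately [1, 1, 1]
```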

My thoughts: traditional machine learning methods, such as linear regression, perform similar operations on the data. However, in traditional machine learning the features are first selected by hand, and then the feature vectors are normalized (standardization or min-max normalization). Deep learning learns features automatically; we do not know how the model will select or combine each dimension of the vector to form features, so normalization is applied to all dimensions.

A question: for a DNN the input is a vector, so normalization is easy to understand: each dimension of the input vector is normalized. But the input to a CNN is an image; along which dimension is normalization done? It is done along the depth (channel) dimension: for a color image, compute the mean and variance of each of the 3 channels separately, then normalize each channel. I recommend the example in reference [2] below, which walks through the calculation once and makes it very clear.
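A small sketch of this per-channel normalization, under an assumed (batch, channel, height, width) layout; the tensor shapes are my own choice:

```python
# Per-channel normalization sketch for a batch of color images.
import torch

imgs = torch.rand(8, 3, 32, 32)                  # (batch, channels, height, width)

mean = imgs.mean(dim=(0, 2, 3), keepdim=True)    # one mean per channel -> shape (1, 3, 1, 1)
std = imgs.std(dim=(0, 2, 3), keepdim=True)      # one standard deviation per channel
imgs_norm = (imgs - mean) / std                  # each channel now has mean 0, variance 1
```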

In addition, in deep learning we also need to consider the outputs of the hidden (intermediate) layers. As shown in the figure below, $z^i$ is also the input to the next layer, so it needs to be normalized as well.

[Figure: the hidden layer output $z^i$ is the input to the next layer, so it is normalized too (from the course PPT)]

So where should the normalization be applied?

Generally speaking, normalizing before the activation function (on $z^i$) or after it (on $a^i$) both work. However, if the activation function is Sigmoid, normalizing $z^i$ works better, because the Sigmoid function has larger gradients in the region around 0, which is exactly where the normalized values fall.
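As an illustration (my own PyTorch example, not from the course), the two placements look like this; `nn.BatchNorm1d` is PyTorch's standard batch-normalization layer for vector inputs:

```python
# Placement sketch: normalize z (before the activation) vs. a (after it).
import torch.nn as nn

# Normalize z^i, i.e. before the Sigmoid (the choice recommended above):
net_z = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.Sigmoid())

# Normalize a^i, i.e. after the Sigmoid:
net_a = nn.Sequential(nn.Linear(16, 32), nn.Sigmoid(), nn.BatchNorm1d(32))
```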

After normalizing $z^i$, an interesting phenomenon appears. Originally, the data points (samples) did not affect each other; for example, a change in $z^1$ only affected $a^1$. But the normalization computes the mean and variance over all the samples, so, as shown in the figure below, a change in $z^1$ now affects $\tilde z^2$ and $\tilde z^3$ through $\mu$ and $\sigma$, which in turn affects $a^2$ and $a^3$.

[Figure: a change in $z^1$ propagates through $\mu$ and $\sigma$ to $\tilde z^2$ and $\tilde z^3$ (from the course PPT)]

In other words, previously we only had to consider one sample at a time: feed it into the deep neural network and get one output. Normalization makes the samples influence one another, so now we consider a batch of samples at a time, feed it into one large network, and get a batch of outputs.
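A tiny sketch (my own illustration) of this coupling: within a batch, changing a single sample's $z$ changes the normalized value of every sample, because $\mu$ and $\sigma$ are computed over the whole batch.

```python
# Coupling sketch: batch statistics tie the samples together.
import torch

z = torch.tensor([[1.0], [2.0], [3.0]])                 # a batch of 3 samples, 1 dimension
z_tilde = (z - z.mean(dim=0)) / z.std(dim=0)

z_changed = z.clone()
z_changed[0, 0] = 100.0                                 # change only the first sample
z_tilde_changed = (z_changed - z_changed.mean(dim=0)) / z_changed.std(dim=0)

print(z_tilde.squeeze())          # normalized values before the change
print(z_tilde_changed.squeeze())  # all three values changed, not just the first
```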

In practice the whole training set is far too large, so the normalization only considers the samples within one batch, hence the name Batch Normalization. Note: when the batch size is very small, for example 1, the mean and variance cannot be estimated meaningfully, so Batch Normalization only works well when the batch size is reasonably large.

Forcing every dimension of the input samples to a distribution with mean 0 and variance 1 may not always be the best choice for the problem, so a small modification is made, as shown in the formula in the upper right corner of the figure below. Note: the values of $\gamma$ and $\beta$ are learned, i.e. adjusted according to the model and the data. $\gamma$ is initialized as an all-ones vector and $\beta$ as an all-zeros vector.

[Figure: Batch Normalization with learnable $\gamma$ and $\beta$ (from the course PPT)]
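Below is a minimal sketch of a Batch Normalization forward pass in training mode; the shapes and the epsilon added for numerical stability are my own assumptions, not course code.

```python
# Batch Normalization forward pass (training mode) with learnable gamma and beta.
import torch

def batch_norm_train(z, gamma, beta, eps=1e-5):
    mu = z.mean(dim=0)                        # per-dimension mean over the batch
    var = z.var(dim=0, unbiased=False)        # per-dimension variance over the batch
    z_hat = (z - mu) / torch.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * z_hat + beta               # rescale and shift

z = torch.randn(32, 10)      # a batch of 32 samples, 10 dimensions
gamma = torch.ones(10)       # learnable, initialized to all ones
beta = torch.zeros(10)       # learnable, initialized to all zeros
z_tilde = batch_norm_train(z, gamma, beta)
```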

Batch Normalization works well, but a problem arises at testing time. During testing, or in actual deployment, data has to be processed as it arrives; we cannot wait to collect a whole batch before processing.

What can be done?

During training, every time a batch is processed, the moving average $\overline\mu$ of the mean $\mu$ is updated according to the formula shown in the figure below, and $\sigma$ is handled in the same way. After training, the resulting $\overline\mu$ and $\overline\sigma$ are used as the mean and variance for the testing data.

[Figure: moving averages of $\mu$ and $\sigma$ computed during training and used at test time (from the course PPT)]
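A minimal sketch of this idea; the decay factor $p$ of the moving average is my own assumption, chosen only for illustration:

```python
# Running-statistics sketch: update moving averages during training, use them at test time.
import torch

def update_running_stats(z, running_mu, running_var, p=0.9):
    # Moving average: new_running = p * old_running + (1 - p) * batch_statistic
    mu = z.mean(dim=0)
    var = z.var(dim=0, unbiased=False)
    running_mu = p * running_mu + (1 - p) * mu
    running_var = p * running_var + (1 - p) * var
    return running_mu, running_var

def batch_norm_test(z, gamma, beta, running_mu, running_var, eps=1e-5):
    # At test time the batch statistics are replaced by the moving averages.
    return gamma * (z - running_mu) / torch.sqrt(running_var + eps) + beta
```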

Why does Batch Normalization help with optimizing the parameters?

“Experimental results (and theoretically analysis) support batch normalization change the landscape of error surface.”

That is, experimental results and theoretical analysis support the claim that Batch Normalization changes the landscape of the error surface, making optimization easier.

If you find this article helpful, please like and support it, thank you!

Follow me, Ning Meng Julie, so we can learn from each other and communicate more!

For more notes, please see the table of contents of my notes on Teacher Li Hongyi's Machine Learning course.

References

1. Teacher Li Hongyi, Machine Learning 2022:

Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php

Video: https://www.bilibili.com/video/BV1Wv411h7kN

2. A worked example of the Batch Normalization calculation: https://blog.csdn.net/weixin_42211626/article/details/122854079
