Teacher Li Hongyi's Machine Learning Notes - 4.2 Batch Normalization
2022-06-10 06:10:00 【Ning Meng, Julie】
Note: These are my notes from Teacher Li Hongyi's Machine Learning course, 2021/2022 (course website). The pictures in the text are from the course PPT. Comments and suggestions are welcome, thank you!
Lecture 4.2 Training Tip: Batch Normalization
This lesson introduces a training tip for deep neural networks, commonly used when training CNNs.
Let's first look back at the situation shown on the left of the figure below: the loss curves along the two parameters are very different, one steep and one gentle. It is hard to set a learning rate for such training, and hard to reach the lowest point of the loss. The previous lesson introduced adaptive learning rates as a solution: each parameter's learning rate is adjusted according to its gradient.

This problem can also be solved by changing the landscape of the error surface. So let's analyze why such an error surface arises, so that we know how to adjust it.
As shown in the figure above, the loss curves of $w_1$ and $w_2$ differ so much because the magnitudes of their inputs $x_1$ and $x_2$ differ greatly. Suppose $w_1$ and $w_2$ each make the same change $\triangle w$. Since $x_1$ is small, $y$ changes only a little, and the effect on $L$ is small, so the gradient in the $w_1$ direction is small. But $x_2$ is two orders of magnitude larger than $x_1$, so $y$ changes a lot; accordingly, the gradient in the $w_2$ direction is large.
From this analysis we find that, to make the gradients in the $w_1$ and $w_2$ directions comparable (as shown on the right of the figure above), the corresponding inputs $x_1$ and $x_2$ should be in the same range. The corresponding method is feature normalization.
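To make this concrete, here is a minimal numeric sketch (my own illustration, not from the course) of a linear model $y = w_1 x_1 + w_2 x_2$ with squared-error loss: the gradient with respect to each weight scales with that weight's input, so a feature 100x larger yields a gradient 100x larger.

```python
# Toy linear model y = w1*x1 + w2*x2 with squared-error loss L = (y - y_hat)^2.
# dL/dw_i = 2*(y - y_hat)*x_i, so the gradient scales with the input x_i.
x1, x2 = 1.0, 100.0             # x2 is two orders of magnitude larger than x1
w1, w2, y_hat = 0.5, 0.5, 10.0

y = w1 * x1 + w2 * x2
grad_w1 = 2 * (y - y_hat) * x1  # small, because x1 is small
grad_w2 = 2 * (y - y_hat) * x2  # 100x larger, because x2 is 100x larger
print(grad_w1, grad_w2)         # 81.0 8100.0
```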
This lecture introduces one way of doing feature normalization. The specific operation: suppose $x^1, x^2, \ldots, x^R$ is a set of input samples. In each dimension (feature) $i$, subtract the corresponding mean $m_i$ and divide by the corresponding standard deviation $\sigma_i$, i.e. $\tilde{x}^r_i = \frac{x^r_i - m_i}{\sigma_i}$, as shown in the formula in the figure below. After normalization, the input samples have mean 0 and variance 1 in each dimension.

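A minimal NumPy sketch of this standardization (my own illustration; variable names are assumptions):

```python
import numpy as np

# R = 3 samples, each a 2-dimensional feature vector (rows = samples).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mean = X.mean(axis=0)               # per-dimension mean m_i
std = X.std(axis=0)                 # per-dimension standard deviation sigma_i
X_norm = (X - mean) / (std + 1e-8)  # epsilon guards against division by zero

print(X_norm.mean(axis=0))  # ~0 in every dimension
print(X_norm.std(axis=0))   # ~1 in every dimension
```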
My thinking: traditional machine learning methods, such as linear regression, perform similar operations on the data. However, in traditional machine learning the features are first selected manually, and then normalization (standardization, or min-max normalization) is done on the feature vector. A characteristic of deep learning is automatic feature learning: we don't know how the model will select or combine each dimension to generate features, so normalization is done on all dimensions.
Doubt: for a DNN, the input is a vector, and normalization is easy to understand: normalize each dimension of the input vector. But a CNN's input is an image, so along which dimension is it done? It is done along the depth (channel) dimension: for a color image, compute the mean and variance in each of the 3 channels separately, then normalize. I recommend the example in reference 2 below, which walks through the calculation once and makes it clear.
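A small sketch of per-channel normalization for a batch of images, assuming the common (N, C, H, W) layout (my own illustration, not from the course or the linked article):

```python
import numpy as np

# A batch of N = 8 color images: shape (N, C, H, W), C = 3 channels.
imgs = np.random.rand(8, 3, 32, 32)

# Per-channel statistics: average over the batch and spatial dimensions.
mean = imgs.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)
std = imgs.std(axis=(0, 2, 3), keepdims=True)
imgs_norm = (imgs - mean) / (std + 1e-8)
```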
Besides, in deep learning we also need to consider the outputs of the hidden (middle) layers. As shown in the figure below, $z^i$ is also the input to the next layer, so it also needs normalization.

Where should normalization be done?
Generally speaking, either before the activation function (on $z^i$) or after it (on $a^i$) is fine. However, if the activation function is Sigmoid, normalizing $z^i$ works better, because the Sigmoid function has large gradients in the region around 0, which is exactly where normalization centers the values.
After normalizing $z^i$, there is an interesting phenomenon. Originally, each piece of data (each sample) did not affect the others: for example, a change in $z^1$ only affects $a^1$. But normalization computes the mean and variance over all the samples, as shown in the figure below, so a change in $z^1$ will, through $\mu$ and $\sigma$, affect $\widetilde z^2$ and $\widetilde z^3$, which in turn affect $a^2$ and $a^3$.

In other words, previously we only had to consider one sample at a time: input it to the deep neural network and get one output. Normalization makes the samples influence each other, so now we must consider one batch of samples at a time, feed them into one large deep neural network, and get a batch of outputs.
In practice the whole training set is very large, so normalization only considers the samples within one batch, hence the name Batch Normalization. Note: when the batch size is very small, for example 1, the computed mean and variance are meaningless. Batch Normalization applies when the batch size is large.
Adjusting all dimensions of the input samples to a distribution with mean 0 and variance 1 may sometimes not be the best way to solve the problem, so a small change can be made, as shown in the formula in the upper right corner of the figure below: $\hat z^i = \gamma \odot \widetilde z^i + \beta$. Note: the values of $\gamma$ and $\beta$ are learned, i.e., adjusted according to the model and data. $\gamma$ is initialized as an all-ones vector and $\beta$ as an all-zeros vector.
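Putting the pieces together, here is a minimal sketch of the Batch Normalization forward pass in training mode, with learnable $\gamma$ and $\beta$ (my own illustration; the function and variable names are assumptions, not the course's code):

```python
import numpy as np

def batch_norm_train(z, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass in training mode.

    z:     (batch_size, dim) pre-activations of one layer
    gamma: (dim,) learnable scale, initialized to all ones
    beta:  (dim,) learnable shift, initialized to all zeros
    """
    mu = z.mean(axis=0)                 # per-dimension batch mean
    sigma = z.std(axis=0)               # per-dimension batch std
    z_tilde = (z - mu) / (sigma + eps)  # now mean 0, variance 1
    z_hat = gamma * z_tilde + beta      # learnable rescale and shift
    return z_hat, mu, sigma

# gamma starts as ones and beta as zeros; both are then learned.
z = np.random.randn(32, 4) * 10 + 5     # a batch of 32 samples, 4 dims
z_hat, mu, sigma = batch_norm_train(z, np.ones(4), np.zeros(4))
```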

Batch Normalization works well, but a problem arises at testing time. In testing, or in actual deployment, data is processed in real time; it is impossible to wait for a whole batch before processing.
What should we do?
During training, every time a batch is processed, update the moving average $\overline\mu$ of the mean $\mu$ according to the formula shown in the figure below, $\overline\mu \leftarrow p\,\overline\mu + (1-p)\,\mu^t$, and do the same for $\sigma$. After training, use the resulting $\overline\mu$ and $\overline\sigma$ as the mean and standard deviation for the testing data.
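A minimal sketch of maintaining the moving averages during training and using them at test time (my own illustration; the factor p is a hyperparameter whose value I have assumed, not one fixed by the course):

```python
import numpy as np

p = 0.9                                    # momentum-like factor (assumed value)
mu_bar, sigma_bar = np.zeros(4), np.ones(4)

# During training, update the moving averages once per batch.
for step in range(100):
    z = np.random.randn(32, 4)             # one training batch, 4 dims
    mu_bar = p * mu_bar + (1 - p) * z.mean(axis=0)
    sigma_bar = p * sigma_bar + (1 - p) * z.std(axis=0)

# At test time a single sample is normalized with the stored averages,
# so no batch is needed.
z_test = np.random.randn(1, 4)
z_norm = (z_test - mu_bar) / (sigma_bar + 1e-8)
```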

Why is Batch Normalization said to be helpful for optimization?
“Experimental results (and theoretically analysis) support batch normalization change the landscape of error surface.”
That is, experimental results and theoretical analysis support that Batch Normalization changes the error surface, making optimization easier.
If you think this article is good, please like and support it, thank you!
Follow me, Ning Meng Julie, so we can learn from each other and communicate more!
For more notes, please see Teacher Li Hongyi's Machine Learning notes – Table of Contents.
References
1. Teacher Li Hongyi, Machine Learning 2022:
Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
Videos: https://www.bilibili.com/video/BV1Wv411h7kN
2. Batch Normalization worked example: https://blog.csdn.net/weixin_42211626/article/details/122854079