NLP Generation Models (2017): The "Whys" of the Transformer
2022-07-06 00:23:00 【Ninja luantaro】
1. Briefly describe the feedforward neural network in the Transformer. What activation function does it use? What are its advantages and disadvantages?
The feedforward network applies two linear transformations with a ReLU activation in between. The formula is:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
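A minimal sketch of this position-wise FFN, assuming PyTorch (the sizes d_model = 512 and d_ff = 2048 are the values from the original Transformer paper, not stated above):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two linear transformations with a ReLU in between:
    FFN(x) = max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # x W1 + b1
        self.w2 = nn.Linear(d_ff, d_model)  # (...) W2 + b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))

# Applied independently at every position of a (batch, seq_len, d_model) tensor.
x = torch.randn(2, 10, 512)
print(PositionwiseFFN()(x).shape)  # torch.Size([2, 10, 512])
```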
Advantages:
- SGD converges faster with ReLU than with sigmoid or tanh (the gradient does not saturate in the positive region, which mitigates the vanishing-gradient problem);
- Low computational complexity: no exponential operations are needed;
- Well suited to backpropagation.

Disadvantages:
- The output of ReLU is not zero-centered;
- ReLU is "fragile" during training: carelessness can leave neurons permanently "dead" (the dead-ReLU problem). Because the gradient of ReLU is 0 for x < 0, a large negative update can push a neuron into a region where no input ever activates it again; from then on its gradient is always 0, its parameters are never updated, and the neuron no longer responds to any data. In practice, with too large a learning rate as much as 40% of a network's neurons can die, while a suitably small learning rate makes this rare. The phenomenon has two causes: poor parameter initialization, and a learning rate so high that parameter updates during training are too large. Remedies: use Xavier initialization, and avoid setting the learning rate too large or use an adaptive learning-rate algorithm such as Adagrad. A numeric sketch of the zero-gradient region follows this list.
- ReLU does not compress the magnitude of its inputs, so the range of activations keeps growing as the number of layers increases.
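To see the zero-gradient region behind the dead-ReLU problem, here is a small autograd sketch (the leaky-ReLU comparison illustrates one common remedy and is my addition, not something the original text prescribes):

```python
import torch
import torch.nn.functional as F

# ReLU: negative inputs get exactly zero gradient, so a neuron stuck
# in the negative region receives no updates and stays "dead".
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1., 1.])

# Leaky ReLU keeps a small slope on the negative side, so the
# gradient never vanishes completely.
y = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
F.leaky_relu(y, negative_slope=0.01).sum().backward()
print(y.grad)  # tensor([0.0100, 0.0100, 1.0000, 1.0000])
```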
2. Why add a Layer Normalization module?
Motivation: the Transformer stacks many layers, which makes it prone to vanishing or exploding gradients.

Reason: after the data passes through a layer's transformation it is no longer normalized, and the deviation grows larger and larger, so the data must be normalized again.

Purpose: before the data is fed into the activation function, normalize it to zero mean and unit variance, so that the inputs do not fall into the saturation region of the activation function and cause vanishing gradients.
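A from-scratch sketch of the computation (assuming the standard formulation with a learnable gain gamma and bias beta; the eps constant is a common default, not given in the text):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize each position's feature vector to zero mean and unit
    variance, then apply a learnable scale (gamma) and shift (beta)."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics over the last (feature) dimension only.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

x = torch.randn(2, 10, 512)
print(LayerNorm(512)(x).mean(dim=-1).abs().max())  # ~0: each position re-centered
```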
3. Why do Transformer blocks use LayerNorm instead of BatchNorm? Where is LayerNorm located in the Transformer?
The goal of normalization is to stabilize the distribution (reduce the variance of the data along each dimension).

Another way to ask this question: why does batch normalization work well for image processing, while layer normalization works better for natural language processing?

LayerNorm normalizes over the hidden-state (feature) dimension, whereas BatchNorm normalizes over the batch dimension. Unlike image tasks, NLP tasks do not have a constant batch shape: sequence lengths and batch composition usually vary, so BatchNorm's statistics have high variance; LayerNorm, whose statistics are computed per position, alleviates this problem. As for its location: in the original Transformer, LayerNorm is applied together with the residual connection after each sub-layer (the "Add & Norm" step, i.e. LayerNorm(x + Sublayer(x))).
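A short sketch of the difference in normalization axes on an NLP-shaped tensor (the (batch, seq_len, d_model) shape is an illustrative assumption):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 512)  # (batch, seq_len, d_model)

# LayerNorm: statistics are computed per position, over the 512 features,
# so they do not depend on batch size or sequence length.
ln = nn.LayerNorm(512)
print(ln(x).shape)  # torch.Size([4, 10, 512])

# BatchNorm1d expects (batch, channels, length) and computes statistics
# per feature over all positions in the whole batch -- these statistics
# shift whenever batch composition or sequence lengths change.
bn = nn.BatchNorm1d(512)
print(bn(x.transpose(1, 2)).transpose(1, 2).shape)  # torch.Size([4, 10, 512])
```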
References:
- About Transformer: those whys