NLP generative models 2017: the "whys" of the Transformer
2022-07-06 00:23:00 【Ninja luantaro】
1. Briefly describe the feed-forward network in the Transformer. What activation function does it use, and what are its advantages and disadvantages?
The feed-forward network consists of two linear transformations with a ReLU activation between them; the formula is as follows:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
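A minimal NumPy sketch of this position-wise feed-forward block (the dimensions d_model = 512 and d_ff = 2048 follow the original paper; the variable names are illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear maps with a ReLU in between.
    x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(x W1 + b1)
    return hidden @ W2 + b2                 # second linear projection

# Toy dimensions; the original paper uses d_model = 512, d_ff = 2048.
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x  = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (10, 512): same shape as the input
```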
Advantages:
- SGD converges faster with ReLU than with sigmoid or tanh (the gradient does not saturate in the positive region, which mitigates vanishing gradients);
- Low computational cost, since no exponential operations are needed;
- Works well with backpropagation.
Disadvantages:
- The output of ReLU is not zero-centered;
- ReLU is "fragile" during training: a careless update can "kill" a neuron. Because the gradient of ReLU is 0 for x < 0, one large negative update can push a neuron's pre-activation below zero for every input, so it is never activated by any data again, its gradient stays 0, and its parameters are never updated (the "dead ReLU" problem). Typical causes are poor parameter initialization and a learning rate set too high, which produces overly large updates; with a very large learning rate a sizable fraction of the units in a network (reportedly up to 40%) can die, while a suitably small learning rate makes this rare. Remedies: use Xavier initialization, avoid an overly large learning rate, or use adaptive learning-rate methods such as Adagrad. A small demonstration follows this list.
- ReLU does not compress the magnitude of its inputs, so the range of the activations can keep growing as the number of layers increases.
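A minimal sketch of the dead-ReLU failure mode, assuming a single unit whose bias has been pushed far negative by one overly large update (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))     # a batch of inputs
w, b = rng.standard_normal(3), -50.0  # bias knocked far negative by a huge update

z = x @ w + b                         # pre-activation is negative for every sample
a = np.maximum(0.0, z)                # ReLU output: all zeros

# dReLU/dz is 0 wherever z <= 0, so no gradient reaches w regardless of the loss
# (upstream gradient taken as 1 per sample for simplicity).
grad_w = x.T @ (np.where(z > 0, 1.0, 0.0) * np.ones(100)) / 100

print(a.max(), grad_w)                # 0.0 and [0. 0. 0.]: the unit is "dead" and cannot recover
```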
2. Why add a Layer Normalization module?
Motivation: the Transformer stacks many layers, which makes it prone to vanishing or exploding gradients.
Reason: after the data passes through a network layer it is no longer normalized, and the deviation grows larger and larger, so the data has to be normalized again.
Purpose: before the data is fed into the activation function, normalization converts the inputs to zero mean and unit variance, which keeps them out of the activation function's saturation region and avoids the vanishing-gradient problem.
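A minimal sketch of this normalization step over the hidden dimension (gamma and beta are the learnable scale and shift; the names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's hidden vector to zero mean / unit variance,
    then rescale with gamma and shift with beta.
    x: (batch, seq_len, d_model); gamma, beta: (d_model,)."""
    mean = x.mean(axis=-1, keepdims=True)     # per-token mean over the hidden dim
    var  = x.var(axis=-1, keepdims=True)      # per-token variance over the hidden dim
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

x = np.random.default_rng(0).standard_normal((2, 5, 8))
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1).round(6))  # ~0 for every token
print(out.std(axis=-1).round(3))   # ~1 for every token
```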
3. Why does the Transformer block use LayerNorm rather than BatchNorm? Where does LayerNorm sit in the Transformer?
The goal of normalization is to stabilize the distribution (reduce the variance of the data in each dimension).
Another way to ask this question: why does batch normalization work well in image processing, while layer normalization works better in natural language processing?
LayerNorm normalizes over the hidden-state dimension, whereas BatchNorm normalizes over the batch dimension. In NLP tasks the batch size (and effective sequence length) is not constant the way it usually is in image tasks, so BatchNorm's statistics have large variance; LayerNorm alleviates this problem. In the original Transformer, LayerNorm is applied together with the residual connection after each sub-layer (the "Add & Norm" step, i.e. post-LN); many later variants instead place it before each sub-layer (pre-LN). A small comparison of the two normalization axes follows below.
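A small NumPy sketch of the difference in normalization axes on a (batch, seq_len, d_model) tensor (learnable scale/shift parameters omitted for brevity):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((4, 7, 16))  # (batch, seq_len, d_model)

# LayerNorm: statistics over the hidden dimension, independently for every token,
# so the result does not depend on the batch size or on the other sequences in the batch.
ln = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

# BatchNorm: statistics over the batch (and sequence) positions for each feature,
# so the estimates become noisy when batch size and sequence length vary.
bn = (x - x.mean((0, 1), keepdims=True)) / np.sqrt(x.var((0, 1), keepdims=True) + 1e-5)

print(ln.shape, bn.shape)  # both (4, 7, 16), but normalized over different axes
```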
Reference: "About Transformer: those whys".